smcameron / space-nerds-in-space

Multi-player spaceship bridge simulator game. Captain your starship through adventures with your friends. See https://smcameron.github.io/space-nerds-in-space
GNU General Public License v2.0
728 stars 73 forks source link

Need to squash all sources of NaNs in the code #236

Open smcameron opened 4 years ago

smcameron commented 4 years ago

Found a variety of NaNs popping up in the code, and there are still some more left.

See:

This patch can help root them out:

Author: Stephen M. Cameron <stephenmcameron@gmail.com>
Date:   Sat Nov 16 18:52:38 2019 -0500

    debug client nan

    Signed-off-by: Stephen M. Cameron <stephenmcameron@gmail.com>

diff --git a/snis_client.c b/snis_client.c
index 84d2377..8b1b65e 100644
--- a/snis_client.c
+++ b/snis_client.c
@@ -52,6 +52,7 @@
 #include <netinet/tcp.h>
 #include <netinet/in.h>
 #include <getopt.h>
+#include <fenv.h>

 #include "arraysize.h"
 #include "build_bug_on.h"
@@ -22846,6 +22847,8 @@ int main(int argc, char *argv[])
        GtkWidget *vbox;
        int i;

+       feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);
+
        refuse_to_run_as_root("snis_client");
        displaymode = DISPLAYMODE_NETWORK_SETUP;

diff --git a/snis_server.c b/snis_server.c
index a289bcc..732977a 100644
--- a/snis_server.c
+++ b/snis_server.c
@@ -51,6 +51,7 @@
 #include <locale.h>
 #include <dirent.h>
 #include <signal.h>
+#include <fenv.h>

 #include "arraysize.h"
 #include "container-of.h"
@@ -28824,6 +28825,8 @@ int main(int argc, char *argv[])
        int port, i;
        struct timespec thirtieth_second;

+       feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);
+
        refuse_to_run_as_root("snis_server");
        take_your_locale_and_shove_it();

One commit of interest that I noticed:

Would probably be better to fix the derelict generation code to not produce degenerate triangles.

smcameron commented 4 years ago

So many NaNs! How does this thing work so good with so many NaNs!

And one more that I haven't figured out yet in the main render loop

Core was generated by `.//bin/snis_client --lobbyhost localhost --fullscreen --trap-nans'.
Program terminated with signal SIGFPE, Arithmetic exception.
#0  0x0000000000447198 in mat33_inverse_transpose_ff (src=0x7ffc0312ea60, output=0x7ffc0312ecc0) at matrix.c:98
98      output->m[0][0] = (+(src->m[1][1] * src->m[2][2] - src->m[2][1] * src->m[1][2])) / determinant;
[Current thread is 1 (Thread 0x7fc5b33c7a00 (LWP 10964))]
(gdb) bt
#0  0x0000000000447198 in mat33_inverse_transpose_ff (src=0x7ffc0312ea60, output=0x7ffc0312ecc0) at matrix.c:98
#1  0x0000000000442cd2 in calculate_model_matrices (c=0x441c050, f=0x7ffc0312edf0, e=0x7fc594545968, transform=0x7ffc0312eb40) at entity.c:420
#2  0x000000000044379b in render_entity (cx=0x441c000, f=0x7ffc0312edf0, e=0x7fc594545968, camera_light_pos=0x7ffc0312ede0) at entity.c:533
#3  0x0000000000445a95 in render_entities (cx=0x441c000) at entity.c:1065
#4  0x000000000048687f in show_mainscreen (w=0x423b0a0) at snis_client.c:9391
#5  0x00000000004b450d in main_da_expose (w=0x423b0a0, event=0x7ffc0312f980, p=0x0) at snis_client.c:20431
#6  0x00007fc5b2cdaaec in ?? () from /usr/lib/x86_64-linux-gnu/libgtk-x11-2.0.so.0
#7  0x00007fc5b26affa5 in g_closure_invoke () from /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0
#8  0x00007fc5b26c1fc1 in ?? () from /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0
#9  0x00007fc5b26ca7f9 in g_signal_emit_valist () from /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0
#10 0x00007fc5b26cb08f in g_signal_emit () from /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0
#11 0x00007fc5b2df293c in ?? () from /usr/lib/x86_64-linux-gnu/libgtk-x11-2.0.so.0
#12 0x00007fc5b2cd97dd in gtk_main_do_event () from /usr/lib/x86_64-linux-gnu/libgtk-x11-2.0.so.0
#13 0x00007fc5b2934b9f in ?? () from /usr/lib/x86_64-linux-gnu/libgdk-x11-2.0.so.0
#14 0x00007fc5b2931671 in ?? () from /usr/lib/x86_64-linux-gnu/libgdk-x11-2.0.so.0
#15 0x00007fc5b2931fa8 in gdk_window_process_all_updates () from /usr/lib/x86_64-linux-gnu/libgdk-x11-2.0.so.0
#16 0x00007fc5b2932009 in ?? () from /usr/lib/x86_64-linux-gnu/libgdk-x11-2.0.so.0
#17 0x00007fc5b2910d57 in ?? () from /usr/lib/x86_64-linux-gnu/libgdk-x11-2.0.so.0
#18 0x00007fc5b23d904a in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#19 0x00007fc5b23d93f0 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#20 0x00007fc5b23d9712 in g_main_loop_run () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#21 0x00007fc5b2cd8697 in gtk_main () from /usr/lib/x86_64-linux-gnu/libgtk-x11-2.0.so.0
#22 0x00000000004bc563 in main (argc=5, argv=0x7ffc03130dc8) at snis_client.c:22977

Looks like a divide by zero (determinant).

#0  0x0000000000447198 in mat33_inverse_transpose_ff (src=0x7ffc0312ea60, output=0x7ffc0312ecc0) at matrix.c:98
98      output->m[0][0] = (+(src->m[1][1] * src->m[2][2] - src->m[2][1] * src->m[1][2])) / determinant;
(gdb) print src
$1 = (const struct mat33 *) 0x7ffc0312ea60
(gdb) print *src
$2 = {m = {{-2566.12329, -914.41803, 1575.97766}, {8.47752571, -13.4616375, 5.9929862}, {0, 0, 0}}}
(gdb) print *output
$3 = {m = {{-0.240327522, -0.191889003, -0.0884378403}, {-1.00906324, 1.24067652, 0.0501318052}, {0.312821895, 0.316523135, -1.53686523}}}
(gdb) print determinant
$4 = 0
(gdb) 
smcameron commented 4 years ago

The above seems to be attempting to invert a non-invertible matrix in order to calculate the Normal Matrix, which uh, is used to transform vertex normals, tangents and bitangents into eye space for normal mapping, for one thing... which that seems to be working.

It will taking some poking to figure this one out.

smcameron commented 4 years ago

This commit appears to break lighting on the weapons screen (maybe elsewhere too, but most visibly on the weapons screen): 4bb5e1868a55684283d5ba681cc6b409f59302e5

Correct lighting: correct-lighting

Incorrect lighting: incorrect-lighting

smcameron commented 4 years ago

f298a80fa5de379b9bea1069edffcb4448405573 revert avoid nans in process vertex normals

Reverted 4bb5e1868a55684283d5ba681cc6b409f59302e5

smcameron commented 4 years ago

@jv4779 A long shot, but... any idea why is this multiplying a vector by an angle? (only thing I can think of is that for small angles, sin(angle) ~= angle, but I don't think that applies here, and I think the angles aren't that small anyway.)

https://github.com/smcameron/space-nerds-in-space/blob/master/stl_parser.c#L415

Originally from this commit: 9d0f7f4be8af023f344be14622cd7e8524d76841

smcameron commented 4 years ago

Ok, I think I got rid of the NaNs from process_vertex_normals() with this: 82e0c9bdf154d63b0d70793269e73474ef12be62

smcameron commented 4 years ago
smcameron commented 4 years ago

This patch will avoid the divide by zero in mat33_inverse_transpose_ff(), but I'm not sure it does the right thing. I'm not quite sure why the matrix we're trying to inverse has no inverse. So not committing this yet.

diff --git a/matrix.c b/matrix.c
index 7ff4a88..3db633d 100644
--- a/matrix.c
+++ b/matrix.c
@@ -95,6 +95,9 @@ struct mat33 *mat33_inverse_transpose_ff(const struct mat33 *src, struct mat33 *
                - src->m[0][1] * (src->m[1][0] * src->m[2][2] - src->m[1][2] * src->m[2][0])
                + src->m[0][2] * (src->m[1][0] * src->m[2][1] - src->m[1][1] * src->m[2][0]);

+       if (determinant == 0)
+               return output; /* The matrix has no inverse */
+
        output->m[0][0] = (+(src->m[1][1] * src->m[2][2] - src->m[2][1] * src->m[1][2])) / determinant;
        output->m[0][1] = (-(src->m[1][0] * src->m[2][2] - src->m[2][0] * src->m[1][2])) / determinant;
        output->m[0][2] = (+(src->m[1][0] * src->m[2][1] - src->m[2][0] * src->m[1][1])) / determinant;
smcameron commented 4 years ago
smcameron commented 4 years ago

I think the following two fix the root cause of the NaNs in mat33_inverse_transpose_ff() -- some components of entity->scale could get set to zero. This could happen for sparks which shrink over time -- they could shrink all the way to a scale of zero. Also laserbeams had a z-scale of 0.

It might be the case that the first patch needs to check if any of the components of scale are zero rather than if all are zero, but that might have the effect of suppressing rendering of things which should be rendered (e.g. if I had done that, I wouldn't have noticed the problem with laser beams, they just would not be rendered at all.)

In general if we get an FPE in mat33_inverse_transpose_ff(), use gdb on the core file and do "up" until we're in render_entity(), then print *e, look at the scale, and look at the material_ptr. The material_ptr will usually give a clue about what kind of thing the entity is.

smcameron commented 4 years ago

At this point, I'm not aware of any more sources of NaNs. It will run for quite a awhile with "--trap-nans" option active (Haven't tried overnight though.)