Need to squash all sources of NaNs in the code

smcameron commented 4 years ago

Found a variety of NaNs popping up in the code, and there are still some more left.

See:

This patch can help root them out:

Author: Stephen M. Cameron <stephenmcameron@gmail.com>
Date:   Sat Nov 16 18:52:38 2019 -0500

    debug client nan

    Signed-off-by: Stephen M. Cameron <stephenmcameron@gmail.com>

diff --git a/snis_client.c b/snis_client.c
index 84d2377..8b1b65e 100644
--- a/snis_client.c
+++ b/snis_client.c
@@ -52,6 +52,7 @@
 #include <netinet/tcp.h>
 #include <netinet/in.h>
 #include <getopt.h>
+#include <fenv.h>

 #include "arraysize.h"
 #include "build_bug_on.h"
@@ -22846,6 +22847,8 @@ int main(int argc, char *argv[])
        GtkWidget *vbox;
        int i;

+       feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);
+
        refuse_to_run_as_root("snis_client");
        displaymode = DISPLAYMODE_NETWORK_SETUP;

diff --git a/snis_server.c b/snis_server.c
index a289bcc..732977a 100644
--- a/snis_server.c
+++ b/snis_server.c
@@ -51,6 +51,7 @@
 #include <locale.h>
 #include <dirent.h>
 #include <signal.h>
+#include <fenv.h>

 #include "arraysize.h"
 #include "container-of.h"
@@ -28824,6 +28825,8 @@ int main(int argc, char *argv[])
        int port, i;
        struct timespec thirtieth_second;

+       feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);
+
        refuse_to_run_as_root("snis_server");
        take_your_locale_and_shove_it();

One commit of interest that I noticed:

2264abf301cfae51df84958c3caf89541eb8c0c9 "derelict code is making some strange triangles which are producing normals that were NaN"

Would probably be better to fix the derelict generation code to not produce degenerate triangles.

smcameron commented 4 years ago

So many NaNs! How does this thing work so good with so many NaNs!

95c61edf939cafbe3481463489255727707d9787 Add --trap-nans option to allow aborting whenever NaNs are produced.
f0be089d376e43d7b506808cfc86541696d459a6 Avoid NaNs in compute_triangle_normal()
bddd260eb8a48544d41c62eb095ad472d43ab85c avoid NaNs in calculate_triangle_normal()
6c7925996b6173ed48ba39a50a53af2fa5a48070 Avoid NaNs in process_vertex_normals
feb5ebe9727078dc3aace89398900309d0d999a6 Fix uninitialized variable use in update_shading_planet()

And one more that I haven't figured out yet in the main render loop

Core was generated by `.//bin/snis_client --lobbyhost localhost --fullscreen --trap-nans'.
Program terminated with signal SIGFPE, Arithmetic exception.
#0  0x0000000000447198 in mat33_inverse_transpose_ff (src=0x7ffc0312ea60, output=0x7ffc0312ecc0) at matrix.c:98
98      output->m[0][0] = (+(src->m[1][1] * src->m[2][2] - src->m[2][1] * src->m[1][2])) / determinant;
[Current thread is 1 (Thread 0x7fc5b33c7a00 (LWP 10964))]
(gdb) bt
#0  0x0000000000447198 in mat33_inverse_transpose_ff (src=0x7ffc0312ea60, output=0x7ffc0312ecc0) at matrix.c:98
#1  0x0000000000442cd2 in calculate_model_matrices (c=0x441c050, f=0x7ffc0312edf0, e=0x7fc594545968, transform=0x7ffc0312eb40) at entity.c:420
#2  0x000000000044379b in render_entity (cx=0x441c000, f=0x7ffc0312edf0, e=0x7fc594545968, camera_light_pos=0x7ffc0312ede0) at entity.c:533
#3  0x0000000000445a95 in render_entities (cx=0x441c000) at entity.c:1065
#4  0x000000000048687f in show_mainscreen (w=0x423b0a0) at snis_client.c:9391
#5  0x00000000004b450d in main_da_expose (w=0x423b0a0, event=0x7ffc0312f980, p=0x0) at snis_client.c:20431
#6  0x00007fc5b2cdaaec in ?? () from /usr/lib/x86_64-linux-gnu/libgtk-x11-2.0.so.0
#7  0x00007fc5b26affa5 in g_closure_invoke () from /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0
#8  0x00007fc5b26c1fc1 in ?? () from /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0
#9  0x00007fc5b26ca7f9 in g_signal_emit_valist () from /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0
#10 0x00007fc5b26cb08f in g_signal_emit () from /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0
#11 0x00007fc5b2df293c in ?? () from /usr/lib/x86_64-linux-gnu/libgtk-x11-2.0.so.0
#12 0x00007fc5b2cd97dd in gtk_main_do_event () from /usr/lib/x86_64-linux-gnu/libgtk-x11-2.0.so.0
#13 0x00007fc5b2934b9f in ?? () from /usr/lib/x86_64-linux-gnu/libgdk-x11-2.0.so.0
#14 0x00007fc5b2931671 in ?? () from /usr/lib/x86_64-linux-gnu/libgdk-x11-2.0.so.0
#15 0x00007fc5b2931fa8 in gdk_window_process_all_updates () from /usr/lib/x86_64-linux-gnu/libgdk-x11-2.0.so.0
#16 0x00007fc5b2932009 in ?? () from /usr/lib/x86_64-linux-gnu/libgdk-x11-2.0.so.0
#17 0x00007fc5b2910d57 in ?? () from /usr/lib/x86_64-linux-gnu/libgdk-x11-2.0.so.0
#18 0x00007fc5b23d904a in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#19 0x00007fc5b23d93f0 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#20 0x00007fc5b23d9712 in g_main_loop_run () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#21 0x00007fc5b2cd8697 in gtk_main () from /usr/lib/x86_64-linux-gnu/libgtk-x11-2.0.so.0
#22 0x00000000004bc563 in main (argc=5, argv=0x7ffc03130dc8) at snis_client.c:22977

Looks like a divide by zero (determinant).

#0  0x0000000000447198 in mat33_inverse_transpose_ff (src=0x7ffc0312ea60, output=0x7ffc0312ecc0) at matrix.c:98
98      output->m[0][0] = (+(src->m[1][1] * src->m[2][2] - src->m[2][1] * src->m[1][2])) / determinant;
(gdb) print src
$1 = (const struct mat33 *) 0x7ffc0312ea60
(gdb) print *src
$2 = {m = {{-2566.12329, -914.41803, 1575.97766}, {8.47752571, -13.4616375, 5.9929862}, {0, 0, 0}}}
(gdb) print *output
$3 = {m = {{-0.240327522, -0.191889003, -0.0884378403}, {-1.00906324, 1.24067652, 0.0501318052}, {0.312821895, 0.316523135, -1.53686523}}}
(gdb) print determinant
$4 = 0
(gdb)

smcameron commented 4 years ago

The above seems to be attempting to invert a non-invertible matrix in order to calculate the Normal Matrix, which uh, is used to transform vertex normals, tangents and bitangents into eye space for normal mapping, for one thing... which that seems to be working.

It will taking some poking to figure this one out.

smcameron commented 4 years ago

This commit appears to break lighting on the weapons screen (maybe elsewhere too, but most visibly on the weapons screen): 4bb5e1868a55684283d5ba681cc6b409f59302e5

Correct lighting: correct-lighting

Incorrect lighting: incorrect-lighting

smcameron commented 4 years ago

f298a80fa5de379b9bea1069edffcb4448405573 revert avoid nans in process vertex normals

Reverted 4bb5e1868a55684283d5ba681cc6b409f59302e5

smcameron commented 4 years ago

@jv4779 A long shot, but... any idea why is this multiplying a vector by an angle? (only thing I can think of is that for small angles, sin(angle) ~= angle, but I don't think that applies here, and I think the angles aren't that small anyway.)

https://github.com/smcameron/space-nerds-in-space/blob/master/stl_parser.c#L415

Originally from this commit: 9d0f7f4be8af023f344be14622cd7e8524d76841

smcameron commented 4 years ago

Ok, I think I got rid of the NaNs from process_vertex_normals() with this: 82e0c9bdf154d63b0d70793269e73474ef12be62

smcameron commented 4 years ago

773d9c14574b0254fc1c12aca9f63dfe509e67b9 Fix divide by zero in damcon_robot_move()

smcameron commented 4 years ago

This patch will avoid the divide by zero in mat33_inverse_transpose_ff(), but I'm not sure it does the right thing. I'm not quite sure why the matrix we're trying to inverse has no inverse. So not committing this yet.

diff --git a/matrix.c b/matrix.c
index 7ff4a88..3db633d 100644
--- a/matrix.c
+++ b/matrix.c
@@ -95,6 +95,9 @@ struct mat33 *mat33_inverse_transpose_ff(const struct mat33 *src, struct mat33 *
                - src->m[0][1] * (src->m[1][0] * src->m[2][2] - src->m[1][2] * src->m[2][0])
                + src->m[0][2] * (src->m[1][0] * src->m[2][1] - src->m[1][1] * src->m[2][0]);

+       if (determinant == 0)
+               return output; /* The matrix has no inverse */
+
        output->m[0][0] = (+(src->m[1][1] * src->m[2][2] - src->m[2][1] * src->m[1][2])) / determinant;
        output->m[0][1] = (-(src->m[1][0] * src->m[2][2] - src->m[2][0] * src->m[1][2])) / determinant;
        output->m[0][2] = (+(src->m[1][0] * src->m[2][1] - src->m[2][0] * src->m[1][1])) / determinant;

smcameron commented 4 years ago

801331dbb20c73bd7137a7c45fe97a39790fbaef Avoid divide by zero in render_entities() when culling small objects

smcameron commented 4 years ago

c5fa0d3a5a65aeb846325ce2a054fb1da7150775 Do not use model normal matrix in graph_dev_draw_nebula()

I think the following two fix the root cause of the NaNs in mat33_inverse_transpose_ff() -- some components of entity->scale could get set to zero. This could happen for sparks which shrink over time -- they could shrink all the way to a scale of zero. Also laserbeams had a z-scale of 0.

e78c2c6ea011a144896d595c6412ee022b77c333 Do not render entities with scale of zero
f290283d221e9b4eb3041ace3add717920cf0def Do not set z-scale of laser beam objects to zero

It might be the case that the first patch needs to check if any of the components of scale are zero rather than if all are zero, but that might have the effect of suppressing rendering of things which should be rendered (e.g. if I had done that, I wouldn't have noticed the problem with laser beams, they just would not be rendered at all.)

In general if we get an FPE in mat33_inverse_transpose_ff(), use gdb on the core file and do "up" until we're in render_entity(), then print *e, look at the scale, and look at the material_ptr. The material_ptr will usually give a clue about what kind of thing the entity is.

smcameron commented 4 years ago

At this point, I'm not aware of any more sources of NaNs. It will run for quite a awhile with "--trap-nans" option active (Haven't tried overnight though.)

smcameron / space-nerds-in-space

Need to squash all sources of NaNs in the code #236