smcameron / space-nerds-in-space

Multi-player spaceship bridge simulator game. Captain your starship through adventures with your friends. See https://smcameron.github.io/space-nerds-in-space
GNU General Public License v2.0
724 stars 72 forks source link

ERROR: CANNOT FIND POINTER TO PLAYER SHIP, SORRY! (Was "Black screen on connection to server occasionally") #135

Closed smcameron closed 5 years ago

smcameron commented 6 years ago

When running with a real network with around 5 clients on a wired LAN, occasionally after successfully connecting to the lobby and then attempting to connect to snis_server, the client would just show a black screen. The keys would still operate, (e.g. you could switch between screens with the function keys, and quit with escape, arrow left and enter) but you just couldn't see anything. I am not sure if anything requiring a network connection would work (e.g. switching to nav and yawing the ship, for example. I should try that next time I encounter the problem.) Restarting the client (quitting, then restarting via snis_launcher script) would sometimes fix the problem, although sometimes it would take several tries. I saw this on my laptop, and others saw it on theirs as well.

Don't really have any good explanation for it.

smcameron commented 6 years ago

I believe I have duplicated this once on my laptop with nothing other than localhost in the way of a network. I suspect it may have to do with a client coming up without the ROLE_MAINSCREEN, but with the displaymode == DISPLAYMODE_MAINSCREEN, but this is only a suspicion at this point. I'm not yet able to reliably reproduce it (have only reproduced it once, and I'm not quite sure how, but it involved bringing up two clients one with ROLE_MAINSCREEN and one without.)

smcameron commented 6 years ago

Reproduced it again, and was able to attach with gdb.

(gdb) info threads
  Id   Target Id         Frame 
  6    Thread 0x7f73456fe700 (LWP 4630) "threaded-ml" 0x00007f7353ea7c9d in poll () at ../sysdeps/unix/syscall-template.S:81
  5    Thread 0x7f73494d0700 (LWP 4631) "snis_client" 0x00007f7353ea7c9d in poll () at ../sysdeps/unix/syscall-template.S:81
  4    Thread 0x7f7346132700 (LWP 4632) "snis-nat-lang" 0x00007f7353ea618d in open64 () at ../sysdeps/unix/syscall-template.S:81
  3    Thread 0x7f7344efd700 (LWP 4633) "snisc-tts" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  2    Thread 0x7f7313fff700 (LWP 4650) "snisc-writer" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
* 1    Thread 0x7f73568a8a40 (LWP 4623) "snis_client" 0x00007f7353ea7c9d in poll () at ../sysdeps/unix/syscall-template.S:81
(gdb) bt
#0  0x00007f7353ea7c9d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f73558d3fe4 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007f73558d430a in g_main_loop_run ()
   from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00007f73561c2447 in gtk_main ()
   from /usr/lib/x86_64-linux-gnu/libgtk-x11-2.0.so.0
#4  0x0000000000412341 in main ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f7313fff700 (LWP 4650))]
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185 ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000000000044f929 in gameserver_writer ()
#2  0x00007f7354390184 in start_thread (arg=0x7f7313fff700)
    at pthread_create.c:312
#3  0x00007f7353eb503d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f7344efd700 (LWP 4633))]
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185 in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000000453011 in text_to_speech_thread_fn ()
#2  0x00007f7354390184 in start_thread (arg=0x7f7344efd700)
    at pthread_create.c:312
#3  0x00007f7353eb503d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) thread 4
[Switching to thread 4 (Thread 0x7f7346132700 (LWP 4632))]
#0  0x00007f7353ea618d in open64 () at ../sysdeps/unix/syscall-template.S:81
81  ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007f7353ea618d in open64 () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f7353e30ed8 in _IO_file_open (is32not64=8, read_write=8, prot=438, 
    posix_mode=<optimized out>, filename=0x7f7346131aa0 "\260uL", 
    fp=0x7f732c0008c0) at fileops.c:228
#2  _IO_new_file_fopen (fp=fp@entry=0x7f732c0008c0, 
    filename=filename@entry=0x4c75b0 "/tmp/snis-natural-language-fifo", 
    mode=<optimized out>, mode@entry=0x491d9f "r", is32not64=is32not64@entry=1)
    at fileops.c:333
#3  0x00007f7353e253d4 in __fopen_internal (
    filename=0x4c75b0 "/tmp/snis-natural-language-fifo", mode=0x491d9f "r", 
    is32=1) at iofopen.c:90
#4  0x000000000045b9f7 in monitor_natural_language_fifo ()
#5  0x00007f7354390184 in start_thread (arg=0x7f7346132700)
    at pthread_create.c:312
#6  0x00007f7353eb503d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) thread 5
[Switching to thread 5 (Thread 0x7f73494d0700 (LWP 4631))]
#0  0x00007f7353ea7c9d in poll () at ../sysdeps/unix/syscall-template.S:81
81  ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007f7353ea7c9d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f73545b89b8 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libportaudio.so.2
#2  0x00007f73545b95e0 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libportaudio.so.2
#3  0x00007f7354390184 in start_thread (arg=0x7f73494d0700)
    at pthread_create.c:312
#4  0x00007f7353eb503d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) thread 6
[Switching to thread 6 (Thread 0x7f73456fe700 (LWP 4630))]
#0  0x00007f7353ea7c9d in poll () at ../sysdeps/unix/syscall-template.S:81
81  in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x00007f7353ea7c9d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f734b5a1031 in ?? () from /usr/lib/x86_64-linux-gnu/libpulse.so.0
#2  0x00007f734b59283c in pa_mainloop_poll ()
   from /usr/lib/x86_64-linux-gnu/libpulse.so.0
#3  0x00007f734b592ece in pa_mainloop_iterate ()
   from /usr/lib/x86_64-linux-gnu/libpulse.so.0
#4  0x00007f734b592f80 in pa_mainloop_run ()
   from /usr/lib/x86_64-linux-gnu/libpulse.so.0
#5  0x00007f734b5a0fe3 in ?? () from /usr/lib/x86_64-linux-gnu/libpulse.so.0
#6  0x00007f734b144f08 in ?? ()
   from /usr/lib/x86_64-linux-gnu/pulseaudio/libpulsecommon-4.0.so
#7  0x00007f7354390184 in start_thread (arg=0x7f73456fe700)
    at pthread_create.c:312
#8  0x00007f7353eb503d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) 

It does NOT appear to be a case of displaymode == DISPLAYMODE_MAINSCREEN but role not containing ROLE_MAIN.

(gdb) print role
$1 = 1534
(gdb) print displaymode
$2 = 1
#define DISPLAYMODE_MAINSCREEN 0
#define ROLE_MAIN       (1 << DISPLAYMODE_MAINSCREEN)
#define ROLE_MAIN       (1 << DISPLAYMODE_MAINSCREEN)

That is to say, role is 1534 == 0x5FE, which is an even number, the low bit is not set, which means role does not contain ROLE_MAIN, which is 0x01, and displaymode == 1, which is DISPLAYMODE_NAVIGATION.

ROLE_NAVIGATION     (1 << DISPLAYMODE_NAVIGATION)

and 0x5FE contains 0x010 so role and displaymode are consistent with one another.

smcameron commented 6 years ago

Ended up with 3 defunct snis_client processes.

sirius space-nerds-in-space # ps aux | grep snis_client
scameron  4320  1.5  0.0      0     0 pts/0    Z+   11:37   0:21 [snis_client] <defunct>
scameron  4538  1.1  0.0      0     0 pts/0    Z+   11:39   0:14 [snis_client] <defunct>
scameron  4623  1.0  0.0      0     0 pts/0    Z+   11:40   0:13 [snis_client] <defunct>
root      5191  0.0  0.0  11760  2140 pts/2    S+   12:00   0:00 grep --colour=auto snis_client

Very mysterious.

Edit: Duh, the defunct clients are due to snis_launcher still running. When I quit that, the defunct snis_clients disappeared. snis_launcher should probably have some mechanism to wait for subprocesses, or else detach them.

smcameron commented 6 years ago

Here is a big clue.

When th game is experiencing the "black screen" problem, it is calling advance_game() at 30Hz as it is supposed to (I put a printf in there to verify.)

When you quit from the black screen (by typing ESC then LEFT ARROW then ENTER), only then does it stop calling advance game and then afterwards it initializes GLEW. WTF? In the output below, "Initialized GLEW" is after I selected quit. So something is getting stuck and running out of the intended sequence.

Advance game, timer = 8823
Advance game, timer = 8824
Advance game, timer = 8825
Advance game, timer = 8826
Advance game, timer = 8827
Advance game, timer = 8828
Advance game, timer = 8829
Advance game, timer = 8830
Advance game, timer = 8831
Advance game, timer = 8832
Advance game, timer = 8833
Advance game, timer = 8834
Initialized GLEW
OpenGL 3.0 available
sRGB texture supported
reading 'asteroid.stl'
zzz radius 0 = 35.870609
reading 'asteroid2.stl'
zzz radius 1 = 43.860279
reading 'asteroid3.stl'
zzz radius 2 = 51.153412
reading 'asteroid4.stl'
zzz radius 3 = 50.523045
reading starbase model 1 of 6 'starbase/starbase.obj'
parsing material library: mtllib starbase.mtl
Opening ./share/snis/models/starbase//starbase.mtl
ignoring unknown data '4':o Starbase
ignoring usemtl: usemtl Material.001
ignoring unknown data '2882':s 1
ignoring usemtl: usemtl Material.001_NONE
reading starbase model 2 of 6 'starbase2/starbase2.obj'
parsing material library: mtllib starbase2.mtl
Opening ./share/snis/models/starbase2//starbase2.mtl
ignoring unknown data '4':o Starbase2
ignoring usemtl: usemtl Material.001
ignoring unknown data '890':s 1
reading starbase model 3 of 6 'starbase3.stl'
reading starbase model 4 of 6 'starbase4.stl'
reading starbase model 5 of 6 'starbase5.stl'
reading starbase model 6 of 6 'starbase6.stl'
parsing material library: mtllib disruptor.mtl
Opening ./share/snis/models/disruptor//disruptor.mtl
ignoring unknown data '4':o Disruptor
ignoring usemtl: usemtl Material
ignoring unknown data '1788':s 1
parsing material library: mtllib research-vessel.mtl
Opening ./share/snis/models/research-vessel//research-vessel.mtl
ignoring unknown data '4':o Research-Vessel
ignoring usemtl: usemtl Material
ignoring unknown data '7839':s 1
parsing material library: mtllib conqueror.mtl
Opening ./share/snis/models/conqueror//conqueror.mtl
ignoring unknown data '4':o Conqueror
ignoring usemtl: usemtl Material.001
ignoring unknown data '1335':s 1
parsing material library: mtllib snis3006.mtl
Opening ./share/snis/models/wombat//snis3006.mtl
ignoring unknown data '4':o Wombat.001_Wombat.000
ignoring usemtl: usemtl Paint.001
ignoring unknown data '9359':s 1
parsing material library: mtllib dreadknight.mtl
Opening ./share/snis/models/dreadknight//dreadknight.mtl
ignoring unknown data '4':o Dreadknight
ignoring usemtl: usemtl Material.001
ignoring unknown data '1495':s 1
parsing material library: mtllib enforcer.mtl
Opening ./share/snis/models/enforcer//enforcer.mtl
ignoring unknown data '4':o Enforcer
ignoring usemtl: usemtl Material
ignoring unknown data '18467':s 1
triangle 828, vertices 0 and 1 are welded together 0x712a598.
triangle 1134, vertices 0 and 2 are welded together 0x712a598.
parsing material library: mtllib cargocontainer.mtl
Opening ./share/snis/models/cargocontainer//cargocontainer.mtl
ignoring unknown data '4':o Cargocontainer
ignoring usemtl: usemtl Material.001
ignoring unknown data '456':s 1
snis_client: connect to lobby pressed
lobbyname = 'localhost'
shipname = 'REVENGE'
password = 'dpr'
snis_client: connecting to lobby...
Trying to connect to lobby.
snis_client: connecting to lobby
snis_client: Connected to lobby
lobby connect to server button pressed
Notifying server, opcode update player
Wrote update player opcode
starting gameserver reader thread
snis_client: gameserver reader thread
started gameserver reader thread
starting gameserver writer thread
started gameserver writer thread
update_universe_timestamp i=0, update_time=1529170894.332243, timestamp=1201, time_delta=0.039809 offset=1529170774.192434
SET MY SHIP ID to 681
Saving default shipname as 'REVENGE'
snis_client: Protocol error in gameserver reader, opcode = 0
snis_client: : Last buffer length read from socket 15 was 1
snis_client: : 00 
snis_client: last opcode was 127, before that 120
snis_client: total successful opcodes = 117
lobby socket = 14, done with lobby
8798 frames / 301 seconds, 29.2292 frames/sec
server netstats: 0 bytes sent, 0 bytes recd, secs = 1, bw = 0 bytes/sec
smcameron commented 6 years ago

Hmm, no longer sure about the above "big clue". I tried again, but this time looked into /proc/pid/fd/2 which pointed to snis_client_log.txt, and got this on quitting:

Advance game, timer = 3864
Advance game, timer = 3865
Advance game, timer = 3866
Advance game, timer = 3867
Advance game, timer = 3868
Advance game, timer = 3869
Advance game, timer = 3870
snis_client: Protocol error in gameserver reader, opcode = 0
snis_client: : Last buffer length read from socket 15 was 1
snis_client: : 00 
snis_client: last opcode was 127, before that 120
snis_client: total successful opcodes = 230
lobby socket = 14, done with lobby
3832 frames / 137 seconds, 27.9708 frames/sec
server netstats: 0 bytes sent, 0 bytes recd, secs = 1, bw = 0 bytes/sec

Also, init_gl() must have gone successfully because we got through the lobby screen, and that uses graphics.

smcameron commented 6 years ago

Verified I can change the displaymode with F-keys on a "black screen" process by printing it out in advance_game(). Also verified main_da_expose() is getting called at 30Hz.

Advance game, timer = 7421, displaymode = 4, role & 1 << displaymode = 10
main_da_expose
Advance game, timer = 7422, displaymode = 4, role & 1 << displaymode = 10
main_da_expose
Advance game, timer = 7423, displaymode = 4, role & 1 << displaymode = 10
main_da_expose
Advance game, timer = 7424, displaymode = 4, role & 1 << displaymode = 10
main_da_expose
Advance game, timer = 7425, displaymode = 4, role & 1 << displaymode = 10
main_da_expose
Advance game, timer = 7426, displaymode = 4, role & 1 << displaymode = 10
snis_client: Protocol error in gameserver reader, opcode = 0
snis_client: : Last buffer length read from socket 15 was 1
snis_client: : 00 
snis_client: last opcode was 127, before that 120
snis_client: total successful opcodes = 234
lobby socket = 13, done with lobby
7411 frames / 257 seconds, 28.8366 frames/sec
server netstats: 0 bytes sent, 0 bytes recd, secs = 1, bw = 0 bytes/sec
smcameron commented 6 years ago

While debugging, I added a delay on the CONNECTED screen, which is normally never seen, so that it is seen for 10 frames. When I reproduced the "black screen" problem, I saw the CONNECTED screen, and then very briefly (1 frame or less) I saw the Nav screen, then the black screen. I have noticed that it seems that I have to run quite a few clients on the laptop to reproduce. The first one never has the problem, and it usually takes 4 to 6 clients before the problem appears.

CONNECTED screen debug hack was this:

@@ -7315,7 +7320,7 @@ static void *connect_to_gameserver_thread(__attribute__((unused)) void *arg)

        displaymode = DISPLAYMODE_CONNECTED;
        done_with_lobby = 1;
-       displaymode = role_to_displaymode(role);
+       // displaymode = role_to_displaymode(role);

        /* Should probably submit this through the packed buffer queue...
         * but, this works.
@@ -7395,9 +7400,17 @@ static void show_connecting_screen(GtkWidget *w)

 static void show_connected_screen(GtkWidget *w)
 {
+       static int countdown = 10;
        sng_set_foreground(UI_COLOR(lobby_connecting));
        sng_abs_xy_draw_string("CONNECTED TO SERVER", SMALL_FONT, txx(100), txy(300) + LINEHEIGHT);
        sng_abs_xy_draw_string("DOWNLOADING GAME DATA", SMALL_FONT, txx(100), txy(300) + LINEHEIGHT * 3);
+       if (done_with_lobby) {
+               countdown--;
+               if (countdown == 0)
+                       displaymode = role_to_displaymode(role);
+       } else {
+               countdown = 10;
+       }
 }
smcameron commented 6 years ago

Aha! It's this, in main_da_expose() in snis_client.c ...

17567         if (displaymode < DISPLAYMODE_FONTTEST) {
17568                 if (!(o = find_my_ship())) {
17569                         fprintf(stderr, "find_my_ship() returned NULL!\n");
17570                         goto end_of_drawing;
17571                 

The main drawing function is unable to find the pointer to the player's ship for some reason.

smcameron commented 6 years ago

Mitigated slightly by de348a2599e8ba9b2dcacb9e5052993b98d27322

smcameron commented 6 years ago

The server is sending the correct ship id and the client is receiving it via the OPCODE_ID_CLIENT_SHIP opcode. However, the server does not send all the objects over because:

snis_server(sirius):: calling add new player
snis_server(sirius):: reading update player packet
snis_server(sirius):: read update player packet, rc = 0
snis_server(sirius):: new client: sn='REVENGE', pw='dpr', create = 0
snis_server(sirius):: c->bridge = 0
snis_server(sirius):: join existing ship
snis_server(sirius):: queue client id 717
server: -----> sending client id 717
snis_server: client refcount = 2
client reader thread exiting, c->socket = -1
snis_server: client refcount = 1
client writer thread exiting, c->socket = -1
snis_server: client refcount = 0
snis_server: client count of bridge 0 = 6
snis_server: calling remove_client 6
snis_server: remove_client 6 returned

The client reader and client writer threads in the server exit for that client because c->socket is -1. Not sure why it's -1 though, considering it does successfully send the ship id to the client.

smcameron commented 6 years ago

Ok.... connection reset by peer. But why?

server: -----> sending client id 633
snis_server: client refcount = 2
snis_server(sirius):: read from multiverse, rc = 0, sock = 7
snis_server(sirius):: reading from multiverse sock = 7...
snis_readsocket: Connection reset by peer
client reader thread exiting, c->socket = -1
snis_server: client refcount = 1
client writer thread exiting, c->socket = -1
snis_server: client refcount = 0
snis_server: client count of bridge 0 = 2
snis_server: calling remove_client 2
snis_server: remove_client 2 returned
smcameron commented 6 years ago

Confirmed with gdb that the snis_client's gameserver_sock is indeed -1, and the snisc-reader thread is nowhere to be found (snisc-writer is there though.)

But I am not seeing anywhere that the gameserver reader thread exited. Maybe it never got created? Pthread create failed perhaps? No, we should catch that.

 7348         printf("starting gameserver reader thread\n");
 7349         rc = create_thread(&read_from_gameserver_thread, gameserver_reader, NULL, "snisc-reader", 1);
 7350         if (rc) {
 7351                 fprintf(stderr, "Failed to create gameserver reader thread: %d '%s', '%s'\n",
 7352                         rc, strerror(rc), strerror(errno));
 7353         }
 7354         printf("started gameserver reader thread\n");

Ok... seems to be a protocol error (which I should have noticed in comment 6, above)

SET MY SHIP ID to 667
Saving default shipname as 'REVENGE'
SET SOLARSYSTEM TO 'sirius'
Reading solarsystem specifications...done
snis_client: Protocol error in gameserver reader, opcode = 0
snis_client: last opcode was 127, before that 120
snis_client: total successful opcodes = 230
snis_writesocket: Bad file descriptor
snis_text_to_speech.sh crew member has joined.
Re-loading per solarsystem textures
smcameron commented 6 years ago
snis_client: Protocol error in gameserver reader, opcode = 0
snis_client: : Last buffer length read from socket 15 was 1
snis_client: : 00
snis_client: last opcode was 127, before that 120
snis_client: total successful opcodes = 230

So there's a stray 0 byte coming from somewhere? But only sometimes? wtf?

smcameron commented 6 years ago

I think I may have found a big part of the problem.

Consider the following change. Note that without this change, send_packet_to_all_clients_on_a_bridge() is getting called from within functions that are already per client functions. This means these function produce traffic that unintentionally scales as the square of the number of clients, which is not good at all.

@@ -13136,9 +13136,7 @@ static void pack_and_send_ship_sdata_packet(struct game_client *c, struct snis_e
                p.lifeform_count = o->tsd.starbase.lifeform_count;
        else
                p.lifeform_count = 0;
-       pthread_mutex_unlock(&universe_mutex);
        send_ship_sdata_packet(c, &p);
-       pthread_mutex_lock(&universe_mutex);
 }

 static void send_update_ship_cargo_info(struct game_client *c, struct snis_entity *o)
@@ -13179,9 +13177,7 @@ static void send_update_ship_cargo_info(struct game_client *c, struct snis_entit
                qty = cbc->qty;
                packed_buffer_append(pb, "bwS", (uint8_t) i, item, qty, (int32_t) 1000000);
        }
-       pthread_mutex_unlock(&universe_mutex);
-       send_packet_to_all_clients_on_a_bridge(c->shipid, pb, ROLE_SCIENCE);
-       pthread_mutex_lock(&universe_mutex);
+       pb_queue_to_client(c, pb);
 }

 static int save_sdata_bandwidth(void)
@@ -19675,7 +19671,7 @@ static void send_ship_sdata_packet(struct game_client *c, struct ship_sdata_pack
                sip->shield_strength, sip->shield_wavelength, sip->shield_width, sip->shield_depth,
                sip->faction, sip->lifeform_count,
                sip->name, (unsigned short) sizeof(sip->name));
-       send_packet_to_all_clients_on_a_bridge(c->shipid, pb, ROLE_ALL);
+       pb_queue_to_client(c, pb);
 }

 static void send_generic_ship_damage_packet(struct snis_entity *o, uint8_t opcode)

With this patch, I can run 10 clients on my laptop without bumping into the connection reset problem (which I've not previously been able to do.) Need to verify this doesn't break anything, and I don't know if it fixes the connection reset problem or merely avoids it.

Here's a screen shot of 10 clients running just for fun.

screenshot from 2018-06-16 22 30 17

smcameron commented 6 years ago

Committed 2d595e2a4dae5ecb442e8370ec7f67c32427111d

Was able to run 16 clients plus server on a single laptop and they all behaved responsively and didn't run into the peer reset connection issue.

smcameron commented 6 years ago

I do not think I have found the root cause on this, I think I've just made it much less likely to occur. I can't explain the protocol error. So I am going to leave this bug open, but I suspect it will very rarely be seen, except with relatively large numbers of clients.

The symptoms of this bug are now different though. Now, instead of a black screen, you should see: "ERROR: CANNOT FIND POINTER TO PLAYER SHIP, SORRY!" on the client. The client will have terminated the connection due to having received an invalid packet from the server (a single zero byte, in cases I've seen.) I don't know where this malformed packet is coming from within the server.

smcameron commented 6 years ago

Re-opened. Didn't mean to close it.

smcameron commented 6 years ago

It occurs to me that without change 2d595e2a4dae5ecb442e8370ec7f67c32427111d there was this very questionable mutex manipulation.

    pthread_mutex_unlock(&universe_mutex);
    send_packet_to_all_clients_on_a_bridge(c->shipid, pb, ROLE_SCIENCE);
    pthread_mutex_lock(&universe_mutex);

unlocking, then locking the universe_mutex here is a mistake. I think I must have gotten confused and thought that because send_packet_to_all_clients_on_a_bridge() takes the client lock, that I needed to release the universe lock to avoid a deadlock. However, in cases where the code takes both locks, the universe_mutex is taken first, then the client lock. So I shouldn't have needed to release the universe lock there, and doing so may have allowed a race condition which may be the source of that stray zero. I don't have a concrete theory about how that specific outcome could occur, but that unlock definitely looks incorrect.

smcameron commented 5 years ago

Haven't seen this issue occur in almost a year now. I'm going to consider it fixed.