netcreateorg / netcreate-2018

Please report bugs, problems, ideas in the project Issues page: https://github.com/netcreateorg/netcreate-2018/issues

Crash in NC server #143

Open jdanish opened 3 years ago

jdanish commented 3 years ago

I'm not sure of the cause, but we saw that net.create went down. We hit stop and then start, and it was fine. I don't see anything in the log to suggest it even went down, but I do see this in the console in DO. We can see if it happens again, but I'm hoping this will give you an idea.

[Screenshot: Screen Shot 2020-09-21 at 4 25 43 PM]

jdanish commented 3 years ago

The content from the NC log right around this time:

21:01:21 H213FA20201348 UADDR_06 [H213-1348-DL8] select node 39 Religious Mentalities
21:01:33 H213FA20201348 UADDR_06 [H213-1348-DL8] select node 40 The Plague Psyche
21:01:36 H213FA20201348 UADDR_06 [H213-1348-DL8] select node 11 Gentile da Foligno
21:01:44 H213FA20201348 UADDR_06 [H213-1348-DL8] select node 8 Medical Faculty of the University of Paris
21:02:36 H213FA20201348 UADDR_06 [H213-1348-DL8] select node 53 Padua
21:02:38 H213FA20201348 UADDR_06 [H213-1348-DL8] select node 93 Perguia
21:02:39 H213FA20201348 UADDR_06 [H213-1348-DL8] select node 91 Practical Experience
21:02:44 H213FA20201348 UADDR_06 [H213-1348-DL8] select node 11 Gentile da Foligno
21:02:47 H213FA20201348 UADDR_06 [H213-1348-DL8] select node 91 Practical Experience
21:05:49 H213FA20201348 UADDR_08 left network
21:05:49 H213FA20201348 UADDR_09 joined network
21:05:49 H213FA20201348 UADDR_09 getdatabase
21:05:56 H213FA20201348 SRV-NET - UADDR_08 pong not received before time ran out -- CLIENT CONNECTION DEAD!
21:06:03 H213FA20201348 UADDR_09 left network
21:06:03 H213FA20201348 UADDR_10 joined network
21:06:03 H213FA20201348 UADDR_10 getdatabase
21:06:11 H213FA20201348 SRV-NET - UADDR_09 pong not received before time ran out -- CLIENT CONNECTION DEAD!
21:06:12 H213FA20201348 UADDR_10 select node 95 Funeral progression

jdanish commented 3 years ago

2020-09212020-0921-log-201859.txt

log.txt

benloh commented 3 years ago

It's hard to say what happened, but it does kind of look like there was a rapid series of disconnects and reconnects. It's probably significant that UADDR_08 is leaving at 21:05:49 and UADDR_09 is joining at the same second.
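For anyone wanting to spot this pattern in a longer log, here is a rough sketch that scans log entries for seconds in which one client left and another joined. It is not part of NetCreate: the function name `findSameSecondSwaps` is hypothetical, and the line format is assumed from the excerpt above (`HH:MM:SS <session> UADDR_nn <message>`).

```javascript
// Hypothetical helper (not part of NetCreate): find seconds in which
// at least one client left and at least one client joined.
// Assumes entries look like "HH:MM:SS <session> UADDR_nn left network".
function findSameSecondSwaps(logLines) {
  const entryPattern =
    /^(\d{2}:\d{2}:\d{2}) \S+ (UADDR_\d+) (left network|joined network)$/;
  const bySecond = new Map(); // timestamp -> { left: [], joined: [] }

  for (const line of logLines) {
    const match = entryPattern.exec(line.trim());
    if (!match) continue; // skip select, getdatabase, and SRV-NET entries
    const [, time, uaddr, action] = match;
    if (!bySecond.has(time)) bySecond.set(time, { left: [], joined: [] });
    const bucket = bySecond.get(time);
    (action === 'left network' ? bucket.left : bucket.joined).push(uaddr);
  }

  // Keep only the seconds that contain both a leave and a join.
  const swaps = [];
  for (const [time, { left, joined }] of bySecond) {
    if (left.length > 0 && joined.length > 0) swaps.push({ time, left, joined });
  }
  return swaps;
}

// Sample entries copied from the log excerpt in this thread.
const sampleLog = [
  '21:02:47 H213FA20201348 UADDR_06 [H213-1348-DL8] select node 91 Practical Experience',
  '21:05:49 H213FA20201348 UADDR_08 left network',
  '21:05:49 H213FA20201348 UADDR_09 joined network',
  '21:05:49 H213FA20201348 UADDR_09 getdatabase',
  '21:05:56 H213FA20201348 SRV-NET - UADDR_08 pong not received before time ran out -- CLIENT CONNECTION DEAD!',
  '21:06:03 H213FA20201348 UADDR_09 left network',
  '21:06:03 H213FA20201348 UADDR_10 joined network',
];

const swaps = findSameSecondSwaps(sampleLog);
console.log(swaps);
```

On the sample entries this reports two such seconds, 21:05:49 and 21:06:03, which matches the disconnect/reconnect pairs visible in the log.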

When you say "net.create went down" what do you mean? One particular instance was dead? All instances were dead? The nc-multiplex manager was dead?

What might be happening is that the quick disconnect and reconnect means that we're trying to send a websocket ping/pong message before the connection is fully closed.

A proper solution and further testing should probably involve more logging. But in the meantime, I've pushed an update to the dev-bl/ws-hardening branch. This checks to make sure the socket is OPEN before sending the ping. The DO console error suggests this might be the culprit.
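The general shape of that kind of guard can be sketched as follows. This is only an illustration, not the code on the dev-bl/ws-hardening branch: `safePing` and `makeFakeSocket` are hypothetical names, and the fake socket stands in for a real server-side WebSocket so the snippet runs without dependencies. The `readyState` values are the ones defined by the WebSocket API standard.

```javascript
// readyState values as defined by the WebSocket API standard.
const READY_STATE = { CONNECTING: 0, OPEN: 1, CLOSING: 2, CLOSED: 3 };

// Hypothetical helper: send a ping only if the socket is OPEN.
// Returns true if a ping was sent, false if it was skipped or failed.
function safePing(socket) {
  if (socket.readyState !== READY_STATE.OPEN) {
    return false; // socket is connecting, closing, or closed: do not ping
  }
  try {
    socket.ping();
    return true;
  } catch (err) {
    // Defensive: a failed ping should not take down the server process.
    return false;
  }
}

// Stand-in for a real socket (assumption: the real object exposes
// readyState and ping()). It throws when pinged while not OPEN.
function makeFakeSocket(readyState) {
  return {
    readyState,
    pingsSent: 0,
    ping() {
      if (this.readyState !== READY_STATE.OPEN) {
        throw new Error('socket is not open');
      }
      this.pingsSent += 1;
    },
  };
}

const openSocket = makeFakeSocket(READY_STATE.OPEN);
const closingSocket = makeFakeSocket(READY_STATE.CLOSING);

console.log(safePing(openSocket));    // true
console.log(safePing(closingSocket)); // false
```

The point of the guard is that a ping attempted on a closing socket is skipped instead of raising an uncaught error inside the heartbeat loop.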

@jdanish Please try it out and let me know how it goes. I haven't had a chance to do extensive testing yet, so please definitely give it a good whirl on your local machine before deploying it to students.

kalanicraig commented 3 years ago

In this case, the network went down and refused connections from all subsequent users but the multiplex still saw it as up and running. A start/stop cycle using multiplex brings it back up.

benloh commented 3 years ago

Note to self: Check heartbeat operation on the static file server port 3000 -- Is that the one that died? And if so, why? No one should be connecting to it.

benloh commented 3 years ago

@kalanicraig Thanks for the clarification. That puts the blame on netcreate and not nc-multiplex. Hopefully my fix will address that.