slact / nchan

Fast, horizontally scalable, multiprocess pub/sub queuing server and proxy for HTTP, long-polling, Websockets and EventSource (SSE), powered by Nginx.
https://nchan.io/

nchan stops delivering messages #640

Closed - daviddeutsch closed this 1 year ago

daviddeutsch commented 2 years ago

I've been chasing this bug for months now and it's quite elusive, but honestly, I'm not even sure whether it's an nchan bug or whether I'm "holding it wrong".

I have a daemon running on my server that reads from and writes to a websocket, and those messages arrive at the client (JS in the browser) just fine. It seems, though, that following some high-traffic events where the daemon writes a lot into the websocket, the connection no longer accepts messages from the daemon.

The strange thing is: I can still receive messages just fine. Same thing for the client: all messages sent in get an echo and so forth.

I can see in the daemon that the connection breaks at some point and is then re-established, but it now seems to be one-way only. Of course, I'm establishing the connection the exact same way every time, so it feels, strangely, like some kind of penalty session where a badly behaving client is no longer allowed to send but can still receive.

Again - it might just be my daemon doing something strange when writing to the connection, so ANY pointer for how I could debug this would be incredibly helpful.

I'm already tracking how often I write into the socket and how much data and so forth, but no reliable pattern has emerged.


This might be a helpful addition: for a while, I DID have an issue where I could no longer properly READ from the websocket in my daemon either. It turned out that the previous read had stopped mid-message, garbling all subsequent reads because they would treat the broken tail end of the previous message as the start of the new one.

Perhaps this is the same issue the other way around: the message my daemon (for whatever reason) sends into the websocket gets cut off and, going forward, all subsequent messages that arrive in nchan are corrupted like this?

daviddeutsch commented 2 years ago

I wanted to check which version I have installed through nchan_stub_status, but it did not return any version, so as per the changelog, that probably means my version is <1.1.5.
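
(For reference, the check itself is just an HTTP GET against a stub-status location - here assuming the /nchan_stub_status location on the internal vhost shown later in this thread:)

curl -s http://mydomain.internal/nchan_stub_status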

This is using the latest Debian package on Ubuntu. Quite a shame that it's so outdated - I honestly did not expect it to be FIVE+ years old.

I'll try to update nginx and nchan first thing tomorrow morning and see where that gets me.

slact commented 2 years ago

Yep, that version is 5 years old. I ping them now and again to update, but it seems to fall on deaf ears.

daviddeutsch commented 2 years ago

@slact it might help to make that explicit on your site.

To make matters worse, they use a completely different version number, and I have found no way of tracing which version of nchan was even being used. Even the internal package changelog only notes changes to the package itself, not to the packages it includes.

daviddeutsch commented 2 years ago

@slact the bad news is - it seems like it's still happening.

I upgraded to nginx 1.20.1 and nchan 1.2.15 this morning and am monitoring the server quite closely. Any help on how I could debug this would be greatly appreciated.

daviddeutsch commented 2 years ago

Again - the error is that I have a worker process that keeps a websocket connection open and pushes update messages that get read by the browser.

After the update, it actually seems to have gotten worse: I used to be able to still read messages in my worker process, but that no longer happens now.

Things that I'm currently considering:

What is most puzzling to me is that I try to re-establish the connection in my worker process, but that leads to the failed state of being connected yet unable to read or write - only when I completely restart the process does it work out.

daviddeutsch commented 2 years ago

Apologies for the message spam. Just taking note of my debugging process.

Just after restarting, the stub_status reads:

total published messages: 4
stored messages: 0
shared memory used: 32K
shared memory limit: 131072K
channels: 19
subscribers: 26
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 63
interprocess alerts in transit: 27
interprocess queued alerts: 27
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.15

a short time later, with no big updates happening, it is again in an unresponsive state and the status reads:

total published messages: 21
stored messages: 0
shared memory used: 44K
shared memory limit: 131072K
channels: 19
subscribers: 43
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 63
interprocess alerts in transit: 59
interprocess queued alerts: 59
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.15

I'll enable channel events and debug this further.

daviddeutsch commented 2 years ago

Alright, so a different bug turned out to be the culprit of the "it's actually worse now" problem - this time in my own code.

I'll monitor this for a while and either update this if the problem does re-emerge or close if the update was indeed the fix-for-good (:crossed_fingers: :expressionless: :crossed_fingers:).

daviddeutsch commented 2 years ago

Alright, the issue has not resurfaced, so I'm closing this.

@slact two things, if you don't mind:

daviddeutsch commented 2 years ago

Alright, there is definitely still something there. In the meantime, it happened once or twice, but right now, I cannot get nchan to react to anything whatsoever.

The previous two times, I just restarted nginx (sometimes I had to stop it, wait a while, then restart) and it would work again.

Right now, it's not echoing out anything whatsoever.

@slact Any pointers on how I might debug this or provide more information?

daviddeutsch commented 2 years ago

I've run into this a few more times, but it usually resolved itself within a few nginx restarts. Today, it does not, so I could do some more debugging. This is what the nchan_stub_status reports:

total published messages: 0
stored messages: 0
shared memory used: 16K
shared memory limit: 131072K
channels: 0
subscribers: 0
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 0
interprocess alerts in transit: 0
interprocess queued alerts: 0
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.15

then, shortly afterwards:

total published messages: 15
stored messages: 0
shared memory used: 32K
shared memory limit: 131072K
channels: 3
subscribers: 10
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 39
interprocess alerts in transit: 24
interprocess queued alerts: 24
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.15

this is after sending a few messages from the browser. Do note that I certainly don't have 10 subscribers active, so I'm not sure where that number is coming from.

I've also put in a channel_events path as per the manual, but when I connect to it via websocat, the connection is successful yet entirely silent.
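
(For reference, the connection test was a plain websocat invocation against the internal vhost - path and hostname as in the config below:)

websocat ws://mydomain.internal/channel_events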

@slact do let me know what else I can do to help debug this.

slact commented 2 years ago

Could you show me your publisher and subscriber location configs, as well as the channel event subscriber location? It's odd to me that the stored messages count remains at 0.

daviddeutsch commented 2 years ago

With pleasure!

server {
        listen [::]:443 ssl http2;
        listen 443 ssl http2;
        ssl_certificate /root/.acme.sh/mydomain.com/mydomain.com.cer;
        ssl_certificate_key /root/.acme.sh/mydomain.com/mydomain.com.key;

        server_name mydomain.com www.mydomain.com;

        error_page 401 /401.html;
        error_page 403 /403.html;
        error_page 404 /404.html;
        error_page 500 /500.html;

        error_log /var/www/mydomain.com/error.log warn;
        access_log /var/www/mydomain.com/access.log anonymized;

        charset UTF-8;

        location / {
                root /var/www/mydomain.com/rendered/master/public;

                index index.html;

                try_files $uri $uri/ /index.html =404;

                location = /ws-auth {
                        proxy_pass http://mydomain.internal/ws-auth;
                        proxy_pass_request_body off;

                        proxy_set_header Content-Length "";
                        proxy_set_header X-Subscriber-Type $nchan_subscriber_type;
                        proxy_set_header X-Publisher-Type $nchan_publisher_type;
                        proxy_set_header X-Prev-Message-Id $nchan_prev_message_id;
                        proxy_set_header X-Channel-Id $nchan_channel_id;
                        proxy_set_header X-Original-URI $request_uri;
                        proxy_set_header X-Forwarded-For $remote_addr;
                }

                location = /auth {
                        proxy_pass http://mydomain.internal/auth;
                        proxy_pass_request_body off;

                        proxy_set_header Content-Length "";
                        proxy_set_header X-Original-URI $request_uri;
                        proxy_set_header X-Forwarded-For $remote_addr;
                }

                location ~ "/secure\/pubsub\/([a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12})" {
                        nchan_channel_id mydomain.com/$1;

                        nchan_authorize_request /ws-auth;

                        nchan_pubsub;
                        nchan_subscriber_first_message 0;
                }

                location = /secure/pubsub {
                        nchan_channel_id mydomain.com;

                        nchan_authorize_request /ws-auth;

                        nchan_pubsub;
                        nchan_subscriber_first_message 0;
                }
        }
}

server {
        listen 127.0.0.1:80;

        server_name mydomain.internal;

        location ~ "\/pubsub\/([a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12})" {
                nchan_channel_id mydomain.com/$1;
                nchan_pubsub;
                nchan_subscriber_first_message 0;
        }

        location ~ /channel_events {
            nchan_subscriber;
            nchan_channel_group meta;
            nchan_channel_id mydomain.com;
        }

        location /nchan_stub_status {
            nchan_stub_status;
        }

        location /pubsub {
                nchan_channel_id mydomain.com;
                nchan_pubsub;
        }
}

I've stripped out all of the things that I think you wouldn't need anyhow, let me know if anything is missing.

Oh, and - the problem has, once again, just rectified itself after a few more restarts. I still have this eerie suspicion that the connections somehow remain in a kind of zombified state, and only after enough restarts does it discard them entirely and start fresh.

slact commented 2 years ago

Based on the way you've configured your channel_events location, you need to add nchan_channel_events_channel_id to the publisher and subscriber locations to enable the feature.

location ~ "\/pubsub\/([a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12})" {
                nchan_channel_id mydomain.com/$1;
                nchan_pubsub;
                nchan_subscriber_first_message 0;
                nchan_channel_events_channel_id mydomain.com;
        }

        location ~ /channel_events {
            nchan_subscriber;
            nchan_channel_group meta;
            nchan_channel_id mydomain.com;
        }
        #[...]

        location /pubsub {
                nchan_channel_id mydomain.com;
                nchan_pubsub;
                nchan_channel_events_channel_id mydomain.com;
        }
}

Bear in mind this will send all events to a single meta channel named mydomain.com. This will be a performance bottleneck should you be testing this with heavy traffic.

daviddeutsch commented 2 years ago

My traffic isn't that heavy, so testing it this way would be feasible. I've added the lines (in both the first and the second server block).

I'm not sure I understand why this would be helpful in testing, though. Perhaps to explain my setup - I have one main "root" pubsub channel and then individual "client" channels. A browser client connects to "root" and their own (uuid-denoted) channel. The "root" channel is the one I care about, mostly, and the one that goes away.

So basically - I do know that I'm not getting channel events for anything but the "root" channel - but that's the one I'm trying to debug. Or do you mean that I might be missing important events on those side channels that could help debug this? Or does nchan need an explicit id in any case here?

daviddeutsch commented 2 years ago

Just had a minor event that left the websocket unresponsive, but restarting nginx solved it. I was listening via websocat and this is the tail end of the output:

channel_publish mydomain.com
[DEBUG websocat::ws_peer] incoming text
[DEBUG websocat::readdebt] Fullfulling the debt of 34 bytes
channel_publish mydomain.com
[DEBUG websocat::ws_peer] incoming text
[DEBUG websocat::readdebt] Fullfulling the debt of 34 bytes
channel_publish mydomain.com
[INFO  websocat::ws_peer] incoming None
[DEBUG websocat::my_copy] BrokenPipe: read_done
[DEBUG websocat::my_copy] done
[INFO  websocat::sessionserve] Reverse finished
[DEBUG websocat::sessionserve] Reverse shutdown finished
slact commented 2 years ago

Someone killed that connection. Is this running through a load balancer?

daviddeutsch commented 2 years ago

Nope, no load balancing or anything else in the way. Just current nginx built with the latest nchan and then the browser.

I'm a little light on the details here. To clarify:

[INFO  websocat::ws_peer] incoming None
[DEBUG websocat::my_copy] BrokenPipe: read_done
[DEBUG websocat::my_copy] done

is websocat telling me that a null message was received and the pipe was closed (or rather: it just went away)?

slact commented 2 years ago

It's saying TCP told it that the connection was closed.

daviddeutsch commented 2 years ago

Hmm. Any idea on how I could debug this further?

daviddeutsch commented 2 years ago

Another incident yesterday. Once again - stopping nginx for about 5 minutes and then restarting it did the trick.

I checked the systemd logs to see if anything might have happened, but I did not see anything relevant:

Jun 17 19:01:44.290731 [myserver] sshd[1377951]: Received disconnect from 61.177.173.35 port 50405:11:  [preauth]
Jun 17 19:01:44.290752 [myserver] sshd[1377951]: Disconnected from authenticating user root 61.177.173.35 port 50405 [preauth]
Jun 17 19:01:47.960234 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:01:57.963416 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:02:07.956984 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:02:17.958992 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:02:27.961425 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:02:37.964250 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:02:47.966617 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:02:57.969873 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:03:07.962372 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:03:17.964782 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:03:27.967031 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:03:37.969856 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:03:47.971700 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:03:57.954633 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:04:07.967757 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:04:17.969422 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:04:27.966025 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:04:37.957789 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:04:47.971008 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:04:57.963045 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:05:07.964460 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:05:17.978042 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:05:27.960793 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:05:37.963039 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:05:47.964832 [myserver] mydaemon[447051]: PHP Warning:  [WS] READ ERROR mydomain.internal, [channel-uuid]: Reading header from websocket failed.
Jun 17 19:07:48.430634 [myserver] sshd[1449706]: Unable to negotiate with 61.177.173.54 port 56170: no matching key exchange method found. Their offer: diffie-hellman-group1-sha1,diffie-hellman-group14-sha1,diffie-hellman-group-exchange-sha1 [preauth]
Jun 17 19:08:17.671767 [myserver] sshd[1455439]: Invalid user ftpuser from 92.255.85.70 port 59238
Jun 17 19:08:17.717975 [myserver] sshd[1455439]: Received disconnect from 92.255.85.70 port 59238:11: Bye Bye [preauth]
Jun 17 19:08:17.717988 [myserver] sshd[1455439]: Disconnected from invalid user ftpuser 92.255.85.70 port 59238 [preauth]
Jun 17 19:09:01.100299 [myserver] CRON[1464004]: pam_unix(cron:session): session opened for user root by (uid=0)
Jun 17 19:09:01.100596 [myserver] CRON[1464005]: (root) CMD (  [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)
Jun 17 19:09:01.101254 [myserver] CRON[1464004]: pam_unix(cron:session): session closed for user root
Jun 17 19:09:04.462833 [myserver] systemd[1]: Starting Clean php session files...
Jun 17 19:09:04.555220 [myserver] systemd[1]: phpsessionclean.service: Succeeded.
Jun 17 19:09:04.555419 [myserver] systemd[1]: Finished Clean php session files.

In my event loop, I'm polling the socket about every 2 ms. I'm getting consistent warnings - every 10 seconds - about not being able to read the header from the socket (it tries to fread the first two bytes and throws a warning if it can't read the socket), up until 19:06, and then they stop. Those warnings are my canary in the coal mine here, I guess.
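
For context, the daemon-side check that produces those warnings boils down to something like this (a simplified sketch, not the actual daemon code - the function name and error string are illustrative):

<?php
// Read the fixed 2-byte websocket frame header (FIN/opcode + mask/length bits).
// If the read comes up short, emit a warning - the "canary" seen in the logs above.
function read_frame_header($socket) {
    $header = fread($socket, 2);
    if ($header === false || strlen($header) < 2) {
        trigger_error('[WS] READ ERROR: Reading header from websocket failed.', E_USER_WARNING);
        return null;
    }
    return $header;
}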

Other than those messages, it's just the odd bots trying to get into the server and the php session cleaner. Both don't appear to be related to the failure.

The error pattern is odd, though - it seems like it can read from the socket just fine after that (otherwise it'd throw an error); it's just not getting any data anymore.

daviddeutsch commented 2 years ago

Good news!

Another one, just now. The stub reads:

total published messages: 4098
stored messages: 20
shared memory used: 136K
shared memory limit: 131072K
channels: 16
subscribers: 60
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 249844
interprocess alerts in transit: 181
interprocess queued alerts: 181
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.15

The subscriber count is certainly too high. It seems like there are a lot of dangling subscribers that have already logged off?

I open a websocat session and restart nginx:

[INFO  websocat::stdio_threaded_peer] get_stdio_peer (threaded)
[DEBUG websocat::sessionserve] Underlying connection established
[INFO  websocat::ws_client_peer] get_ws_client_peer
[INFO  websocat::ws_client_peer] Connected to ws
[INFO  websocat::ws_peer] Received WebSocket close message
[DEBUG websocat::ws_peer] The close message is Some(CloseData { status_code: 1001, reason: "410 Gone" })
[DEBUG websocat::my_copy] BrokenPipe: read_done
[DEBUG websocat::my_copy] done
[INFO  websocat::sessionserve] Reverse finished
[DEBUG websocat::sessionserve] Reverse shutdown finished

then reconnect and send in some data from the browser:

[INFO  websocat::stdio_threaded_peer] get_stdio_peer (threaded)
[DEBUG websocat::sessionserve] Underlying connection established
[INFO  websocat::ws_client_peer] get_ws_client_peer
[INFO  websocat::ws_client_peer] Connected to ws
[DEBUG websocat::ws_peer] incoming text
[DEBUG websocat::readdebt] Fullfulling the debt of 71 bytes
channel_publish [mychannel]
[DEBUG websocat::ws_peer] incoming text
[DEBUG websocat::readdebt] Fullfulling the debt of 71 bytes
channel_publish [mychannel]
[DEBUG websocat::ws_peer] incoming text
[DEBUG websocat::readdebt] Fullfulling the debt of 71 bytes
channel_publish [mychannel]
[DEBUG websocat::ws_peer] incoming text
[DEBUG websocat::readdebt] Fullfulling the debt of 71 bytes
channel_publish [mychannel]
[DEBUG websocat::ws_peer] incoming text
[DEBUG websocat::readdebt] Fullfulling the debt of 74 bytes
subscriber_enqueue [myotherchannel]
[DEBUG websocat::ws_peer] incoming text
[DEBUG websocat::readdebt] Fullfulling the debt of 37 bytes
subscriber_enqueue schnick-schnack.de
[INFO  websocat::ws_peer] incoming None
[DEBUG websocat::my_copy] BrokenPipe: read_done
[DEBUG websocat::my_copy] done
[INFO  websocat::sessionserve] Reverse finished
[DEBUG websocat::sessionserve] Reverse shutdown finished

it looks similar in the browser:

[screenshot]

Then I restart nginx, the browser reconnects, but new messages no longer get an echo:

[screenshot]

But now for the good part: an honest-to-goodness segfault happened during the restart:

Jun 18 09:14:11.053615 [myserver] systemd[1]: Stopping The NGINX HTTP and reverse proxy server...
Jun 18 09:14:11.061542 [myserver] systemd[1]: nginx.service: Succeeded.
Jun 18 09:14:11.061775 [myserver] systemd[1]: Stopped The NGINX HTTP and reverse proxy server.
Jun 18 09:14:11.062857 [myserver] systemd[1]: Starting The NGINX HTTP and reverse proxy server...
Jun 18 09:14:11.071582 [myserver] nginx[4072684]: nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
Jun 18 09:14:11.072980 [myserver] nginx[4072684]: nginx: configuration file /etc/nginx/nginx.conf test is successful
Jun 18 09:14:11.084185 [myserver] systemd[1]: Started The NGINX HTTP and reverse proxy server.
Jun 18 09:14:24.310854 [myserver] kernel: nginx[4072733]: segfault at 55fe525f81c0 ip 000055fe502b1e3d sp 00007ffcd2211580 error 4 in nginx[55fe50292000+116000]
Jun 18 09:14:24.310939 [myserver] kernel: Code: 80 00 00 00 00 48 8b 46 08 48 2b 06 48 01 ca 48 83 c0 58 48 01 d0 c3 66 0f 1f 44 00 00 f3 0f 1e fa 55 49 89 f8 48 89 d5 89 f7 <48> 8b 52 40 4c 89 c6 ff 55 38 f6 45 48 10 75 0b 31 c0 5d c3 0f 1f

Two more restarts after this:

Jun 18 09:17:01.229397 [myserver] CRON[4072740]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 18 09:17:01.230748 [myserver] CRON[4072739]: pam_unix(cron:session): session closed for user root
Jun 18 09:17:23.669685 [myserver] sshd[4072742]: Invalid user alvin from 161.35.84.23 port 54572
Jun 18 09:17:23.694537 [myserver] sshd[4072742]: Received disconnect from 161.35.84.23 port 54572:11: Bye Bye [preauth]
Jun 18 09:17:23.694559 [myserver] sshd[4072742]: Disconnected from invalid user alvin 161.35.84.23 port 54572 [preauth]
Jun 18 09:18:24.146965 [myserver] sshd[4072746]: Invalid user admin from 103.252.219.170 port 37968
Jun 18 09:18:24.298653 [myserver] sshd[4072746]: Received disconnect from 103.252.219.170 port 37968:11: Bye Bye [preauth]
Jun 18 09:18:24.298675 [myserver] sshd[4072746]: Disconnected from invalid user admin 103.252.219.170 port 37968 [preauth]
Jun 18 09:20:17.164151 [myserver] systemd[1]: Stopping The NGINX HTTP and reverse proxy server...
Jun 18 09:20:17.170495 [myserver] systemd[1]: nginx.service: Succeeded.
Jun 18 09:20:17.170724 [myserver] systemd[1]: Stopped The NGINX HTTP and reverse proxy server.
Jun 18 09:20:17.171774 [myserver] systemd[1]: Starting The NGINX HTTP and reverse proxy server...
Jun 18 09:20:17.180523 [myserver] nginx[4072775]: nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
Jun 18 09:20:17.181940 [myserver] nginx[4072775]: nginx: configuration file /etc/nginx/nginx.conf test is successful
Jun 18 09:20:17.192671 [myserver] systemd[1]: Started The NGINX HTTP and reverse proxy server.
Jun 18 09:21:41.000762 [myserver] systemd[1]: Stopping The NGINX HTTP and reverse proxy server...
Jun 18 09:21:41.007014 [myserver] systemd[1]: nginx.service: Succeeded.
Jun 18 09:21:41.007293 [myserver] systemd[1]: Stopped The NGINX HTTP and reverse proxy server.
Jun 18 09:21:41.008421 [myserver] systemd[1]: Starting The NGINX HTTP and reverse proxy server...
Jun 18 09:21:41.017243 [myserver] nginx[4072835]: nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
Jun 18 09:21:41.018599 [myserver] nginx[4072835]: nginx: configuration file /etc/nginx/nginx.conf test is successful
Jun 18 09:21:41.031350 [myserver] systemd[1]: Started The NGINX HTTP and reverse proxy server.

And now it's connecting properly again: [screenshot]

And the stub_status shows half the number of subscribers:

total published messages: 60
stored messages: 24
shared memory used: 32K
shared memory limit: 131072K
channels: 23
subscribers: 29
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 5574
interprocess alerts in transit: 0
interprocess queued alerts: 0
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.15

I'm still puzzled by this - are these all resurrected socket connections?

The recovery was a lot faster this time, so that part is still inconsistent, but it seems possible that I could at least write a short watcher script that checks websocket pingbacks and restarts nginx until it works again. That would improve my uptime somewhat :sweat_smile:

I'd still like to get rid of the bug, though - but debugging segfaults in nginx is a bit beyond my abilities. Again - let me know what I can do to figure this out.

daviddeutsch commented 2 years ago

Hmm, I've since had at least one other incident where messages were no longer echoed and there was no segfault…

slact commented 2 years ago

The fact that you have to restart for a certain amount of time, or a certain number of times, to fix the issue means the problem lies outside of Nchan and Nginx. Otherwise a single restart would fix it, because a restart closes all connections and resets the entire state of the server.

Have you tried enabling websocket pings? You can use nchan_websocket_ping_interval to send ping frames to subscribers -- either they will get cleaned up due to undelivered pongs, or the pings will keep the connection alive if something is killing it in the middle. Either way, give it a try and let me know if anything changes.
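
For example (a sketch only - the 30-second interval is an arbitrary illustration), the directive slots into the existing pubsub locations:

location = /secure/pubsub {
        nchan_channel_id mydomain.com;
        nchan_authorize_request /ws-auth;
        nchan_pubsub;
        nchan_subscriber_first_message 0;
        nchan_websocket_ping_interval 30;
}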

daviddeutsch commented 2 years ago

means the problem lies outside of Nchan and Nginx

Yeah, I'm kind of racking my brain about this one because… well, there just isn't much running on this server. My local machine has ten times the processes that could get in the way, and it never shows the same failure pattern. But on the webserver, it's literally nginx, nchan, fail2ban, php-fpm, systemd (and a bunch of associated processes) and then a bunch of standard Ubuntu daemons (like cupsd, cron…). You could write the list on a napkin, basically.

The only one of these that I'm not 100% sure about is php-fpm, because I have had it fail on me before, occasionally and in weird ways, but so far I could not make out a pattern connecting it to the nchan woes.

If you had an idea for what I could look into here, or just in general some insight into what kind of pattern I'd be looking for here, that would be greatly appreciated.

I think I did try setting a ping interval before, but I'll re-enable it to see whether it does anything.

slact commented 2 years ago

I think this calls for firing up Wireshark and figuring out which side is ending the connection and with what packet. That should give you some clues.

daviddeutsch commented 2 years ago

Huh, alright, I'll see if I can set up a test for that. (And I'm picking up on the subtext that I'm testing your patience, if that is the case - apologies)

I do still wonder, though - this would assume that there is an event that can happen which puts nginx/nchan into a state where it starts up, I can connect to it and send messages, but they no longer get an echo?

slact commented 2 years ago

I'm picking up on the subtext that I'm testing your patience, if that is the case - apologies

Yes, but that's ok. It's there to be tested.

assume that there is an event that can happen which puts nginx/nchan into a state where it starts up, I can connect to it and send messages, but they no longer get an echo?

Not necessarily - I just want to see what exactly is going on at the packet level -- who's closing the connection, and how it is being reopened. Right now I strongly suspect the issue is happening outside of Nchan & Nginx, and one good way to falsify that suspicion is with a pcap session showing that everything else is working correctly.

daviddeutsch commented 2 years ago

It's there to be tested.

Hah! We'll see about that!

As a first-time Wireshark user, setting everything up was… surprisingly straightforward, even with the TLS stuff. (Plus it already paid off, because I figured out I had an endless loop in my ServiceWorker that kept hammering the server for a favicons.ico that isn't there.)
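
(For anyone retracing this: the usual trick for decrypting browser TLS in Wireshark is a TLS key log file - a sketch, assuming a browser that honors SSLKEYLOGFILE:)

# Have the browser dump TLS session keys, then point Wireshark at the file
# under Preferences -> Protocols -> TLS -> (Pre)-Master-Secret log filename.
export SSLKEYLOGFILE="$HOME/tls-keys.log"
chromium &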

Seems to be working fine; this is the beginning of a session: [screenshot]

So should I just keep that running all the time until another incident happens, and then check the logs? Or would it be better to run it as a background task on the server?

Also - Is it normal that my connection starts with a websocket close?

slact commented 2 years ago

Is it normal that my connection starts with a websocket close?

Nope, that's weird. If you look at the associated ports you'll probably see that it's from a different session. I also don't see the initiating HTTP request / Websocket handshake.

daviddeutsch commented 2 years ago

Ah yeah, I was filtering that out, sorry. Here is what happens when I clean the log, close the browser and reopen it:

[screenshot]

The first thing that surprised me was that no connection is closed when I close the browser - the log just stays silent. Perhaps that is something I should be doing explicitly in my code? I was just assuming the browser would close those automatically… Then again, I did have some code for this, but perhaps it's not being called properly:

[screenshot]

Anyhow, the WebSocket entries then, again, start with a Connection Close after the 101.

It also, apparently, tries to connect three times when it should only do two - the main "system notifications" channel and the user channel. It connects to the user channel twice, though.

I'm not sure where I would be looking for session info. Digging into the details of the Connection Close message, I could not find any mention of it. What layer would that be, anyhow?

daviddeutsch commented 2 years ago

And this is what it looks like when I explicitly use stop() on those two connections and then reload the page in my browser:

[screenshot]

slact commented 2 years ago

You can ID the TCP connections easily by the source & destination ip & port pairs. You have those columns collapsed, but they carry important info.

daviddeutsch commented 2 years ago

Well yeah, for (questionable) security purposes :wink: It's just the two IPs repeated, though - mine and that of the server.

slact commented 2 years ago

Right, the ports are in different columns from the IPs. My bad.

daviddeutsch commented 2 years ago

Understood - will add columns for the src/dest ports in future reports :handshake:

This is disconnecting/reconnecting with a running browser: [screenshot]

This is what it looks like when I close and restart the browser: [screenshot]

Both times, I did three tries for good measure.

daviddeutsch commented 2 years ago

No incident so far, so I'm just taking note of what seems odd, like this reconnect: [screenshot] Also saw this a few times: [screenshot] And another reconnect: [screenshot]

daviddeutsch commented 2 years ago

OK, the plot might be thickening yet again. I didn't catch the disconnect this morning, but my client reported not being able to connect. It seems, though, that nchan was running just fine and it was my auth proxy that was broken.

Here is the log from when I connected:

[screenshot]

As you can see, the odd thing is that I get straight-up 502/504 errors, but then it connects just fine and even receives websocket pings! However, as we know - my own messages never go anywhere.

When I then restart php-fpm (there's a standard PHP FastCGI setup behind /ws-auth), I can reconnect just fine!

I will test this in more detail, but perhaps that is already our actual error state here: nchan_authorize_request can apparently leave you in some kind of zombie connection state if the auth proxy messes up.

daviddeutsch commented 2 years ago

I worked on php-fpm some more (increased the worker pool size, made sure the daemon is restarted on failure), but there was still another incident.

Also - just restarting php-fpm did not do the trick; I also needed to restart nginx. It worked fine after that, though, so no multiple restarts or long waits - perhaps that part was indeed due to php-fpm.

daviddeutsch commented 2 years ago

Alright, now I definitely had Wireshark running during a disconnect. But I'm not quite sure what I'm looking for. I'll quote a few things that stood out to me, and maybe you can let me know what is actually interesting: [screenshot] [screenshot]

morebrackets commented 1 year ago

Hey David, I had the same issues no matter what I tried. I threw my hands up and built my own websockets server. Too bad because nchan has potential.

daviddeutsch commented 1 year ago

@morebrackets interesting - would you be able to share more details about your setup and how the error manifested? Perhaps that would help establish a pattern.

Good to hear you found a solution although I'd rather not take it that far (building my own websockets server) because I really like nchan as a "close to the metal" solution. :sweat_smile:

@slact did you have a chance to review my screenshots?

morebrackets commented 1 year ago

@daviddeutsch It's been about 6 months, but I remember that nchan would freeze or segfault randomly when a client disconnected. This occurred most frequently when about 10 or more concurrent connections were open. The weird part was that the remaining client websocket connections would remain intact - likely to nginx - but nchan itself was unresponsive.

daviddeutsch commented 1 year ago

@morebrackets thanks, that might already be helpful.

Is there anything you can share about the setup, though? As I've written above, I have a suspicion that php-fpm being flaky might have something to do with this.

morebrackets commented 1 year ago

@daviddeutsch I was using php-fpm as well, for authentication, and had no problems with it. The issue I ran into happened on disconnect. I believe there was a segfault error appearing in the system logs, but I don't recall specifically. Want me to share my PHP script?

daviddeutsch commented 1 year ago

@morebrackets it does not sound like it's relevant, but sure - if you'd like to share it, it certainly can't hurt.

@slact I've now added another daemon to my setup that pings my websocket every five seconds and, when it finds that it no longer gets an echo, stops my other (websocket-listening) services, then restarts nginx and php-fpm, and brings the services back up - see the sketch after the next paragraph.

It feels like a giant cheat, but it's at least better than doing it all by hand. Also - this way I'll get some more data on how often this happens, and clear entries in the log tables that might help figure out the timing of these incidents.
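
The recovery step itself is nothing fancier than this (a sketch - service and unit names here are illustrative, not my actual ones):

# Stop the websocket-listening services, bounce nginx and php-fpm,
# then bring the services back up.
systemctl stop mydaemon
systemctl restart nginx php-fpm
systemctl start mydaemon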

daviddeutsch commented 1 year ago

My upchecker daemon caught the first incident tonight, and I was able to identify a cause: apparently, when the acme.sh cron job does its thing, it leaves nginx/nchan in our zombie state.

This is weird, though, because I can't even see how acme.sh restarts nginx. Hmm…
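
(One thing worth checking: as far as I can tell, acme.sh only touches nginx on renewal either via a --reloadcmd registered at install time, or by itself when the cert was issued in --nginx mode. A registration would look along these lines - a sketch, with paths matching my config above:)

acme.sh --install-cert -d mydomain.com \
        --key-file       /root/.acme.sh/mydomain.com/mydomain.com.key \
        --fullchain-file /root/.acme.sh/mydomain.com/mydomain.com.cer \
        --reloadcmd      "systemctl reload nginx"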

daviddeutsch commented 1 year ago

Just to confirm: it fails somewhat consistently after the acme.sh cron runs:

Jul 14 00:41:13.655355  upchecker[3335013]: [WS] no longer hearing back from websocket, stopping services
Jul 16 00:41:09.436351  upchecker[3335013]: [WS] no longer hearing back from websocket, stopping services
Jul 19 00:41:07.239533  upchecker[3335013]: [WS] no longer hearing back from websocket, stopping services
slact commented 1 year ago

@daviddeutsch can you try the latest Nchan version (1.3.1)? I fixed a bunch of stuff and may have touched upon this bug.

daviddeutsch commented 1 year ago

@slact finally managed to go through with the update - I'll keep you posted! Thanks!

daviddeutsch commented 1 year ago

I will keep watching for another week, but so far I have not run into another problem, aside from the nighttime acme.sh resets.

I guess in a week's time, we can close this one for good. :+1: