I work with @nickstenning and wanted to provide additional context to this bug:
It looks like this is the same bug as #670.
Doing a bit of debugging, it appears that when a spool is cycled via the `spool_fetch_msg_callback` and `spool_fetch_msg` functions, if someone has passed an invalid message id, the code falls through from `case MSG_NOTFOUND` to `case MSG_EXPIRED`, which in turn results in hitting https://github.com/slact/nchan/blob/ff48ec7e50cbd63ece8aab9468acf5aaa9a1f0b7/src/store/spool.c#L479 with a status of `MSG_EXPECTED` instead of `MSG_INVALID`.
Since this is an `assert()` (yes, I know we can define `NDEBUG` to disable `assert()`), this highlights that there are scenarios where the message status is not the type expected. Either invalid message IDs should be caught earlier and `spool->msg_status` should be set to `MSG_INVALID`, or the code should handle the case of the message status being incorrect.
The simple patch I've applied to our system in production adds handling to convert a `MSG_EXPECTED` status to `MSG_INVALID` in spool.c, after the assertion that `spool->msg` is null (sketched roughly below).
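For context, this is roughly the shape such a workaround could take inside `spool_fetch_msg()`. It is an illustrative fragment based only on the description above, not the actual patch and not the upstream fix; `spool`, `MSG_EXPECTED`, `MSG_INVALID`, and the existing asserts belong to nchan's code.

```c
/* Illustrative sketch of the workaround described above (not the actual
 * patch, and not the upstream fix). After the existing check that no
 * message is attached to the spool, coerce an unexpected MSG_EXPECTED
 * status to MSG_INVALID so that the subsequent
 * assert(spool->msg_status == MSG_INVALID) does not abort the worker. */
assert(spool->msg == NULL);
if (spool->msg_status == MSG_EXPECTED) {
  spool->msg_status = MSG_INVALID;
}
assert(spool->msg_status == MSG_INVALID);
```

The effect is that an expired or bogus last-event id is treated like any other invalid message id instead of tripping the assertion and killing the worker.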
I didn't see a clear mechanism to validate the message ID passed in from the `Last-Event-ID` header or the `last_event_id` query string (a hypothetical shape for such a check is sketched below).
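For illustration only, a validation along those lines could start with a well-formedness check on the simple `time:tag` message id form seen elsewhere in this thread (e.g. `1704197750:0`). The sketch below is hypothetical, not nchan code, and `msgid_looks_valid` is an invented helper name:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sanity check, not nchan code: accepts the simple
 * "unix_time:tag" message id form seen in this thread (e.g. "1704197750:0").
 * A real check would also need to handle nchan's multipart ids and ids
 * newer than anything ever published on the channel. */
static int msgid_looks_valid(const char *s) {
  char *end;
  const char *p;

  long t = strtol(s, &end, 10);           /* timestamp part */
  if (end == s || *end != ':' || t < 0) return 0;

  p = end + 1;
  long tag = strtol(p, &end, 10);         /* tag part */
  if (end == p || *end != '\0' || tag < 0) return 0;

  return 1;
}

int main(void) {
  printf("%d\n", msgid_looks_valid("1704197750:0"));  /* 1: well-formed */
  printf("%d\n", msgid_looks_valid("not-a-msgid"));    /* 0: rejected    */
  return 0;
}
```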
This issue is amplified under load, as there is a higher likelihood that the spool will be run through the offending code. When this happens, all subscribers connected to the particular nginx worker lose their connection/stream and must reconnect to continue receiving.
Using the code as-is without defining `NDEBUG` results in an unreliable service. If the asserts are truly `assert()`s intended only for development, it seems some downstream distributions are not compiling with `NDEBUG` defined (this seems to be highlighted in #670). If they're intended to indicate critical "this is broken behavior" conditions, then this particular case needs to be updated. I'm interested to understand whether the `assert()` calls are guarding critical cases I'm not seeing.
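For reference, the `NDEBUG` behavior being discussed here is standard C: when `NDEBUG` is defined before `<assert.h>` is included, `assert()` expands to nothing; when it is not, a failed assertion calls `abort()`, which is the signal 6 seen in the nginx logs below. A minimal standalone illustration (not nchan code):

```c
/* Standard assert()/NDEBUG semantics, illustrated outside of nchan.
 *   cc demo.c          && ./a.out   -> assertion fails, abort() (signal 6)
 *   cc -DNDEBUG demo.c && ./a.out   -> check compiled out, prints and exits 0
 */
#include <assert.h>
#include <stdio.h>

int main(void) {
  int msg_status = 0;            /* stand-in for spool->msg_status */
  assert(msg_status == 1);       /* removed entirely when NDEBUG is defined */
  printf("assert was compiled out; execution continues\n");
  return 0;
}
```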
I'm happy to contribute my patch, but it feels like it might be the wrong approach for the project.
Reproduction steps: add load to nchan with the rough config Nick provided, then request a non-existent channel with an invalid id in the `Last-Event-ID` header. I found it was easier to reproduce if I interrupted the connection before the 30-second timeout. With no load on nchan it was much harder to get a crash. Not providing a last-event-id seems to preclude the assert being hit.
Ostensibly fewer workers will amplify it.
This is with a built-from-source version as Nick outlined.
I've recently moved part of our setup to the latest version of nchan and have been looking at this error. I have found a way to reproduce it via Docker, after only a single message.
nginx.conf

```nginx
# Generated from: https://github.com/slact/nchan#conceptual-overview
load_module /usr/local/nginx/modules/ngx_nchan_module.so;

worker_processes 5;

events {
  worker_connections 1024;
}

http {
  server {
    listen 80;

    location = /sub {
      nchan_subscriber websocket;
      nchan_channel_id foobar;
    }

    location = /pub {
      nchan_publisher;
      nchan_channel_id foobar;
      nchan_message_timeout 5s;
    }
  }
}
```
🧠 I think the `nchan_message_timeout` has something to do with this problem.
⚠️ I have the message timeout set rather short because I think it's related, so the timing of some of the commands below matters (e.g. querying the publisher to get the current queue length, etc.).
Run the container (even though the nginx.conf above is copied during the container build, I found that mapping the file made it easier to test config changes):

```
$ docker run -it -v "$(pwd)/nginx.conf:/etc/nginx/nginx.conf" -p 8000:80 nchan:msg-invalid
```
Publish:

```
$ curl -X POST -d "hello" "http://localhost:8000/pub"
```
Check enqueued:

```
$ curl -X GET "http://localhost:8000/pub"
queued messages: 1
last requested: -1 sec. ago
active subscribers: 0
last message id: 1704197750:0
```
Take note of the last message id ☝️
Wait 5 seconds, check again:

```
$ curl --verbose -X GET "http://localhost:8000/pub"
...
< HTTP/1.1 404 Not Found
...
```
Connect to the websocket (I use https://www.npmjs.com/package/wscat):

```
$ npx wscat -c "ws://localhost:8000/sub?last_message_id=1704197750:0"
Connected (press CTRL+C to quit)
>
```
✅ Get a valid websocket connection. Ctrl+D or Ctrl+C to kill the connection. Nginx log:
```
172.17.0.1 - - [02/Jan/2024:12:18:50 +0000] "GET /sub?last_event_id=1704197750:0 HTTP/1.1" 101 2 "-" "-"
```
Retry the websocket (last_message_id needs to have some value):

```
$ npx wscat -c "ws://localhost:8000/sub?last_message_id=1704197750:0"
Connected (press CTRL+C to quit)
Disconnected (code: 1006, reason: "")
```
Nginx log:

```
Assertion failed: spool->msg_status == MSG_INVALID (//nchan-1.3.6/src/store/spool.c: spool_fetch_msg: 479)
2024/01/02 12:22:34 [alert] 1#1: worker process 22 exited on signal 6 (core dumped)
```
Unknowns:
Another less consistent way to get the spool error:
Post a message onto the queue:

```
$ curl -X POST -d "hello" "http://localhost:8000/pub"
```
Listen to the websocket (note: using the current timestamp with `$(date +%s)`):

```
$ npx wscat -c "ws://localhost:8000/sub?last_event_id=$(date +%s):0" -x quit
```
☝️ Run the above a couple of times and get the following log:

```
172.17.0.1 - - [02/Jan/2024:12:32:15 +0000] "POST /pub HTTP/1.1" 202 101 "-" "curl/8.5.0"
172.17.0.1 - - [02/Jan/2024:12:32:19 +0000] "GET /sub?last_event_id=1704198738:0 HTTP/1.1" 101 11 "-" "-"
172.17.0.1 - - [02/Jan/2024:12:32:21 +0000] "GET /sub?last_event_id=1704198740:0 HTTP/1.1" 101 11 "-" "-"
172.17.0.1 - - [02/Jan/2024:12:32:23 +0000] "GET /sub?last_event_id=1704198742:0 HTTP/1.1" 101 11 "-" "-"
172.17.0.1 - - [02/Jan/2024:12:32:24 +0000] "GET /sub?last_event_id=1704198744:0 HTTP/1.1" 101 11 "-" "-"
172.17.0.1 - - [02/Jan/2024:12:32:26 +0000] "GET /sub?last_event_id=1704198745:0 HTTP/1.1" 101 11 "-" "-"
Assertion failed: spool->msg_status == MSG_INVALID (//nchan-1.3.6/src/store/spool.c: spool_fetch_msg: 479)
2024/01/02 12:32:27 [alert] 1#1: worker process 20 exited on signal 6 (core dumped)
```
Thanks for that @b-n, I'll have a fix shortly.
Fixed in master (commit 7765ec14be78c32a3522c019b35b79cc1b585968).
This fix touches a very hot code path, so it took some time to verify. I will tag a version in another week or so.
I've built a new image from the bleeding edge (e.g. d73ab7a034b8fa9cb18d59bbf473ca65048054e8), and we're currently using it in production.
Good news: it works as it did before, nothing in particular breaking. I was hoping this would fix the error we were seeing, but we're still seeing it come in with production workloads:
```
Assertion failed: spool->msg_status == MSG_INVALID (//nchan-d73ab7a034b8fa9cb18d59bbf473ca65048054e8/src/store/spool.c: spool_fetch_msg: 477)
2024/01/26 11:45:36 [alert] 1#1: worker process 2100 exited on signal 6
```
It's very possible that my reproduction steps above actually found a different error that you managed to fix, though 🤷. Note: I haven't tried my reproduction steps on the master changes; I'm essentially just running the master branch in production and looking at the logs. I'll see if I can gather any other information and/or trigger points.
@b-n I found that it didn't require a message on the queue for the requested channel (in fact, I think if the channel has a message there isn't a crash). (We aren't using websockets, but more boring pub/sub.)
The core issue is that someone is requesting an invalid message id (I think on a channel with nothing in it).
We have some users that poll ancient channels we will never publish to again (outside of a UUID collision), and it seems this is the triggering scenario. I also noticed that when the load on nchan dropped off, it was much, much less likely to trigger. There is some kind of weird racy condition. I wish I had more concrete repro steps for @slact to work from.
Yeah, I've noticed something similar too. We're using this in a homebrewed deployment tool (which we're slowly refactoring) to show the statuses of development environments and to stream logs to the UI.
Some developers keep the page pinned, so when their machine starts up it immediately starts pulling old messages from the last event time. In this case we're using nchan.js on the FE side. I even went as far as forcing a page refresh every hour (meta content reload) and overriding the localStorage value when the page loads. Of course, some people have simply had an inactive page since before I implemented those workarounds, so it still hasn't stopped people requesting old event ids. The last event id is still handy for us, since we want the client to reconnect if the connection drops.
I can confirm that everything appears to work fine until someone sends a request with an old last_event_id that has expired (at least, that's when I see it happening). I'm not sure whether it needs to be a specific event that has already expired, or whether any id "in the past, prior to an expired event" will do. It also looks like when the worker eventually gets this old message and dies, it takes all other connections with it, of course, which makes the log hard to read since the order things are reported to nginx is not necessarily the order you'd expect. Sorry for the wall of text!
I can confirm the id can be one that was never used/never existed; my local repro steps used both expired and never-existing ids.
@tempusfrangit could you please check the fix I posted to master? I believe this will fix your issue as well.
@b-n welp, ok. Think you can get me a trace from a coredump?
@b-n @tempusfrangit nevermind, I figured it out. Will post a fix later today.
Alright, fix pushed to master, with tests for both known ways to trigger the bug.
https://github.com/slact/nchan/blob/cc705a948cf6797f669b64649605f007f6f26a42/src/store/spool.c#L380-L381 No, it wasn't right. It wasn't right at all.
Ah nice! I'll give it a roll on Monday and let you know the results :smile:
Curious: if we were not using the in-memory version but the Redis-backed version, would we have skipped this issue altogether? (I was tempted to go through the code, but thought it's easier to just ask.)
@b-n we use the Redis-backed one, so you'd still have hit it.
@b-n Nchan uses the in-memory storage engine as a local cache layer for Redis, so yes this would affect Redis users too.
@slact Looks like there are some build errors showing up now.
This is a build where `NCHAN_VERSION` = bacb7b674899e37b56a277f9d4107d1b14696cd7.
(Note for the reproduction Dockerfile: the `v` needs to be removed from the `wget` call, since vGIT_HASH isn't a thing.)
PR here to fix it, @slact: https://github.com/slact/nchan/pull/679
I have tested it in my local Docker build, and everything compiles correctly after this small patch.
Also, if you'd like a GitHub Action that builds master to make sure it compiles, I'm more than happy to contribute one; just let me know and I'll put together a PR.
@slact I finally got around to actually implementing the change. It looks like there are no more MSG_INVALID errors coming through for us.
I will keep 👀 on this thread/the next release, since our current build targets the commit currently on master. But yeah, good news 😄
Hi, the same error is seen in the setup below:
```
$ /usr/local/nginx/sbin/nginx -V
nginx version: nginx/1.27.2
built with OpenSSL 1.0.2k-fips 26 Jan 2017
TLS SNI support enabled
configure arguments: --add-module=/opt/nchan-1.3.7/ --with-http_auth_request_module --with-http_ssl_module

$ cat nginx_error.log
/opt/nchan-1.3.7//src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2024/10/09 17:59:16 [alert] 1980#0: worker process 1986 exited on signal 6 (core dumped)
/opt/nchan-1.3.7//src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2024/10/09 18:30:46 [alert] 1980#0: worker process 1985 exited on signal 6 (core dumped)

$ cat /proc/version
Linux version 4.19.91-009.ali4000.alios7.x86_64 (admin@ae2f75727d32) (gcc version 6.5.1 20181026 (Alibaba 6.5.1-1) (GCC)) #1 SMP Mon Jan 25 10:47:38 CST 2021
```
Hello! We're running nchan in production and encountering this error which is regularly crashing worker processes and dumping core. We might be able to share a core dump if that's helpful but as it may contain customer data that might not be that easy unless we can find a way to reproduce the problem in a non-production environment.
The crash is occurring repeatedly for periods of a few hours and then not for hours at a time, which makes me think it may be related to a particular payload or client.
We're running nchan 1.3.6 and nginx 1.21.3 on alpine 3.18.2, i.e. with musl.
```
Assertion failed: spool->msg_status == MSG_INVALID (/usr/src/nchan-1.3.6/src/store/spool.c: spool_fetch_msg: 479)
```
We have a relatively straightforward nchan setup. I'll share some (lightly edited) configs below.
publish.conf

```nginx
nchan_shared_memory_size 1024M;

upstream streaming_api_redis {
  # stop nchan connecting to the read replica and treating it as a master
  nchan_redis_discovered_ip_range_blacklist 10.0.0.0/8;
  nchan_redis_server redis://.../0;
}

server {
  listen 8003;
  listen [::]:8003;
  server_name _;

  location ~ /_api/messages/(\w+)/events$ {
    nchan_publisher;
    nchan_redis_pass streaming_api_redis;
    nchan_channel_group messages;
    nchan_channel_id $1;
    nchan_message_buffer_length 2000;
  }
}
```

subscribe-http.conf

```nginx
server {
  listen 80 default_server;
  listen [::]:80 default_server;

  location = /health-check {
    access_log off;
    return 200 'OK';
  }

  location ~ /v1/messages/(\w+)$ {
    nchan_subscriber;
    nchan_redis_pass streaming_api_redis;
    nchan_channel_group messages;
    nchan_channel_id $1;
    nchan_subscriber_timeout 30;
    add_header 'Link' '
```

subscribe-https.conf

```nginx
ssl_certificate /var/lib/certs/current/tls.crt;
ssl_certificate_key /var/lib/certs/current/tls.key;

server {
  listen 80 default_server;
  listen [::]:80 default_server;
  server_name _;
  return 301 https://$host$request_uri;
}

server {
  listen 443 ssl default_server;
  listen [::]:443 ssl default_server;
  # Note: we can't enable http/2 as nchan event source stream doesn't
  # currently work: https://github.com/slact/nchan/issues/576
  server_name _;

  ssl_session_timeout 1d;
  ssl_session_cache shared:Default:10m;  # about 40000 sessions
  ssl_session_tickets off;
  ssl_dhparam /etc/nginx/dhparam.pem;
  ssl_protocols TLSv1.2 TLSv1.3;
  ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-CHACHA20-POLY1305;
  ssl_prefer_server_ciphers off;

  # enable STS when we're happy with the setup
  #add_header Strict-Transport-Security "max-age=63072000" always;

  # TODO: consider enabling OCSP stapling

  location = /health-check {
    access_log off;
    return 200 'OK';
  }

  location ~ /v1/messages/(\w+)$ {
    nchan_subscriber;
    nchan_redis_pass streaming_api_redis;
    nchan_channel_group messages;
    nchan_channel_id $1;
    nchan_subscriber_timeout 30;
    add_header 'Link' '
```