slact / nchan

Fast, horizontally scalable, multiprocess pub/sub queuing server and proxy for HTTP, long-polling, Websockets and EventSource (SSE), powered by Nginx.
https://nchan.io/

Worker process core dump: Assertion failed: spool->msg_status == MSG_INVALID #676

nickstenning opened this issue 8 months ago

nickstenning commented 8 months ago

Hello! We're running nchan in production and encountering this error, which is regularly crashing worker processes and dumping core. We might be able to share a core dump if that's helpful, but since it may contain customer data, that might not be easy unless we can find a way to reproduce the problem in a non-production environment.

The crash is occurring repeatedly for periods of a few hours and then not for hours at a time, which makes me think it may be related to a particular payload or client.

We're running nchan 1.3.6 and nginx 1.21.3 on alpine 3.18.2, i.e. with musl.

Assertion failed: spool->msg_status == MSG_INVALID (/usr/src/nchan-1.3.6/src/store/spool.c: spool_fetch_msg: 479)

We have a relatively straightforward nchan setup. I'll share some (lightly edited) configs below.

publish.conf

```nginx
nchan_shared_memory_size 1024M;

upstream streaming_api_redis {
  # stop nchan connecting to the read replica and treating it as a master
  nchan_redis_discovered_ip_range_blacklist 10.0.0.0/8;
  nchan_redis_server redis://.../0;
}

server {
  listen 8003;
  listen [::]:8003;
  server_name _;

  location ~ /_api/messages/(\w+)/events$ {
    nchan_publisher;
    nchan_redis_pass streaming_api_redis;
    nchan_channel_group messages;
    nchan_channel_id $1;
    nchan_message_buffer_length 2000;
  }
}
```
subscribe-http.conf

```nginx
server {
  listen 80 default_server;
  listen [::]:80 default_server;

  location = /health-check {
    access_log off;
    return 200 'OK';
  }

  location ~ /v1/messages/(\w+)$ {
    nchan_subscriber;
    nchan_redis_pass streaming_api_redis;
    nchan_channel_group messages;
    nchan_channel_id $1;
    nchan_subscriber_timeout 30;
    add_header 'Link' '; rel="self"';
  }
}
```
subscribe-https.conf

```nginx
ssl_certificate /var/lib/certs/current/tls.crt;
ssl_certificate_key /var/lib/certs/current/tls.key;

server {
  listen 80 default_server;
  listen [::]:80 default_server;
  server_name _;

  return 301 https://$host$request_uri;
}

server {
  listen 443 ssl default_server;
  listen [::]:443 ssl default_server;

  # Note: we can't enable http/2 as nchan event source stream doesn't
  # currently work: https://github.com/slact/nchan/issues/576

  server_name _;

  ssl_session_timeout 1d;
  ssl_session_cache shared:Default:10m;  # about 40000 sessions
  ssl_session_tickets off;

  ssl_dhparam /etc/nginx/dhparam.pem;

  ssl_protocols TLSv1.2 TLSv1.3;
  ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-RSA-CHACHA20-POLY1305;
  ssl_prefer_server_ciphers off;

  # enable STS when we're happy with the setup
  #add_header Strict-Transport-Security "max-age=63072000" always;

  # TODO: consider enabling OCSP stapling

  location = /health-check {
    access_log off;
    return 200 'OK';
  }

  location ~ /v1/messages/(\w+)$ {
    nchan_subscriber;
    nchan_redis_pass streaming_api_redis;
    nchan_channel_group messages;
    nchan_channel_id $1;
    nchan_subscriber_timeout 30;
    add_header 'Link' '; rel="self"';
  }
}
```
tempusfrangit commented 6 months ago

I work with @nickstenning and wanted to provide additional context to this bug:

It looks like this is the same bug as #670.

Doing a bit of debugging, it appears that when a spool is cycled via the spool_fetch_msg_callback and spool_fetch_msg functions, and someone has passed an invalid message id, the code falls through from `case MSG_NOTFOUND` to `case MSG_EXPIRED`, which in turn results in hitting https://github.com/slact/nchan/blob/ff48ec7e50cbd63ece8aab9468acf5aaa9a1f0b7/src/store/spool.c#L479 with a status of MSG_EXPECTED instead of MSG_INVALID.

Since this is an assert() (yes, I know we can define NDEBUG to disable assert()), this highlights that there are scenarios where the message status is not the one expected. Either invalid message IDs should be caught earlier and spool->msg_status set to MSG_INVALID, or the code should handle the case of the message status being incorrect.

The simple patch I've applied to our system in production adds handling to convert a MSG_EXPECTED to MSG_INVALID in spool.c after the assertion that spool->msg is null.

I didn't see a clear mechanism to validate the message ID passed in from Last-Event-ID header or the last_event_id query string.

This issue is amplified under load, as there is a higher likelihood that the spool will be run through the offending code. When this happens, all subscribers connected to the particular nginx worker lose their connection/stream and must reconnect to continue receiving messages.

Using the code as-is without defining NDEBUG results in an unreliable service. If the asserts are truly development-only assert()s, then it seems some downstream distributions are not compiling with NDEBUG defined (this appears to be highlighted in #670). If they're instead intended to flag critical "this is broken" behavior, then this particular case needs to be updated. I'm interested to understand whether the assert() calls are guarding critical cases I'm not seeing.

I'm happy to contribute my patch, but it feels like it might be the wrong approach for the project.

tempusfrangit commented 6 months ago

Reproduction steps: add load to nchan with the rough config Nick provided, then request a non-existent channel with an invalid id in the 'Last-Event-ID' header. I found it was easier to reproduce if I interrupted the connection before the 30-second timeout. If nchan had no load, it was much harder to get a crash. Not providing a last-event-id seems to preclude the assert being hit.
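
For illustration, such a request against the subscriber location in Nick's config might look roughly like the following; the hostname, channel id, and message id are placeholders, not values from our environment:

```bash
# Hypothetical repro request: subscribe to a channel that has never been
# published to, passing a well-formed but non-existent message id via the
# Last-Event-ID header. Running many of these concurrently while the server
# is under publish load raises the odds of hitting the assert.
curl -H 'Last-Event-ID: 1700000000:0' 'http://localhost/v1/messages/nosuchchannel'
```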

Ostensibly fewer workers will amplify it.

This is with a built-from-source version as Nick outlined.

b-n commented 6 months ago

I've recently moved part of our setup to the latest version of nchan and have been looking at this error. I have found a way to reproduce it via Docker, after just a single message.

Dockerfile (with build instructions). Build with: `docker build -t nchan:msg-invalid .`

`Dockerfile`:

```Dockerfile
# Original source: https://gist.github.com/hermanbanken/96f0ff298c162a522ddbba44cad31081 (with tweaks)

#################
# Builder Image #
#################
# https://hub.docker.com/_/nginx - stable-alpine-slim
FROM nginx:1.24.0-alpine-slim AS builder

# nginx:alpine contains NGINX_VERSION environment variable which we can re-use, like so:
# ENV NGINX_VERSION 1.24.0

# Our NCHAN version
ENV NCHAN_VERSION 1.3.6

# Download sources
RUN wget "http://nginx.org/download/nginx-${NGINX_VERSION}.tar.gz" -O nginx.tar.gz && \
    wget "https://github.com/slact/nchan/archive/v${NCHAN_VERSION}.tar.gz" -O nchan.tar.gz

# For latest build deps, see https://github.com/nginxinc/docker-nginx/blob/master/stable/alpine/Dockerfile
RUN apk add --no-cache --virtual .build-deps \
    gcc \
    libc-dev \
    make \
    openssl-dev \
    pcre-dev \
    zlib-dev \
    linux-headers \
    libxslt-dev \
    gd-dev \
    geoip-dev \
    libedit-dev \
    bash \
    alpine-sdk \
    findutils

# Reuse same cli arguments as the nginx:alpine image used to build
RUN CONFARGS=$(nginx -V 2>&1 | sed -n -e 's/^.*arguments: //p') \
    mkdir -p /usr/src && \
    tar -zxC /usr/src -f nginx.tar.gz && \
    tar -xzvf "nchan.tar.gz" && \
    NCHANDIR="$(pwd)/nchan-${NCHAN_VERSION}" && \
    cd /usr/src/nginx-$NGINX_VERSION && \
    ./configure --with-compat $CONFARGS --add-dynamic-module=$NCHANDIR && \
    make && make install

################
# Final Image #
################
FROM nginx:1.24.0-alpine-slim

# Extract the dynamic module NCHAN from the builder image
COPY --from=builder /usr/local/nginx/modules/ngx_nchan_module.so /usr/local/nginx/modules/ngx_nchan_module.so

# Remove default nginx configs and add our base config
RUN rm /etc/nginx/conf.d/default.conf
COPY nginx.conf /etc/nginx/nginx.conf

CMD ["nginx", "-g", "daemon off;"]
```

nginx.conf

# Generated from: https://github.com/slact/nchan#conceptual-overview
load_module /usr/local/nginx/modules/ngx_nchan_module.so;

worker_processes 5;

events {
  worker_connections 1024;
}

http {  
  server {
    listen 80;

    location = /sub {
      nchan_subscriber websocket;
      nchan_channel_id foobar;
    }

    location = /pub {
      nchan_publisher;
      nchan_channel_id foobar;
      nchan_message_timeout 5s;
    }
  }
}

🧠 I think the nchan_message_timeout has something to do with this problem

How to reproduce

⚠️ I have the message timeout set rather short because I think it has something to do with the problem, so the output of some commands may vary with timing (e.g. querying the publisher endpoint for the current queue length, etc.)

  1. Run the container (even though nginx.conf above is copied during container build, I found mapping the file made it easier to test config changes)

    $ docker run -it -v "$(pwd)/nginx.conf:/etc/nginx/nginx.conf" -p 8000:80 nchan:msg-invalid
  2. Publish

    $ curl -X POST -d "hello" "http://localhost:8000/pub"
  3. Check enqueued:

    $ curl -X GET "http://localhost:8000/pub"
    queued messages: 1
    last requested: -1 sec. ago
    active subscribers: 0
    last message id: 1704197750:0

    Take note of last message id ☝️

  4. Wait 5 seconds, check again:

    $ curl --verbose -X GET "http://localhost:8000/pub"
    ...
    < HTTP/1.1 404 Not Found
    ...
  5. Connect to the websocket (I use https://www.npmjs.com/package/wscat)

    $ npx wscat -c "ws://localhost:8000/sub?last_message_id=1704197750:0"
    Connected (press CTRL+C to quit)
    > 

    ✅ Get a valid websocket connection. Ctrl+D or Ctrl+C to kill the connection. Nginx log:

    172.17.0.1 - - [02/Jan/2024:12:18:50 +0000] "GET /sub?last_event_id=1704197750:0 HTTP/1.1" 101 2 "-" "-"
  6. Retry the websocket (last_message_id needs to have some value)

    $ npx wscat -c "ws://localhost:8000/sub?last_message_id=1704197750:0"
    Connected (press CTRL+C to quit)
    Disconnected (code: 1006, reason: "")

Nginx log:

Assertion failed: spool->msg_status == MSG_INVALID (//nchan-1.3.6/src/store/spool.c: spool_fetch_msg: 479)
2024/01/02 12:22:34 [alert] 1#1: worker process 22 exited on signal 6 (core dumped)

Unknowns:

b-n commented 6 months ago

Another less consistent way to get the spool error:

  1. Post a message onto the queue

    $ curl -X POST -d "hello" "http://localhost:8000/pub"
  2. Listen to the websocket (note: using the current timestamp with `$(date +%s)`):

    $ npx wscat -c "ws://localhost:8000/sub?last_event_id=$(date +%s):0" -x quit

☝️ Run the above a couple of times, get the following log:

172.17.0.1 - - [02/Jan/2024:12:32:15 +0000] "POST /pub HTTP/1.1" 202 101 "-" "curl/8.5.0"
172.17.0.1 - - [02/Jan/2024:12:32:19 +0000] "GET /sub?last_event_id=1704198738:0 HTTP/1.1" 101 11 "-" "-"
172.17.0.1 - - [02/Jan/2024:12:32:21 +0000] "GET /sub?last_event_id=1704198740:0 HTTP/1.1" 101 11 "-" "-"
172.17.0.1 - - [02/Jan/2024:12:32:23 +0000] "GET /sub?last_event_id=1704198742:0 HTTP/1.1" 101 11 "-" "-"
172.17.0.1 - - [02/Jan/2024:12:32:24 +0000] "GET /sub?last_event_id=1704198744:0 HTTP/1.1" 101 11 "-" "-"
172.17.0.1 - - [02/Jan/2024:12:32:26 +0000] "GET /sub?last_event_id=1704198745:0 HTTP/1.1" 101 11 "-" "-"
Assertion failed: spool->msg_status == MSG_INVALID (//nchan-1.3.6/src/store/spool.c: spool_fetch_msg: 479)
2024/01/02 12:32:27 [alert] 1#1: worker process 20 exited on signal 6 (core dumped)
slact commented 5 months ago

Thanks for that @b-n , I'll have a fix shortly

slact commented 5 months ago

fixed in master (commit 7765ec14be78c32a3522c019b35b79cc1b585968 )

This fix touches a very hot code path, so it took some time to verify. I will tag a version in another week or so.

b-n commented 5 months ago

I've built a new image from bleeding edge (e.g. d73ab7a034b8fa9cb18d59bbf473ca65048054e8), and we're currently using it in production.

Good news: it works as it did before, with nothing in particular breaking. I was hoping this would fix the error we were seeing; however, we're still seeing it come in with production workloads:

Assertion failed: spool->msg_status == MSG_INVALID (//nchan-d73ab7a034b8fa9cb18d59bbf473ca65048054e8/src/store/spool.c: spool_fetch_msg: 477)
2024/01/26 11:45:36 [alert] 1#1: worker process 2100 exited on signal 6

It's very possible that my reproduction steps above actually found a different error that you managed to fix, though 🤷. Note: I haven't tried my reproduction steps on the master changes; I'm essentially just running the master branch in production and looking at the logs. I'll see if I can gather any other information and/or trigger points.

tempusfrangit commented 5 months ago

@b-n I found that it didn't require a message on the queue for the requested channel (in fact I think if the channel has a message there isn't a crash). (We aren't using websocket, but a more boring pub/sub).

The core issue is that someone is requesting an invalid message id (I think on a channel with nothing in it).

We have some users that poll ancient channels we will never publish to again (outside of a uuid collision), and it seems this is the triggering scenario. I also noticed that when the load on nchan dropped off, it was much, much less likely to trigger. There is some kind of weird racy condition. I wish I had more concrete repro steps for @slact to work from.

b-n commented 5 months ago

Yeah, I've noticed something similar too. For example, we're using this on a home-brewed deployment tool (that we're slowly refactoring) to show the statuses of development environments and stream the logs to the UI.

Some developers have the page pinned, so when their machine starts up, it just starts pulling the old messages from the last event time. In this case, we're using nchan.js on the front-end side. I even went as far as to force a page refresh every hour (meta refresh) and to override the localStorage value when the page loads. Of course, there are people that still have a page that has been inactive since before I implemented these workarounds, so it hasn't stopped people requesting old event ids. The last event id is still handy for us, since we want the client to reconnect if the connection drops.

I can confirm that everything looks to be working fine until someone sends a request with an old last_event_id that has expired (at least, that's when I see it happening). I'm not sure whether it needs to be a specific event that has already expired, or just any id in the past prior to an expired event. It also looks like when the worker does eventually get one of these old-message requests and dies, it takes all other connections with it, of course. That makes the log hard to read, since the order events are reported to nginx is not necessarily the order you'd expect. Sorry for the wall of text!

tempusfrangit commented 5 months ago

I can confirm it can be never-used/never-existing; my local repro steps used both expired and never-existing ids.

slact commented 5 months ago

@tempusfrangit could you please check the fix I posted to master? I believe this will fix your issue as well.

slact commented 5 months ago

@b-n welp, ok. Think you can get me a trace from a coredump?
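
(For reference, a backtrace can typically be extracted from such a worker core dump with gdb against the nginx binary; the paths below are illustrative and depend on where core files are written on your system.)

```bash
# Illustrative only: load the core file against the nginx binary that produced
# it, then print a full backtrace for every thread in batch mode.
gdb --batch -ex 'thread apply all bt full' /usr/sbin/nginx /var/cache/nginx/core.12345
```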

slact commented 5 months ago

@b-n @tempusfrangit nevermind i figured it out. will post a fix later today.

slact commented 5 months ago

Alright, fix pushed to master, with tests for both known ways to trigger the bug.

https://github.com/slact/nchan/blob/cc705a948cf6797f669b64649605f007f6f26a42/src/store/spool.c#L380-L381 No, it wasn't right. It wasn't right at all.

b-n commented 5 months ago

Ah nice! I'll give it a roll on Monday and let you know the results :smile:

Curious: if we were not using the in-memory version but the Redis-backed version, would we have skipped this issue altogether? (I was tempted to go through the code, but thought it's easier to just ask.)

tempusfrangit commented 5 months ago

@b-n we use the redis backed one so you'd still have hit it.

slact commented 5 months ago

@b-n Nchan uses the in-memory storage engine as a local cache layer for Redis, so yes this would affect Redis users too.

b-n commented 5 months ago

@slact Looks like there are some build errors showing up now

Compile error

```console
#8 18.79 cc -c -fPIC -pipe -O -W -Wall -Wpointer-arith -Wno-unused-parameter -Werror -g -I src/core -I src/event -I src/event/modules -I src/os/unix -I //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src -I objs -I src/http -I src/http/modules \
#8 18.79   -o objs/addon/subscribers/benchmark.o \
#8 18.79   //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src/subscribers/benchmark.c
#8 18.84 cc -c -fPIC -pipe -O -W -Wall -Wpointer-arith -Wno-unused-parameter -Werror -g -I src/core -I src/event -I src/event/modules -I src/os/unix -I //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src -I objs -I src/http -I src/http/modules \
#8 18.84   -o objs/addon/store/spool.o \
#8 18.84   //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src/store/spool.c
#8 18.88 //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src/store/spool.c: In function 'spool_fetch_msg_callback':
#8 18.88 //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src/store/spool.c:305:33: error: unused variable 'nuspool' [-Werror=unused-variable]
#8 18.88   305 |   subscriber_pool_t *spool, *nuspool;
#8 18.88       |                              ^~~~~~~
#8 18.88 //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src/store/spool.c:304:25: error: variable 'prev_status' set but not used [-Werror=unused-but-set-variable]
#8 18.88   304 |   nchan_msg_status_t prev_status;
#8 18.88       |                      ^~~~~~~~~~~
#8 19.01 cc1: all warnings being treated as errors
#8 19.01 make[1]: *** [objs/Makefile:1625: objs/addon/store/spool.o] Error 1
#8 19.01 make[1]: Leaving directory '/usr/src/nginx-1.24.0'
#8 19.01 make: *** [Makefile:10: build] Error 2
#8 ERROR: process "/bin/sh -c CONFARGS=$(nginx -V 2>&1 | sed -n -e 's/^.*arguments: //p') mkdir -p /usr/src && tar -zxC /usr/src -f nginx.tar.gz && tar -xzvf \"nchan.tar.gz\" && NCHANDIR=\"$(pwd)/nchan-${NCHAN_VERSION}\" && cd /usr/src/nginx-$NGINX_VERSION && ./configure --with-compat $CONFARGS --add-dynamic-module=$NCHANDIR && make && make install" did not complete successfully: exit code: 2
------
 > [builder 4/4] RUN CONFARGS=$(nginx -V 2>&1 | sed -n -e 's/^.*arguments: //p') mkdir -p /usr/src && tar -zxC /usr/src -f nginx.tar.gz && tar -xzvf "nchan.tar.gz" && NCHANDIR="$(pwd)/nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7" && cd /usr/src/nginx-1.24.0 && ./configure --with-compat $CONFARGS --add-dynamic-module=$NCHANDIR && make && make install:
#8 18.88 //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src/store/spool.c:305:33: error: unused variable 'nuspool' [-Werror=unused-variable]
#8 18.88   305 |   subscriber_pool_t *spool, *nuspool;
#8 18.88       |                              ^~~~~~~
#8 18.88 //nchan-bacb7b674899e37b56a277f9d4107d1b14696cd7/src/store/spool.c:304:25: error: variable 'prev_status' set but not used [-Werror=unused-but-set-variable]
#8 18.88   304 |   nchan_msg_status_t prev_status;
#8 18.88       |                      ^~~~~~~~~~~
#8 19.01 cc1: all warnings being treated as errors
#8 19.01 make[1]: *** [objs/Makefile:1625: objs/addon/store/spool.o] Error 1
#8 19.01 make[1]: Leaving directory '/usr/src/nginx-1.24.0'
#8 19.01 make: *** [Makefile:10: build] Error 2
------
Dockerfile:40
--------------------
  39 |     # Reuse same cli arguments as the nginx:alpine image used to build
  40 | >>> RUN CONFARGS=$(nginx -V 2>&1 | sed -n -e 's/^.*arguments: //p') \
  41 | >>>     mkdir -p /usr/src && \
  42 | >>>     tar -zxC /usr/src -f nginx.tar.gz && \
  43 | >>>     tar -xzvf "nchan.tar.gz" && \
  44 | >>>     NCHANDIR="$(pwd)/nchan-${NCHAN_VERSION}" && \
  45 | >>>     cd /usr/src/nginx-$NGINX_VERSION && \
  46 | >>>     ./configure --with-compat $CONFARGS --add-dynamic-module=$NCHANDIR && \
  47 | >>>     make && make install
--------------------
```

This is a build where NCHAN_VERSION = bacb7b674899e37b56a277f9d4107d1b14696cd7. (Note: when building from the reproduction Dockerfile, the `v` needs to be removed from the wget call, since `vGIT_HASH` isn't a valid archive path.)
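
Concretely, the download step in the builder stage becomes something like this when NCHAN_VERSION is set to a commit hash (a sketch of the adjustment, not a tested Dockerfile change):

```bash
# Same wget line as in the reproduction Dockerfile, minus the leading "v",
# since GitHub archive URLs for commit hashes have no "v" prefix.
wget "https://github.com/slact/nchan/archive/${NCHAN_VERSION}.tar.gz" -O nchan.tar.gz
```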

b-n commented 5 months ago

PR here to fix it @slact https://github.com/slact/nchan/pull/679

I have tested on my local docker build, and all compiles correctly after this small patch.

Also, if you'd like a GitHub Action that does builds on master to ensure they compile, I'm more than happy to contribute. Just let me know and I'll put together a PR.

b-n commented 4 months ago

@slact I finally got around to actually implementing the change. It looks like there are no more MSG_INVALID errors coming through for us.

I will keep 👀 on this thread/the next release, since our current build targets the commit currently on master. But yeah, good news 😄