slact / nchan

Fast, horizontally scalable, multiprocess pub/sub queuing server and proxy for HTTP, long-polling, Websockets and EventSource (SSE), powered by Nginx.
https://nchan.io/

Worker crash 'spool->msg_status == MSG_INVALID' #534

Open ivanovv opened 5 years ago

ivanovv commented 5 years ago

I've updated nchan from

nchan built from source, commit sha d65a5c3

to latest 1.2.6.

While we had some issues with the nginx worker crashing in the past, it's been quite stable for the last 3-4 months, with no crashes at all.

Now, just about 30 minutes after the version change, a worker crashed. Here are the details (though, as I used compiled deb packages, there are fewer details than usual):

$> lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:    14.04
Codename:   trusty
$> uname -a
Linux app.example.com 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$> cat /proc/version
Linux version 4.4.0-31-generic (buildd@lgw01-43) (gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3) ) #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016
$> nginx -V
nginx version: nginx/1.10.1
built with OpenSSL 1.0.1f 6 Jan 2014
TLS SNI support enabled
configure arguments: --with-cc-opt='-g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2' --with-ld-opt='-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now' --prefix=/usr/share/nginx --conf-path=/etc/nginx/nginx.conf --http-log-path=/var/log/nginx/access.log --error-log-path=/var/log/nginx/error.log --lock-path=/var/lock/nginx.lock --pid-path=/run/nginx.pid --modules-path=/usr/lib/nginx/modules --http-client-body-temp-path=/var/lib/nginx/body --http-fastcgi-temp-path=/var/lib/nginx/fastcgi --http-proxy-temp-path=/var/lib/nginx/proxy --http-scgi-temp-path=/var/lib/nginx/scgi --http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-debug --with-pcre-jit --with-ipv6 --with-http_ssl_module --with-http_stub_status_module --with-http_realip_module --with-http_auth_request_module --with-http_v2_module --with-http_dav_module --with-threads --with-http_addition_module --with-http_flv_module --with-http_geoip_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_image_filter_module --with-http_mp4_module --with-http_random_index_module --with-http_secure_link_module --with-http_sub_module --with-http_xslt_module --with-mail --with-mail_ssl_module --with-stream --with-stream_ssl_module --add-module=/build/nginx/debian/modules/headers-more-nginx-module --add-module=/build/nginx/debian/modules/nginx-auth-pam --add-module=/build/nginx/debian/modules/nginx-cache-purge --add-module=/build/nginx/debian/modules/nginx-dav-ext-module --add-module=/build/nginx/debian/modules/nginx-development-kit --add-module=/build/nginx/debian/modules/nginx-echo --add-module=/build/nginx/debian/modules/ngx-fancyindex --add-module=/build/nginx/debian/modules/nchan --add-module=/build/nginx/debian/modules/nginx-lua --add-module=/build/nginx/debian/modules/nginx-upload-progress --add-module=/build/nginx/debian/modules/nginx-upstream-fair --add-module=/build/nginx/debian/modules/ngx_http_substitutions_filter_module
$> curl -X GET -H "Accept: text/json" http://10.xxx.xxx.90/nchan_stub_status
total published messages: 1292
stored messages: 9
shared memory used: 32K
shared memory limit: 131072K
channels: 25
subscribers: 8
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 0
interprocess alerts in transit: 0
interprocess queued alerts: 0
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.6

nginx error log:

ker process: /build/nginx/debian/modules/nchan/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2019/06/29 13:12:37 [notice] 10880#10880: signal 17 (SIGCHLD) received
2019/06/29 13:12:37 [alert] 10880#10880: worker process 10881 exited on signal 6 (core dumped)
2019/06/29 13:12:37 [notice] 10880#10880: start worker process 12335
ker process: /build/nginx/debian/modules/nchan/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2019/06/29 13:15:05 [notice] 10880#10880: signal 17 (SIGCHLD) received
2019/06/29 13:15:05 [alert] 10880#10880: worker process 12335 exited on signal 6 (core dumped)
2019/06/29 13:15:05 [notice] 10880#10880: start worker process 12438
$> gdb /usr/sbin/nginx  CoreDump 
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.3) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/nginx...(no debugging symbols found)...done.
[New LWP 10881]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `nginx: worker process                           '.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f9e513e5c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) where
#0  0x00007f9e513e5c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f9e513e9028 in __GI_abort () at abort.c:89
#2  0x00007f9e513debf6 in __assert_fail_base (fmt=0x7f9e51533058 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x595248 "spool->msg_status == MSG_INVALID", 
    file=file@entry=0x594e90 "/build/nginx/debian/modules/nchan/src/store/spool.c", line=line@entry=479, 
    function=function@entry=0x5955d0 "spool_fetch_msg") at assert.c:92
#3  0x00007f9e513deca2 in __GI___assert_fail (assertion=0x595248 "spool->msg_status == MSG_INVALID", 
    file=0x594e90 "/build/nginx/debian/modules/nchan/src/store/spool.c", line=479, function=0x5955d0 "spool_fetch_msg") at assert.c:101
#4  0x000000000050e4d8 in ?? ()
#5  0x000000000050debf in ?? ()
#6  0x00000000005170a2 in ?? ()
#7  0x000000000051ac83 in nchan_memstore_handle_get_message_reply ()
#8  0x000000000051ae85 in ?? ()
#9  0x000000000051b1f4 in ?? ()
#10 0x000000000051ad71 in ?? ()
#11 0x000000000050e313 in ?? ()
#12 0x000000000050ec28 in ?? ()
#13 0x0000000000519d51 in ?? ()
#14 0x000000000050281c in nchan_subscriber_subscribe ()
#15 0x0000000000502bf5 in nchan_subscriber_authorize_subscribe_request ()
#16 0x00000000004ef756 in nchan_pubsub_handler ()
#17 0x0000000000461a4f in ngx_http_core_content_phase ()
#18 0x000000000045bf55 in ngx_http_core_run_phases ()
#19 0x0000000000467e83 in ngx_http_process_request ()
#20 0x0000000000468836 in ?? ()
#21 0x0000000000450489 in ?? ()
#22 0x0000000000446517 in ngx_process_events_and_timers ()
#23 0x000000000044de55 in ?? ()
#24 0x000000000044c940 in ngx_spawn_process ()
#25 0x000000000044e0c4 in ?? ()
#26 0x000000000044edff in ngx_master_process_cycle ()
#27 0x0000000000427c8f in main ()

FYI: this is the second of the two dumps; I still have this in core_pattern:

|/usr/share/apport/apport %p %s %c %d %P

I'm about to revert my version change for now, let me know if you need anything else.

binhanle commented 5 years ago

I get the same assertion error (spool->msg_status == MSG_INVALID) when I subscribe to a channel through EventSource under these conditions (a minimal config sketch follows the list):

1. the channel was already open
2. the channel has no messages
3. nchan_subscriber_first_message is set to -1
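
A minimal config along these lines should set up those conditions; this is only an illustrative sketch (the port, location paths, and channel id are made up, not taken from my actual setup):

worker_processes 1;
events {}
http {
  server {
    listen 8080;
    location = /sub {
      nchan_subscriber eventsource;
      nchan_channel_id $arg_id;
      nchan_subscriber_first_message -1;
    }
    location = /pub {
      nchan_publisher;
      nchan_channel_id $arg_id;
    }
  }
}

Then attach one EventSource subscriber to /sub?id=test so the channel is open, and attach a second one while the channel still has no messages.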

stromnet commented 4 years ago

Hi,

I have the same issue: 1.2.5 works, 1.2.6 crashes. Managed to reproduce it with the following minimal test case:

import requests

while True:
    try:
        print("Getting...")
        hdr = {
                "If-Modified-Since": "Fri, 29 Nov 2019 13:51:02 GMT",
                "If-None-Match": "-1"
            }
        r = requests.get("http://localhost:80/sub?id=1", timeout=0.5, headers=hdr)
        print("Response:", r)
    except requests.exceptions.ReadTimeout:
        print("Timeout")

The first GET simply times out. The second GET triggers the crash (Assertion failed: (spool->msg_status == MSG_INVALID), function spool_fetch_msg, file nchan/src/store/spool.c, line 479.)

The 0.5s timeout does not matter; the same thing happens with 5s.
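
For anyone without Python handy, the same two requests can be sent with curl; this is an untested translation of the script above (same headers, same URL):

curl -v --max-time 0.5 \
  -H 'If-Modified-Since: Fri, 29 Nov 2019 13:51:02 GMT' \
  -H 'If-None-Match: -1' \
  'http://localhost:80/sub?id=1'
# repeat the identical request; the second one is what coincided with the crash here
curl -v --max-time 0.5 \
  -H 'If-Modified-Since: Fri, 29 Nov 2019 13:51:02 GMT' \
  -H 'If-None-Match: -1' \
  'http://localhost:80/sub?id=1'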

stromnet commented 4 years ago

Seems to be introduced by https://github.com/slact/nchan/commit/82b766cd133c060542c2420b19c13d74de6cc0c6; reverting that commit stops the crashes.
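
For anyone wanting to try the same, the revert-and-rebuild looks roughly like this (the nginx source directory is illustrative, and the revert may need a manual conflict fix depending on your checkout):

cd nchan
git revert 82b766cd133c060542c2420b19c13d74de6cc0c6
cd ../nginx-1.16.1
./configure --add-module=../nchan
make && sudo make install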

slact commented 4 years ago

@stromnet: Thanks, I'll have this fixed shortly

stromnet commented 4 years ago

For the record, this happens with Redis enabled too (the previous test was without Redis; same config as below, but with the Redis parts removed).

nginx config used:

worker_processes  1;
daemon off;

error_log /dev/stdout debug;

events {
  worker_connections  100;
}

http {
  upstream my_redis_server {
    nchan_redis_server 127.0.0.1;
  }
  include       mime.types;
  default_type  application/octet-stream;
  sendfile        on;
  #tcp_nopush     on;
  #keepalive_timeout  0;
  keepalive_timeout  65;
  server {
    listen       80;
    server_name  localhost;
    location = /sub {
      nchan_subscriber;
      nchan_channel_id $arg_id;
      nchan_redis_pass my_redis_server;
    }
    location = /pub {
      nchan_publisher;
      nchan_channel_id $arg_id;
      nchan_redis_pass my_redis_server;
    }
  }
}

blackjid commented 4 years ago

I'm having the same issue with 1.2.7. I'm not using Redis, so I tried compiling from source with the commit mentioned by @stromnet in https://github.com/slact/nchan/issues/534#issuecomment-559804806 reverted, and it looks like it works fine.

These are my current stats when it breaks; it crashed right away, within just a few seconds:

total published messages: 559
stored messages: 263
shared memory used: 224K
shared memory limit: 131072K
channels: 480
subscribers: 939
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 35896
interprocess alerts in transit: 0
interprocess queued alerts: 0
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.7

I haven't been able to reproduce it in a local environment, just with a ton of messages in production.

ivanovv commented 4 years ago

@blackjid can you share your pub location?

Mine looks like this:

  location ~ /pub/(.*)$ {
    nchan_publisher;
    nchan_channel_id $1;
    nchan_message_buffer_length 0;
    nchan_message_timeout 30s;
    nchan_deflate_message_for_websocket on;
  }

and I guess nchan_message_buffer_length 0; has something to do with this bug

joshbeckman commented 4 years ago

We are seeing this issue now in the latest NCHAN/NGINX version with a very simple configuration.

Configuration

daemon off;
worker_processes auto;
error_log logs/access.log notice;

events {
    use epoll;
    accept_mutex on;
    worker_connections 1024;
}

http {
    gzip on;
    gzip_comp_level 6;
    gzip_min_length 512;
    gzip_types text/plain text/css application/json application/x-javascript text/xml application/xml application/xml+rss text/javascript;
    gzip_vary on;
    gzip_proxied any;

    server_tokens off;

    log_format l2met 'measure#nginx.service=$request_time request_id=$http_x_request_id';
    access_log logs/access.log l2met;

    include mime.types;
    default_type application/octet-stream;
    sendfile on;

    client_body_timeout 5;

    nchan_shared_memory_size 256M;

    server {
        listen <PORT>;
        server_name _;
        charset UTF-8;
        port_in_redirect off;
        keepalive_timeout 5;
        client_max_body_size 1M;

        location /nchan_stub_status {
            nchan_stub_status;
        }

        location = /pub {
            nchan_publisher;
            nchan_channel_id $arg_id;
        }

        location = /multisub {
            nchan_subscriber;
            nchan_channel_id $arg_id;
            nchan_channel_id_split_delimiter ",";
            nchan_websocket_ping_interval 20;
        }
    }
}

We have relatively low traffic on this, so after 0-5 hours of good, error-free usage, this is our stub status:

# Tue Apr 28 15:59:36 UTC 2020
total published messages: 205
stored messages: 109
shared memory used: 588K
shared memory limit: 262144K
channels: 53
subscribers: 17
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 27056
interprocess alerts in transit: 0
interprocess queued alerts: 0
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.7

Then, in our error logs (resulting in #321):

Apr 28 11:00:28  app/web.1 rocess: nchan-1.2.7/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
Apr 28 11:00:28  app/web.1 2020/04/28 16:00:28 [notice] 44#0: signal 17 (SIGCHLD) received from 47
Apr 28 11:00:28  app/web.1 2020/04/28 16:00:28 [alert] 44#0: worker process 47 exited on signal 6
Apr 28 11:00:28  app/web.1 2020/04/28 16:00:28 [notice] 44#0: start worker process 56
Apr 28 11:00:28  app/web.1 2020/04/28 16:00:28 [notice] 44#0: signal 29 (SIGIO) received
Apr 28 11:00:49  app/web.1 2020/04/28 16:00:49 [error] 48#0: MEMSTORE:03: multimsg 000055FFA4B8B090 timeout!!
Apr 28 11:00:49  app/web.1 2020/04/28 16:00:49 [error] 48#0: MEMSTORE:03: multimsg 000055FFA4B2A5F0 timeout!!
Apr 28 11:00:50  app/web.1 2020/04/28 16:00:50 [error] 50#0: MEMSTORE:05: multimsg 000055FFA4B2B3B0 timeout!!
Apr 28 11:01:52  app/web.1 2020/04/28 16:01:52 [error] 50#0: MEMSTORE:05: multimsg 000055FFA4BAA7D0 timeout!!
Apr 28 11:02:21  app/web.1 2020/04/28 16:02:20 [error] 48#0: MEMSTORE:03: tried adding WAITING chanhead 000055FFA4BA4E30 /location_371 to chanhead_gc. why?
Apr 28 11:02:42  app/web.1 2020/04/28 16:02:42 [error] 49#0: MEMSTORE:04: multimsg 000055FFA4B2B670 timeout!!
Apr 28 11:03:01  app/web.1 2020/04/28 16:03:01 [error] 49#0: MEMSTORE:04: multimsg 000055FFA4B90BC0 timeout!!
Apr 28 11:03:13  app/web.1 2020/04/28 16:03:12 [error] 49#0: MEMSTORE:04: multimsg 000055FFA4B91C40 timeout!!
... and this continues until restart

Causing interprocess alerts to pile up:

# Tue Apr 28 16:00:36 UTC 2020
total published messages: 207
stored messages: 97
shared memory used: 564K
shared memory limit: 262144K
channels: 54
subscribers: 20
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 27467
interprocess alerts in transit: 24
interprocess queued alerts: 24
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.7

# Tue Apr 28 16:01:36 UTC 2020
total published messages: 209
stored messages: 98
shared memory used: 572K
shared memory limit: 262144K
channels: 56
subscribers: 20
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 27467
interprocess alerts in transit: 27
interprocess queued alerts: 27
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.7

Happy to share additional info. Thanks!

ivanovv commented 4 years ago

@andjosh you can at least mitigate this with

worker_processes 1;

That way, if a worker dies, it will be restarted by the master, and I bet nobody will even notice.

That works well until you hit the CPU/connection limit of a single worker (it either uses 100% of one core, or has as many open connections as worker_connections and can't accept new ones). Just for reference, we are speaking about 15K+ connections here, so you will have some room to grow.

Also, thanks for sharing the config; it shows that this bug has nothing to do with

    nchan_message_buffer_length 0;

as I thought before.

joshbeckman commented 4 years ago

@ivanovv Unfortunately that does not seem to help.

Just tried that now, and all that happens is that the one worker process continuously crashes every second:

....
Apr 28 16:16:14  app/web.1 rocess: nchan-1.2.7/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
Apr 28 16:16:14  app/web.1 2020/04/28 21:16:14 [notice] 44#0: signal 17 (SIGCHLD) received from 327
Apr 28 16:16:14  app/web.1 2020/04/28 21:16:14 [alert] 44#0: worker process 327 exited on signal 6
Apr 28 16:16:14  app/web.1 2020/04/28 21:16:14 [notice] 44#0: start worker process 328
Apr 28 16:16:14  app/web.1 rocess: nchan-1.2.7/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
Apr 28 16:16:14  app/web.1 2020/04/28 21:16:14 [notice] 44#0: signal 17 (SIGCHLD) received from 328
Apr 28 16:16:14  app/web.1 2020/04/28 21:16:14 [alert] 44#0: worker process 328 exited on signal 6
Apr 28 16:16:14  app/web.1 2020/04/28 21:16:14 [notice] 44#0: start worker process 329
Apr 28 16:16:15  app/web.1 rocess: nchan-1.2.7/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
Apr 28 16:16:15  app/web.1 2020/04/28 21:16:14 [notice] 44#0: signal 17 (SIGCHLD) received from 329
Apr 28 16:16:15  app/web.1 2020/04/28 21:16:14 [alert] 44#0: worker process 329 exited on signal 6
Apr 28 16:16:15  app/web.1 2020/04/28 21:16:14 [notice] 44#0: start worker process 330
... and on and on

ivanovv commented 4 years ago

@andjosh interesting, I was under the impression that a worker restart would clear things up.

At least I can say that with worker_processes 1 you are guaranteed not to find nchan stuck in a non-working state.

As for this issue, I suggest using nchan 1.2.2.

masarykadam commented 4 years ago

Any news on this issue? Right now nchan is unusable because of this; nginx has to be restarted every 5 minutes because push gets stuck. Error message:

ker process: ../nchan-1.2.7/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2020/10/01 13:52:06 [alert] 15265#15265: worker process 15267 exited on signal 6

Configuration:

#PUSH publish
location = /publish {
    allow 127.0.0.1;
    deny all;
    nchan_publisher;
    nchan_channel_group $arg_group;
    nchan_channel_id $arg_id;
    nchan_message_buffer_length 0;
    nchan_store_messages off;
}

# PUSH subscriber
location = /push {
    nchan_subscriber eventsource;
    nchan_eventsource_ping_interval 60;
    nchan_subscriber_first_message newest;
    nchan_channel_group $arg_group;
    nchan_channel_id $arg_id;
}

ivanovv commented 4 years ago

Why not revert to an older version of nchan? That's a much better solution than restarting nginx.

masarykadam commented 4 years ago

Why not revert to an older version of nchan? That's a much better solution than restarting nginx.

Older versions do not support the EventSource ping, which is required for us; otherwise devices will time out constantly.

ivanovv commented 4 years ago

Then the only thing I can suggest is going the same route described in https://github.com/slact/nchan/issues/534#issuecomment-559804806.

xixixihaha commented 3 years ago

I hit the issue on v1.2.6 and v1.2.7. Has the root cause been identified? Will you provide the fix in the next release? Thanks so much! @slact

ivanovv commented 3 years ago

Will you provide the fix in the next release?

As I understand it, there will be no next release; Leo is building a standalone version of nchan.

This issue is more than 1 year old, you see.

binhbo12456 commented 3 years ago

Still an error in nchan 1.2.8.

meyt commented 3 years ago

I can confirm the issue exists on v1.2.8 and v1.2.10, with the same logs:

/root/nginxinstall/nchan-1.2.10/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2021/09/16 17:56:17 [alert] 107776#107776: worker process 107777 exited on signal 6 (core dumped)

Using a single worker process is not a good option; every request like the one @stromnet mentioned makes the nginx server fall into a crash loop.

This config works for me and fixed the issue by stripping those headers:

  location ~ /suborig/([:\w]+)$ {
    nchan_subscriber eventsource;
    nchan_channel_id $1;
    nchan_store_messages off;
    nchan_message_timeout 500ms;
    nchan_eventsource_ping_interval 1;
    nchan_eventsource_ping_data 'pong';
    nchan_subscriber_timeout 1h;
  }

  location ~ /sub/([:\w]+)$ {
    proxy_pass http://127.0.0.1/suborig/$1;
    proxy_set_header If-None-Match '';
    proxy_set_header If-Modified-Since '';
    proxy_set_header Host $http_host;
    proxy_set_header Connection '';
    proxy_http_version 1.1;
    proxy_redirect off;
    chunked_transfer_encoding off;
    proxy_no_cache 1;
    proxy_cache_bypass 1;
    proxy_buffering off;
  }

@slact Can we hope for this to be fixed before 2022?

meyt commented 3 years ago

My previous workaround only covers the one situation described by @stromnet. We got another, unknown type of request that raises this error again!

@slact An assertion failure on a single request/connection must not break the whole web server.

binhbo12456 commented 2 years ago

Any update?

harithay commented 2 years ago

Still facing this issue. I have compiled my nginx (1.20.2) with the following command:

./configure  --prefix=/usr/local/nginx --add-module=plugins/headers-more-nginx-module-0.33 --add-module=plugins/nchan-1.2.15 --with-zlib=plugins/zlib-1.2.11 --with-http_ssl_module 

Error output in the error log:

2022/01/04 04:09:55 [error] 2242#0: SUB:WEBSOCKET:ws_recv NOT OK when receiving payload
2022/01/04 04:09:55 [error] 2242#0: SUB:WEBSOCKET:ws_recv NOT OK when receiving payload
plugins/nchan-1.2.15/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2022/01/04 04:13:57 [alert] 1#0: worker process 2242 exited on signal 6 (core dumped)
plugins/nchan-1.2.15/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2022/01/04 04:13:59 [alert] 1#0: worker process 2243 exited on signal 6 (core dumped)
plugins/nchan-1.2.15/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2022/01/04 04:14:01 [alert] 1#0: worker process 2244 exited on signal 6 (core dumped)
plugins/nchan-1.2.15/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2022/01/04 04:14:03 [alert] 1#0: worker process 2245 exited on signal 6 (core dumped)

My server config:


# the following resolved to 4 child processes per core on this server
worker_processes auto;

pid logs/nginx.pid;
worker_rlimit_nofile 1048576;

events {
    worker_connections 48000;
}

http {
    include mime.types;
    default_type application/octet-stream;
    sendfile on;

    keepalive_timeout 65;
    gzip on;

    proxy_cache off;
    proxy_read_timeout 300s;
    nchan_shared_memory_size 2048M;

    upstream redis_server {
        nchan_redis_server redis;
    }
    server {
        resolver 127.0.0.11;
        server_name *.xconvert.com default_server localhost;
        listen 80;
        listen 443 ssl;
        ssl_certificate /usr/local/nginx/bundle.cer;
        ssl_certificate_key /usr/local/nginx/pk.key;
        client_max_body_size 2000M;
        client_body_buffer_size 50M;
        #charset koi8-r;
        access_log off;
        error_log syslog:server=ncf-ec2-east-14-hv.xconvert.com:1516 warn;

        location ~ /smq/([\w-_]+)$ {
            nchan_subscriber;
            nchan_channel_group user_group;
            nchan_channel_id $1; 
            nchan_redis_pass redis_server;
            nchan_message_buffer_length 300;
            nchan_message_timeout 60m;
        }
        location ~ /_channel_id/([\w-_]+)$ {
            nchan_publisher;
            nchan_channel_group user_group;
            nchan_channel_id $1;
            nchan_redis_pass redis_server;
            nchan_message_buffer_length 300;
            nchan_message_timeout 60m;
        }

        .... Other stuff
    }

}

tpneumat commented 2 years ago

Seems this project may be dead. I found that reverting to nchan 1.2.5 was the only remedy for this and for another update problem: https://github.com/slact/nchan/issues/578

andrisi commented 2 years ago

@slact The current version is 1.2.15, but I still get this error when using it as a dynamic module with nginx 1.21.6. It was like this with 1.20.* too. On an Ubuntu server, built from source.

2022/03/12 16:51:29 [info] 46964#46964: Using 131072KiB of shared memory for nchan in /etc/nginx/nginx.conf:66
ker process: ../nchan/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2022/03/12 16:51:30 [alert] 46977#46977: worker process 46978 exited on signal 6 (core dumped)
2022/03/12 16:51:30 [alert] 46977#46977: worker process 47016 exited on signal 11 (core dumped)
2022/03/12 16:51:31 [alert] 46977#46977: worker process 47020 exited on signal 11 (core dumped)

andrisi commented 2 years ago

nchan 1.2.5 works with nginx 1.21.6 as a dynamic module

romanspotpower commented 2 years ago

@slact I've encountered the same issue. Version 1.2.5 works well, but when I switch to the newest version I get this error in error.log:

nginx: worker process: ../nchan/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2022/03/13 16:22:54 [alert] 53#53: worker process 61 exited on signal 6

And nchan is not sending any messages to subscribers. Any solution to this? I want to use the newest version but it seems that only version 1.2.5 works.

BusterNeece commented 1 year ago

We're experiencing this issue with the version of nchan that shipped with Ubuntu 22.04 (Jammy). This has inadvertently caused us to lose stability when upgrading to the newest version of Ubuntu. Currently, the only workaround appears to be building an older version of the module from source (see the sketch below). It would be very preferable if this were fixed in this repo.
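
Roughly what that looks like (versions and paths are illustrative; the module has to be built against the same nginx version and configure arguments as the packaged nginx, which nginx -V shows):

git clone https://github.com/slact/nchan.git
cd nchan && git checkout v1.2.5 && cd ..        # or whichever older tag/commit still works for you
wget https://nginx.org/download/nginx-1.18.0.tar.gz && tar xzf nginx-1.18.0.tar.gz
cd nginx-1.18.0
./configure --add-dynamic-module=../nchan       # plus the args from 'nginx -V'
make modules
# copy objs/ngx_nchan_module.so over the packaged module and reload nginx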

Kurasami commented 1 year ago

How is this still not fixed after 4 years...

slact commented 1 year ago

@Kurasami Because I fixed the issue I was seeing that led to this crash. If there is another one, I have been unable to replicate it. If you want this issue fixed, I need the following information:

b-n commented 9 months ago

@slact I've found a way to reproduce (I think) this crash. I've documented it here: https://github.com/slact/nchan/issues/676#issuecomment-1873965660

Only commenting here to keep these issues linked (I think they are related).

andrisi commented 9 months ago

I switched away from the version included in Ubuntu/Debian, built from source, and the problem went away 🤔