ivanovv opened this issue 5 years ago
I get the same assertion error (spool->msg_status == MSG_INVALID) when I subscribe to a channel through EventSource under these conditions:
1) the channel was already open
2) the channel has no messages
3) nchan_subscriber_first_message is set to -1
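For concreteness, a subscriber request matching those three conditions can be sketched in Python; the headers mirror the repro further down in this thread, where the client resumes from a message id via If-Modified-Since/If-None-Match. The URL is hypothetical and the request is only built here, not sent:

```python
from urllib.request import Request

# Hypothetical /sub endpoint; channel id passed as a query argument.
req = Request(
    "http://localhost:80/sub?id=1",
    headers={
        "Accept": "text/event-stream",          # EventSource subscriber
        "If-Modified-Since": "Fri, 29 Nov 2019 13:51:02 GMT",
        "If-None-Match": "-1",                  # matches first_message -1
    },
)

# urllib stores header names with str.capitalize(), so look up
# "If-none-match" rather than "If-None-Match".
print(req.get_header("If-none-match"))  # -1
```

Whether this exact request hits the assertion depends on the channel state described above; it only illustrates what the subscriber sends on the wire.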
Hi,
I have the same issue. 1.2.5 works, 1.2.6 crashes. I managed to reproduce it with the following minimal test case:
import requests

while True:
    try:
        print("Getting...")
        hdr = {
            "If-Modified-Since": "Fri, 29 Nov 2019 13:51:02 GMT",
            "If-None-Match": "-1"
        }
        r = requests.get("http://localhost:80/sub?id=1", timeout=0.5, headers=hdr)
        print("Response:", r)
    except requests.exceptions.ReadTimeout:
        print("Timeout")
The first GET returns directly (timeout).
The second GET triggers the crash: Assertion failed: (spool->msg_status == MSG_INVALID), function spool_fetch_msg, file nchan/src/store/spool.c, line 479.
The 0.5s timeout does not matter; the same thing happens with 5s.
This seems to have been introduced by https://github.com/slact/nchan/commit/82b766cd133c060542c2420b19c13d74de6cc0c6; reverting that commit stops the crash.
@stromnet: Thanks, I'll have this fixed shortly
For the record, this happens with Redis enabled too (the previous test was without Redis; same config as below but with the Redis parts removed).
nginx config used:
worker_processes 1;
daemon off;
error_log /dev/stdout debug;

events {
    worker_connections 100;
}

http {
    upstream my_redis_server {
        nchan_redis_server 127.0.0.1;
    }

    include mime.types;
    default_type application/octet-stream;
    sendfile on;
    #tcp_nopush on;
    #keepalive_timeout 0;
    keepalive_timeout 65;

    server {
        listen 80;
        server_name localhost;

        location = /sub {
            nchan_subscriber;
            nchan_channel_id $arg_id;
            nchan_redis_pass my_redis_server;
        }
        location = /pub {
            nchan_publisher;
            nchan_channel_id $arg_id;
            nchan_redis_pass my_redis_server;
        }
    }
}
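For completeness, with the config above a message is published by POSTing to the /pub location, and the "open channel with no messages" state only requires a subscriber to connect before anything is published. A sketch of the publish request, built with the standard library but not actually sent (URL and body are hypothetical):

```python
from urllib.request import Request

# Hypothetical publish to the /pub location above; the request body
# becomes the message delivered to subscribers of channel id=1.
pub = Request(
    "http://localhost:80/pub?id=1",
    data=b"hello subscribers",
    method="POST",
)
print(pub.get_method(), pub.get_full_url())
# POST http://localhost:80/pub?id=1
```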
I'm having the same issue with 1.2.7.
I'm not using Redis, so I tried compiling from source with the commit mentioned by @stromnet in https://github.com/slact/nchan/issues/534#issuecomment-559804806 reverted, and it looks like it works fine.
These are my current stats at the point it breaks; it crashed right away, in just a few seconds:
total published messages: 559
stored messages: 263
shared memory used: 224K
shared memory limit: 131072K
channels: 480
subscribers: 939
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 35896
interprocess alerts in transit: 0
interprocess queued alerts: 0
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.7
I haven't been able to reproduce it in a local environment, just with a ton of messages in production.
@blackjid can you share your pub location? Mine looks like so:
location ~ /pub/(.*)$ {
    nchan_publisher;
    nchan_channel_id $1;
    nchan_message_buffer_length 0;
    nchan_message_timeout 30s;
    nchan_deflate_message_for_websocket on;
}
and I guess nchan_message_buffer_length 0; has something to do with this bug
We are seeing this issue now in the latest NCHAN/NGINX version with a very simple configuration.
daemon off;
worker_processes auto;
error_log logs/access.log notice;

events {
    use epoll;
    accept_mutex on;
    worker_connections 1024;
}

http {
    gzip on;
    gzip_comp_level 6;
    gzip_min_length 512;
    gzip_types text/plain text/css application/json application/x-javascript text/xml application/xml application/xml+rss text/javascript;
    gzip_vary on;
    gzip_proxied any;
    server_tokens off;
    log_format l2met 'measure#nginx.service=$request_time request_id=$http_x_request_id';
    access_log logs/access.log l2met;
    include mime.types;
    default_type application/octet-stream;
    sendfile on;
    client_body_timeout 5;
    nchan_shared_memory_size 256M;

    server {
        listen <PORT>;
        server_name _;
        charset UTF-8;
        port_in_redirect off;
        keepalive_timeout 5;
        client_max_body_size 1M;

        location /nchan_stub_status {
            nchan_stub_status;
        }
        location = /pub {
            nchan_publisher;
            nchan_channel_id $arg_id;
        }
        location = /multisub {
            nchan_subscriber;
            nchan_channel_id $arg_id;
            nchan_channel_id_split_delimiter ",";
            nchan_websocket_ping_interval 20;
        }
    }
}
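For context on the /multisub location above: nchan_channel_id_split_delimiter "," makes a single subscriber request fan out to several channels at once, which is presumably what the MEMSTORE "multimsg" entries in the logs refer to. Roughly, in illustrative Python (an assumption about behavior, not nchan's actual C code):

```python
def split_channel_ids(arg_id, delimiter=","):
    # One /multisub?id=a,b,c request subscribes to each id separately;
    # empty segments are skipped for illustration.
    return [cid for cid in arg_id.split(delimiter) if cid]

print(split_channel_ids("alerts,chat,presence"))
# ['alerts', 'chat', 'presence']
```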
We have relatively low traffic on this, so after 0-5 hours of good, error-free usage, this is our stub status:
# Tue Apr 28 15:59:36 UTC 2020
total published messages: 205
stored messages: 109
shared memory used: 588K
shared memory limit: 262144K
channels: 53
subscribers: 17
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 27056
interprocess alerts in transit: 0
interprocess queued alerts: 0
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.7
Then, in our error logs (resulting in #321):
Apr 28 11:00:28 app/web.1 rocess: nchan-1.2.7/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
Apr 28 11:00:28 app/web.1 2020/04/28 16:00:28 [notice] 44#0: signal 17 (SIGCHLD) received from 47
Apr 28 11:00:28 app/web.1 2020/04/28 16:00:28 [alert] 44#0: worker process 47 exited on signal 6
Apr 28 11:00:28 app/web.1 2020/04/28 16:00:28 [notice] 44#0: start worker process 56
Apr 28 11:00:28 app/web.1 2020/04/28 16:00:28 [notice] 44#0: signal 29 (SIGIO) received
Apr 28 11:00:49 app/web.1 2020/04/28 16:00:49 [error] 48#0: MEMSTORE:03: multimsg 000055FFA4B8B090 timeout!!
Apr 28 11:00:49 app/web.1 2020/04/28 16:00:49 [error] 48#0: MEMSTORE:03: multimsg 000055FFA4B2A5F0 timeout!!
Apr 28 11:00:50 app/web.1 2020/04/28 16:00:50 [error] 50#0: MEMSTORE:05: multimsg 000055FFA4B2B3B0 timeout!!
Apr 28 11:01:52 app/web.1 2020/04/28 16:01:52 [error] 50#0: MEMSTORE:05: multimsg 000055FFA4BAA7D0 timeout!!
Apr 28 11:02:21 app/web.1 2020/04/28 16:02:20 [error] 48#0: MEMSTORE:03: tried adding WAITING chanhead 000055FFA4BA4E30 /location_371 to chanhead_gc. why?
Apr 28 11:02:42 app/web.1 2020/04/28 16:02:42 [error] 49#0: MEMSTORE:04: multimsg 000055FFA4B2B670 timeout!!
Apr 28 11:03:01 app/web.1 2020/04/28 16:03:01 [error] 49#0: MEMSTORE:04: multimsg 000055FFA4B90BC0 timeout!!
Apr 28 11:03:13 app/web.1 2020/04/28 16:03:12 [error] 49#0: MEMSTORE:04: multimsg 000055FFA4B91C40 timeout!!
... and this continues until restart
Causing interprocess alerts to pile up:
# Tue Apr 28 16:00:36 UTC 2020
total published messages: 207
stored messages: 97
shared memory used: 564K
shared memory limit: 262144K
channels: 54
subscribers: 20
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 27467
interprocess alerts in transit: 24
interprocess queued alerts: 24
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.7
# Tue Apr 28 16:01:36 UTC 2020
total published messages: 209
stored messages: 98
shared memory used: 572K
shared memory limit: 262144K
channels: 56
subscribers: 20
redis pending commands: 0
redis connected servers: 0
total interprocess alerts received: 27467
interprocess alerts in transit: 27
interprocess queued alerts: 27
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.7
Happy to share additional info. Thanks!
@andjosh you can at least mitigate this with
worker_processes 1;
That way, if the worker dies, it will be restarted by the master, and I bet nobody will even notice.
That works well until you hit the CPU/connection limit of one worker (it either uses 100% of one core or has open connections equal to worker_connections) and can no longer handle new connections. Just for reference, we are speaking about 15K+ connections here, so you will have some room to grow.
Also, thanks for sharing the config; it shows that this bug has nothing to do with nchan_message_buffer_length 0; like I thought before.
@ivanovv Unfortunately that does not seem to help.
Just tried that now, and all that happens is that the one worker process continuously crashes every second:
....
Apr 28 16:16:14 app/web.1 rocess: nchan-1.2.7/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
Apr 28 16:16:14 app/web.1 2020/04/28 21:16:14 [notice] 44#0: signal 17 (SIGCHLD) received from 327
Apr 28 16:16:14 app/web.1 2020/04/28 21:16:14 [alert] 44#0: worker process 327 exited on signal 6
Apr 28 16:16:14 app/web.1 2020/04/28 21:16:14 [notice] 44#0: start worker process 328
Apr 28 16:16:14 app/web.1 rocess: nchan-1.2.7/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
Apr 28 16:16:14 app/web.1 2020/04/28 21:16:14 [notice] 44#0: signal 17 (SIGCHLD) received from 328
Apr 28 16:16:14 app/web.1 2020/04/28 21:16:14 [alert] 44#0: worker process 328 exited on signal 6
Apr 28 16:16:14 app/web.1 2020/04/28 21:16:14 [notice] 44#0: start worker process 329
Apr 28 16:16:15 app/web.1 rocess: nchan-1.2.7/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
Apr 28 16:16:15 app/web.1 2020/04/28 21:16:14 [notice] 44#0: signal 17 (SIGCHLD) received from 329
Apr 28 16:16:15 app/web.1 2020/04/28 21:16:14 [alert] 44#0: worker process 329 exited on signal 6
Apr 28 16:16:15 app/web.1 2020/04/28 21:16:14 [notice] 44#0: start worker process 330
... and on and on
@andjosh interesting, I was under the impression that a worker restart should clear things up.
At least I can say that worker_processes 1 guarantees you won't find nchan stuck in a non-working state.
As for this issue, I suggest using nchan 1.2.2.
Any news on this issue? Right now nchan is unusable because of this; nginx has to be restarted every 5 minutes because push gets stuck. Error message:
ker process: ../nchan-1.2.7/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2020/10/01 13:52:06 [alert] 15265#15265: worker process 15267 exited on signal 6
Configuration:
# PUSH publisher
location = /publish {
    allow 127.0.0.1;
    deny all;
    nchan_publisher;
    nchan_channel_group $arg_group;
    nchan_channel_id $arg_id;
    nchan_message_buffer_length 0;
    nchan_store_messages off;
}

# PUSH subscriber
location = /push {
    nchan_subscriber eventsource;
    nchan_eventsource_ping_interval 60;
    nchan_subscriber_first_message newest;
    nchan_channel_group $arg_group;
    nchan_channel_id $arg_id;
}
Why not revert to an older version of nchan? That's a much better solution than restarting nginx.
An older version does not support the EventSource ping, which is required for us; otherwise devices will time out constantly.
Then the only thing I can suggest is going the same route described in https://github.com/slact/nchan/issues/534#issuecomment-559804806
I met the issue on v1.2.6 and v1.2.7. Is the root cause identified? Will you provide the fix in the next release? Thanks so much! @slact
Will you provide the fix in the next release?
As I understand it, there will be no next release; Leo is building a standalone version of nchan. This issue is more than a year old, you see.
Still an error in nchan 1.2.8.
I can confirm the issue exists on v1.2.8 and v1.2.10, with the same logs:
/root/nginxinstall/nchan-1.2.10/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2021/09/16 17:56:17 [alert] 107776#107776: worker process 107777 exited on signal 6 (core dumped)
Using worker_processes 1 is not a good option; every request like the one @stromnet mentioned makes the nginx server fall into a crash loop.
This config works for me and fixed the issue by removing those headers:
location ~ /suborig/([:\w]+)$ {
    nchan_subscriber eventsource;
    nchan_channel_id $1;
    nchan_store_messages off;
    nchan_message_timeout 500ms;
    nchan_eventsource_ping_interval 1;
    nchan_eventsource_ping_data 'pong';
    nchan_subscriber_timeout 1h;
}

location ~ /sub/([:\w]+)$ {
    proxy_pass http://127.0.0.1/suborig/$1;
    proxy_set_header If-None-Match '';
    proxy_set_header If-Modified-Since '';
    proxy_set_header Host $http_host;
    proxy_set_header Connection '';
    proxy_http_version 1.1;
    proxy_redirect off;
    chunked_transfer_encoding off;
    proxy_no_cache 1;
    proxy_cache_bypass 1;
    proxy_buffering off;
}
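The workaround above relies on blanking the two conditional headers before the request reaches nchan. If you front nchan with an application-level proxy instead of nginx, the same idea can be sketched like this (a hypothetical helper mirroring the proxy_set_header lines, not part of nchan):

```python
# Headers that the workaround above strips before forwarding to the
# subscriber endpoint; compared case-insensitively like HTTP requires.
CONDITIONAL_HEADERS = {"if-none-match", "if-modified-since"}

def strip_conditional(headers):
    """Return a copy of the header dict with conditional headers removed."""
    return {k: v for k, v in headers.items()
            if k.lower() not in CONDITIONAL_HEADERS}

print(strip_conditional({
    "If-None-Match": "-1",
    "Accept": "text/event-stream",
}))
# {'Accept': 'text/event-stream'}
```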
@slact Can we hope for a fix for this issue before 2022?
My previous workaround only covers the one situation described by @stromnet. We got another, unknown type of request that raises this error again!
@slact A failed assertion on a single request/connection must not bring down the whole web server.
Any update?
Still facing this issue. I have compiled my nginx, 1.20.2, with the following command:
./configure --prefix=/usr/local/nginx --add-module=plugins/headers-more-nginx-module-0.33 --add-module=plugins/nchan-1.2.15 --with-zlib=plugins/zlib-1.2.11 --with-http_ssl_module
Error output in the error log:
2022/01/04 04:09:55 [error] 2242#0: SUB:WEBSOCKET:ws_recv NOT OK when receiving payload
2022/01/04 04:09:55 [error] 2242#0: SUB:WEBSOCKET:ws_recv NOT OK when receiving payload
plugins/nchan-1.2.15/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2022/01/04 04:13:57 [alert] 1#0: worker process 2242 exited on signal 6 (core dumped)
plugins/nchan-1.2.15/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2022/01/04 04:13:59 [alert] 1#0: worker process 2243 exited on signal 6 (core dumped)
plugins/nchan-1.2.15/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2022/01/04 04:14:01 [alert] 1#0: worker process 2244 exited on signal 6 (core dumped)
plugins/nchan-1.2.15/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2022/01/04 04:14:03 [alert] 1#0: worker process 2245 exited on signal 6 (core dumped)
My server config:
# The following resolves to 4 child processes per core on the server
worker_processes auto;
pid logs/nginx.pid;
worker_rlimit_nofile 1048576;

events {
    worker_connections 48000;
}

http {
    include mime.types;
    default_type application/octet-stream;
    sendfile on;
    keepalive_timeout 65;
    gzip on;
    proxy_cache off;
    proxy_read_timeout 300s;
    nchan_shared_memory_size 2048M;

    upstream redis_server {
        nchan_redis_server redis;
    }

    server {
        resolver 127.0.0.11;
        server_name *.xconvert.com default_server localhost;
        listen 80;
        listen 443 ssl;
        ssl_certificate /usr/local/nginx/bundle.cer;
        ssl_certificate_key /usr/local/nginx/pk.key;
        client_max_body_size 2000M;
        client_body_buffer_size 50M;
        #charset koi8-r;
        access_log off;
        error_log syslog:server=ncf-ec2-east-14-hv.xconvert.com:1516 warn;

        location ~ /smq/([\w-_]+)$ {
            nchan_subscriber;
            nchan_channel_group user_group;
            nchan_channel_id $1;
            nchan_redis_pass redis_server;
            nchan_message_buffer_length 300;
            nchan_message_timeout 60m;
        }
        location ~ /_channel_id/([\w-_]+)$ {
            nchan_publisher;
            nchan_channel_group user_group;
            nchan_channel_id $1;
            nchan_redis_pass redis_server;
            nchan_message_buffer_length 300;
            nchan_message_timeout 60m;
        }
        # ... other stuff
    }
}
This project seems like it may be dead. I found reverting to nchan 1.2.5 was the only remedy for this and for another update problem: https://github.com/slact/nchan/issues/578
@slact the current version is 1.2.15, but I still get this error when using it as a dynamic module with nginx 1.21.6. It was like this with 1.20.* too. On an Ubuntu server, built from source.
2022/03/12 16:51:29 [info] 46964#46964: Using 131072KiB of shared memory for nchan in /etc/nginx/nginx.conf:66
ker process: ../nchan/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2022/03/12 16:51:30 [alert] 46977#46977: worker process 46978 exited on signal 6 (core dumped)
2022/03/12 16:51:30 [alert] 46977#46977: worker process 47016 exited on signal 11 (core dumped)
2022/03/12 16:51:31 [alert] 46977#46977: worker process 47020 exited on signal 11 (core dumped)
nchan 1.2.5 works with nginx 1.21.6 as a dynamic module
@slact I've encountered the same issue. Version 1.2.5 works well, but when I switch to the newest version I get this error in error.log:
nginx: worker process: ../nchan/src/store/spool.c:479: spool_fetch_msg: Assertion `spool->msg_status == MSG_INVALID' failed.
2022/03/13 16:22:54 [alert] 53#53: worker process 61 exited on signal 6
And nchan is not sending any messages to subscribers. Any solution to this? I want to use the newest version, but it seems that only version 1.2.5 works.
We're experiencing this issue with the version of NChan that shipped with Ubuntu 22.04 (Jammy). This has inadvertently caused us to lose stability when upgrading to the newest version of Ubuntu. Currently, the only workaround appears to be building an older version of the library from source. It would be very preferable if this was fixed in this repo.
How is this still not fixed after 4 years...
@Kurasami Because I fixed the issue I was seeing that led to this crash. If there is another, I have been unable to replicate it. If you want this issue fixed, I need the following information:
@slact I've found a way to reproduce (I think) this crash. I've documented it here: https://github.com/slact/nchan/issues/676#issuecomment-1873965660
Only commenting here to keep these issues linked (I think they are related)
I switched from the version included in Ubuntu/Debian, and built from source, and the problem went away 🤔
I've updated nchan from
to latest 1.2.6.
While we had some issues with nginx workers crashing in the past, it's been quite stable in the last 3-4 months, no crashes at all.
Now, just about 30 minutes after the version change, the worker crashed. Here are the details (though, as I used compiled deb packages, there are fewer details than usual).
nginx error log:
FYI: this is the second dump of the two; I still have this in core_pattern.
I'm about to revert my version change for now, let me know if you need anything else.