slact / nchan

Fast, horizontally scalable, multiprocess pub/sub queuing server and proxy for HTTP, long-polling, Websockets and EventSource (SSE), powered by Nginx.
https://nchan.io/

SUB:WEBSOCKET:fsub->tmp_pool_use_count > 0 when deleting tmp_pool. this may cause trouble... #437

Closed: indrekj closed this issue 6 years ago

indrekj commented 6 years ago

nchan: v1.1.14 nginx: v1.13.8

172.24.57.177 - - [31/Jan/2018:14:04:05 +0000] "POST /publish?channel_id=d8ef4095-b002-4b80-9e70-4c132eaa2df2 HTTP/1.1" 201 103 "-" "hackney/1.10.1"
2018/01/31 14:07:38 [error] 13#13: SUB:WEBSOCKET:fsub->tmp_pool_use_count > 0 when deleting tmp_pool. this may cause trouble...
172.24.5.0 - - [31/Jan/2018:14:07:38 +0000] "GET /?channel_id=d8ef4095-b002-4b80-9e70-4c132eaa2df2,0906126b-7f21-4dbd-87d2-d55c0cb93b47 HTTP/1.1" 499 1659 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36"
2018/01/31 14:12:38 [notice] 7#7: signal 17 (SIGCHLD) received from 13
2018/01/31 14:12:38 [alert] 7#7: worker process 13 exited on signal 11 (core dumped)
2018/01/31 14:12:38 [notice] 7#7: start worker process 14

This happens from time to time. The server currently has quite low traffic.

This usually ends up with:

2018/01/31 14:29:15 [notice] 7#7: signal 17 (SIGCHLD) received from 18
2018/01/31 14:29:15 [alert] 7#7: worker process 18 exited on signal 9
2018/01/31 14:29:15 [alert] 7#7: fork() failed while spawning "worker process" (12: Out of memory)
2018/01/31 14:29:15 [alert] 7#7: could not respawn worker process
indrekj commented 6 years ago

Also:

2018/02/01 08:33:45 [error] 13#13: SUB:WEBSOCKET:fsub->tmp_pool_use_count > 0 when deleting tmp_pool. this may cause trouble...
172.24.57.0 - - [01/Feb/2018:08:33:45 +0000] "GET /?channel_id=ca5aab76-84b8-4ab6-bb02-0b282a963bcc,2711ef27-32e4-463e-91e8-21690a186182 HTTP/1.1" 499 680 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"
2018/02/01 08:33:59 [notice] 7#7: signal 17 (SIGCHLD) received from 13
2018/02/01 08:33:59 [alert] 7#7: worker process 13 exited on signal 11 (core dumped)
2018/02/01 08:33:59 [notice] 7#7: start worker process 16
2018/02/01 08:33:59 [notice] 7#7: signal 29 (SIGIO) received
2018/02/01 08:34:00 [notice] 16#16: nchan: Received 1 interprocess alert delayed by 9 sec.
2018/02/01 08:35:22 [error] 14#14: MEMSTORE:01: multimsg 00005564E1964DA0 timeout!!

The MEMSTORE:01 part seems new.

slact commented 6 years ago

That's definitely a bug. How easy is it for you to replicate? Could you provide me with the core dump of the process that crashed right after the SUB:WEBSOCKET:fsub->tmp_pool_use_count > 0 error? I also need:

ulimit -a
uname -a
your nginx.conf (edit out whatever you need to keep private)
nginx -V
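If core dumps aren't being written yet, here's a rough sketch of what usually works (the directives go in the main context of nginx.conf; the exact core file name depends on your kernel.core_pattern, so the paths below are just illustrative):

# writable location for the cores
mkdir -p /tmp/cores && chmod 1777 /tmp/cores

# nginx.conf, main context
worker_rlimit_core 512m;
working_directory /tmp/cores/;

# after the next crash, open the core in gdb and grab a full backtrace
gdb /usr/sbin/nginx /tmp/cores/core
(gdb) bt full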

If you'd like, it may be faster to debug this directly on your server -- email me at leo@nchan.io to set that up.

indrekj commented 6 years ago

ulimit -a:

/ # ulimit -a
-f: file size (blocks)             unlimited
-t: cpu time (seconds)             unlimited
-d: data seg size (kb)             unlimited
-s: stack size (kb)                8192
-c: core file size (blocks)        unlimited
-m: resident set size (kb)         unlimited
-l: locked memory (kb)             64
-p: processes                      unlimited
-n: file descriptors               1048576
-v: address space (kb)             unlimited
-w: locks                          unlimited
-e: scheduling priority            0
-r: real-time priority             0

uname -a

/ # uname -a
Linux nchan-3317647813-dhv4j 4.14.11-coreos #1 SMP Fri Jan 5 11:00:14 UTC 2018 x86_64 Linux

nginx -V

/ # nginx -V
nginx version: nginx/1.13.8
built by gcc 6.4.0 (Alpine 6.4.0)
built with OpenSSL 1.0.2n  7 Dec 2017
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-http_ssl_module --with-http_realip_module --with-http_addition_module --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_random_index_module --with-http_secure_link_module --with-http_stub_status_module --with-http_auth_request_module --with-http_xslt_module=dynamic --with-http_image_filter_module=dynamic --with-http_geoip_module=dynamic --with-http_perl_module=dynamic --with-threads --with-stream --with-stream_ssl_module --with-http_slice_module --with-mail --with-mail_ssl_module --with-file-aio --with-http_v2_module --with-ipv6 --add-dynamic-module=/usr/src/nchan-1.1.14

nginx config:

daemon off;

user nginx;
worker_processes 1;

error_log /dev/stderr info;
pid /var/run/nginx.pid;

load_module "modules/ngx_nchan_module.so";

events {
  worker_connections 1024;
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout;

  server {
    listen 80;

    proxy_read_timeout 1d;
    nchan_websocket_ping_interval 5;

    location = /ack {
      return 204;
    }

    location = /publish {
      nchan_publisher;
      nchan_channel_id $arg_channel_id;
      nchan_message_buffer_length 50;
      nchan_message_timeout 1h;
    }

    location = / {
      nchan_pubsub;
      nchan_subscriber_channel_id $arg_channel_id;
      nchan_channel_id_split_delimiter ",";
      nchan_subscriber_last_message_id $arg_last_message_id;
      nchan_publisher_channel_id ackchannel;
      nchan_publisher_upstream_request /ack;
    }
  }
}
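For reference, a typical exchange against this config looks roughly like this (channel ids taken from the logs above; websocat is just one example of a websocket client):

# publish a message to a single channel
curl -i -X POST -d 'hello' 'http://localhost/publish?channel_id=d8ef4095-b002-4b80-9e70-4c132eaa2df2'

# subscribe over websocket to two channels at once (comma-separated, per nchan_channel_id_split_delimiter)
websocat 'ws://localhost/?channel_id=d8ef4095-b002-4b80-9e70-4c132eaa2df2,0906126b-7f21-4dbd-87d2-d55c0cb93b47'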

I'll try to reproduce it locally.

indrekj commented 6 years ago

Last week I removed nchan_publisher_upstream_request from the configuration and it seems to have stabilized since. I'll let you know if there are any more incidents.
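For reference, that was the only change; the pubsub location now looks like this:

    location = / {
      nchan_pubsub;
      nchan_subscriber_channel_id $arg_channel_id;
      nchan_channel_id_split_delimiter ",";
      nchan_subscriber_last_message_id $arg_last_message_id;
      nchan_publisher_channel_id ackchannel;
      # nchan_publisher_upstream_request /ack;   <- removed
    }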

slact commented 6 years ago

Thanks -- I'll let you know when I find it.

slact commented 6 years ago

Can you retest this on master to see whether the issue persists? I haven't addressed it directly, but I refactored a lot of possibly related code and may have incidentally fixed it.
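If it helps, here's a rough sketch of rebuilding just the dynamic module from master against your current nginx (versions and paths assumed from your nginx -V output above):

git clone https://github.com/slact/nchan.git
wget http://nginx.org/download/nginx-1.13.8.tar.gz && tar xzf nginx-1.13.8.tar.gz
cd nginx-1.13.8
# repeat the configure arguments shown by "nginx -V", replacing
# --add-dynamic-module=/usr/src/nchan-1.1.14 with the path to the fresh checkout
./configure <same arguments as nginx -V> --add-dynamic-module=../nchan
make modules
cp objs/ngx_nchan_module.so /usr/lib/nginx/modules/   # then restart nginx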

slact commented 6 years ago

This error is now impossible in version 1.2.0, as the relevant code has been removed and refactored.