slact / nchan

Fast, horizontally scalable, multiprocess pub/sub queuing server and proxy for HTTP, long-polling, Websockets and EventSource (SSE), powered by Nginx.
https://nchan.io/

Assertion failed: data.shm_chid->len >= 1 #452

Closed by indrekj 4 years ago

indrekj commented 6 years ago

We're seeing the following assertion failure:

    Assertion failed: data.shm_chid->len >= 1 (/usr/src/nchan-1.1.14/src/store/memory/ipc-handlers.c: memstore_ipc_send_get_message: 528)

Nothing relevant appears in the log before it.

I've also occasionally seen:

    Assertion failed: 0 (/usr/src/nchan-1.1.14/src/store/memory/ipc.c: ipc_write_alert_fd: 197)
    Assertion failed: 0 (/usr/src/nchan-1.1.14/src/store/memory/ipc.c: ipc_write_alert_fd: 197)
    Assertion failed: spool->msg_status == MSG_CHANNEL_NOTREADY || spool->msg_status == MSG_INVALID (/usr/src/nchan-1.1.14/src/store/spool.c: its_time_for_a_spooling: 1087)

nchan version: 1.1.14 with the Redis store.

We have nchan deployed to multiple environments, and the failure seems to happen in each of them. It usually occurs every few days or so, sometimes twice a day.

nginx info:

    / # nginx -V
    nginx version: nginx/1.13.8
    built by gcc 6.4.0 (Alpine 6.4.0)
    built with OpenSSL 1.0.2n  7 Dec 2017
    TLS SNI support enabled
    configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-http_ssl_module --with-http_realip_module --with-http_addition_module --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_random_index_module --with-http_secure_link_module --with-http_stub_status_module --with-http_auth_request_module --with-http_xslt_module=dynamic --with-http_image_filter_module=dynamic --with-http_geoip_module=dynamic --with-http_perl_module=dynamic --with-threads --with-stream --with-stream_ssl_module --with-http_slice_module --with-mail --with-mail_ssl_module --with-file-aio --with-http_v2_module --with-ipv6 --add-dynamic-module=/usr/src/nchan-1.1.14

nchan configuration:

    daemon off;

    user nginx;
    worker_processes 3;

    error_log /dev/stderr info;
    pid /var/run/nginx.pid;

    load_module "modules/ngx_nchan_module.so";

    events {
      worker_connections 1024;
    }

    http {
      include /etc/nginx/mime.types;
      default_type application/octet-stream;
      access_log /dev/stdout;

      upstream redis_cluster {
        nchan_redis_server redis://redis-master-svc:6379;
      }

      server {
        listen 80;

        proxy_read_timeout 1d;
        nchan_websocket_ping_interval 10s;

        location /nchan-status {
          nchan_stub_status;
        }

        location /nginx-status {
          stub_status;
        }

        location = /publish {
          nchan_publisher;
          nchan_channel_id $arg_channel_id;
          nchan_message_buffer_length 50;
          nchan_message_timeout 1h;
          nchan_redis_pass redis_cluster;
        }

        location = / {
          nchan_pubsub;
          nchan_subscriber_channel_id $arg_channel_id;
          nchan_channel_id_split_delimiter ",";
          nchan_subscriber_last_message_id $arg_last_message_id;
          nchan_publisher_channel_id ackchannel;
          nchan_redis_pass redis_cluster;
          nchan_max_channel_id_length 9434;
        }
      }
    }

So far our only workaround is an end-to-end check (a liveness probe) that restarts the service when publish/subscribe starts to fail.
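For anyone who wants a similar stopgap, here is a rough sketch of what such a probe could look like. It is not our actual check. It assumes the `/publish` and `/` locations from the config above, that a plain GET on the subscriber location is served as a long-poll returning the oldest buffered message, and that a successful publish answers with a 2xx status. The `probe-` channel prefix and the `probe` / `http_request` names are made up for illustration.

```python
import uuid
import urllib.error
import urllib.request


def http_request(method, url, body=None, timeout=5.0):
    """Perform one HTTP request and return (status_code, body_bytes)."""
    req = urllib.request.Request(url, data=body, method=method)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # Non-2xx responses are reported as a status code, not an exception.
        return err.code, err.read()


def probe(base_url, request=http_request, timeout=5.0):
    """Return True if a freshly published message can be read back."""
    # Hypothetical naming scheme: a unique channel per probe run, so that
    # stale messages from earlier runs cannot satisfy the check.
    channel = "probe-" + uuid.uuid4().hex
    payload = uuid.uuid4().hex.encode()
    try:
        status, _ = request(
            "POST", f"{base_url}/publish?channel_id={channel}", payload, timeout
        )
        if not 200 <= status < 300:  # assumption: publish success is a 2xx
            return False
        # Assumption: a plain GET long-polls and returns the buffered message.
        status, body = request(
            "GET", f"{base_url}/?channel_id={channel}", None, timeout
        )
        return status == 200 and body == payload
    except OSError:
        # Connection refused, timeouts and URLError all derive from OSError.
        return False
```

A liveness command could call `probe("http://127.0.0.1")` and exit non-zero when it returns `False`. The timeout matters here: when the worker is wedged, the subscribe request tends to hang rather than fail outright.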

slact commented 6 years ago

Please see if you can reproduce this with a build from master. If so, I'd like to see a coredump backtrace. Let me know if this bug persists in master and we'll figure out the rest from there.