Closed: outcassed closed this issue 6 years ago.
(looking at the code, I don't think this message is related to the cause of the crash. closing this for now)
Did this problem disappear for you? I'm seeing something similar: after tried adding WAITING chanhead ... why? gets logged, nothing ever gets published any more, and after a while (a minute or so) the publish request my app server sends to Nchan times out.
This doesn't happen until the Nginx server has been up and running for fairly long (many hours? days?). Initially all is fine, and none of those tried adding ... why? messages get logged. I use long polling, Nchan 2.0, Nginx 1.15.2 and memstore.c (not Redis).
@kajmagnus is this only happening with Nchan 1.2.0, or have you noticed this happen before? If possible, I might want to investigate this on the server where this is happening.
@slact With 1.1.14, live updates were working, but then stopped working after a while, like now with 2.0 (well, a little bit after 2.0: revision ebc11be979e, which includes the fix "nginx build compatibility with lua_nginx_module on Alpine linux").
However, now with 2.0 there's the tried adding WAITING chanhead error message too. It wasn't logged in 1.1.14. It happens shortly/immediately after the browser aborts the connection, in order to send a new long-polling request. (The browser calls xhrRequest.abort() after 30 s and sends a new request, before any firewall kills the connection, or the request?) From the Nginx error log:
web_1 | 2018/07/27 09:21:06 [info] 16#16: *28031 client prematurely closed connection, client: 172.18.0.1, server: _, request: "GET /-/pubsub/subscribe/-281-107?reqNr=5 HTTP/1.1", host: "e2e-test-20.localhost", referrer: "http://e2e-test-20.localhost/-1/message-title"
web_1 | 2018/07/27 09:21:06 [error] 16#16: MEMSTORE:02: tried adding WAITING chanhead 0000556B23204820 /-281-107 to chanhead_gc. why?
What does "tried adding WAITING chanhead .... to chanhead_gc" mean? The signature of the function that logs this, is: ngx_int_t chanhead_gc_add(memstore_channel_head_t *ch, const char *reason) {
— maybe it'd be good for me to know what it does? then maybe I could experiment with different settings and find a faster / more-reliable way to make the problem happen.
Maybe I can add extra DBG
messages somewhere?
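For example, something like this maybe (just a guess at what might be useful; I'm assuming the chanhead has a status field, since the error message mentions WAITING, and that DBG takes printf-style nginx log format specifiers):
DBG("chanhead_gc_add: %p %V, status %i, reason: %s", ch, &ch->id, (ngx_int_t)ch->status, reason);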
About a live server: maybe I can set up a VPS somewhere, keep it running until the problem starts happening, and send you an SSH login? (Nginx in this case runs in a Docker container. It's possible to get into the container with something like docker-compose exec web bash. Not sure if that's fine, or if being in a Docker container makes things complicated somehow (?).)
No need to add DBG() calls just yet, I'm going to try to replicate it myself first. What does your config look like for the pub and sub locations?
@slact Publisher location (only long polling is used, currently):
nchan_max_reserved_memory 64m;
server {
# Port 81 should be blocked; it should not be accessible from the public internet.
listen 81; # [47BKFG2] in PubSub.scala
# The app server posts to this endpoint to publish WebSocket / Long Polling etc events.
location ~ /-/pubsub/publish/(.*)$ {
nchan_publisher;
nchan_channel_id "$1";
nchan_channel_id_split_delimiter ",";
}
}
Subscriber: (port 80 or 443)
# ----- WebSocket and Long Polling
#
# This is for *subscribing* to events. *Publishing* events, however, is instead
# configured in nginx.conf, and done on port 81 (not 80 or 443).
location /-/pubsub/subscribe/ {
location ~ /-/pubsub/subscribe/(.*)$ {
nchan_subscriber;
nchan_channel_id "$1";
nchan_subscriber_first_message newest;
# Ask the appserver if this is okay, and in that way also tell it that this user
# has now connected. (Apparently nchan forwards all cookies to the appserver.)
# (Any '?' here gets url encoded and then no endpoint matches
# so the app server replies 404 Not Found. So don't use the query string here.)
# (This will match the 'location /' block below, which proxy_pass:es to app:9000.)
nchan_authorize_request /-/pubsub/authorize-subscriber/$1;
}
}
(B.t.w. to me, adding some DBG here and there, and maybe understanding Nchan a bit better, seems mostly fun & interesting :- ))
Well, if you don't mind playing with it disruptively, I'd like a raise(SIGABRT) right after the "tried adding WAITING chanhead" line, then examine the coredump. You'd need to rebuild with -O0 -ggdb -fvar-tracking-assignments for best results. Otherwise, I'll try to reproduce this locally for now.
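Roughly like this, I mean (a sketch only; the spot is the ERR(...) line in chanhead_gc_add in memstore.c, and signal.h may need to be included there):
#include <signal.h>  /* for raise(), if not already included */
...
if(ch->status == WAITING && ...) {
  ERR("tried adding WAITING chanhead %p %V to chanhead_gc. why?", ch, &ch->id);
  raise(SIGABRT);  /* abort right here so the coredump captures this frame and ch */
  //don't gc it just yet.
  return NGX_OK;
}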
@slact If I build with those flags and generate an Nginx core dump on my laptop, in an Ubuntu 18.04 Vagrant VM, would it be a good idea for me to send the dump to you?
I was able to generate a core dump, by adding the raise(..) at the start of the function.
The core dump lacks some info though; maybe you can see immediately what's wrong? gdb says "cannot read pathname for load map", although I added the debug-symbol flags.
gdb's output, when backtrace:ing the core dump (this is Alpine Linux):
bash-4.4# gdb usr/sbin/nginx-debug /tmp/cores/core.nginx-debug.21187
GNU gdb (GDB) 8.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from usr/sbin/nginx-debug...done. <—— it has symbols?
[New LWP 21187]
warning: Can't read pathname for load map: No error information. <—— but no "map"? what?
Core was generated by `nginx: worke'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fdacbfa78e3 in __restore_sigs () from /lib/ld-musl-x86_64.so.1
(gdb) backtrace full
#0 0x00007fdacbfa78e3 in __restore_sigs () from /lib/ld-musl-x86_64.so.1 <—— ? needs symbols ?
No symbol table info available.
#1 0x00007fdacbfa79fd in raise () from /lib/ld-musl-x86_64.so.1
No symbol table info available.
#2 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb)
Here're the flags: ( https://github.com/debiki/talkyard/blob/w-km/docker/web/Dockerfile#L168 )
# generate a debug build, will be at /usr/sbin/nginx-debug
&& ./configure $CONFIG \
--with-debug \
# for Nchan: -ggdb or -g? or -ggdb3?
--with-cc-opt='-O0 -ggdb -fvar-tracking-assignments' \
&& make -j$(getconf _NPROCESSORS_ONLN) \
&& mv objs/nginx objs/nginx-debug \
...
&& install -m755 objs/nginx-debug /usr/sbin/nginx-debug \ <— doesn't strip symbols? (only if incl -s?)
...
# What? Strips debug symbols from nginx-debug too, right?:
#&& strip /usr/sbin/nginx* \
# Instead:
&& strip /usr/sbin/nginx \
...
Maybe I need symbols for ld-musl-x86_64.so.1 too, somehow? Otherwise gdb is unable to backtrace into the nginx-debug binary?
Ok, now I found a way to get a proper backtrace, namely kill(getpid(), SIGSEGV):
ngx_int_t chanhead_gc_add(memstore_channel_head_t *ch, const char *reason) {
ngx_int_t slot = memstore_slot();
DBG("Chanhead gc add %p %V: %s", ch, &ch->id, reason);
kill(getpid(), SIGSEGV);
instead of raise(SIGABRT). Not sure why that doesn't result in a backtrace-able dump?
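(By the way, another way to force a core dump right in the current stack frame, instead of the kill() call, could be a deliberate null-pointer write, e.g.:
*(volatile int *)0 = 0;  /* intentional null write --> SIGSEGV, core dump in this very frame */
but kill(getpid(), SIGSEGV) works, so I'll stick with that.)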
Example backtrace (NOT a "real" one — I just generated a SIGSEGV at the very start of the function, always)
(gdb) backtrace
#0 0x00007fb5aee5690d in kill () from /lib/ld-musl-x86_64.so.1
#1 0x00007fff9295a250 in ?? ()
#2 0x000055916bfa6026 in chanhead_gc_add (ch=0x55916cbabf00, reason=0x55916bfff910 "add owner chanhead after publish") at /tmp/nginx-modules/nchan/src/store/memory/memstore.c:1275
#3 0x000055916bfa6610 in nchan_memstore_publish_generic (head=0x55916cbabf00, msg=0x7fb5a1ef3000, status_code=0, status_line=0x0) at /tmp/nginx-modules/nchan/src/store/memory/memstore.c:1431
#4 0x000055916bfac1ab in nchan_store_chanhead_publish_message_generic (chead=0x55916cbabf00, msg=0x7fb5a1ef3000, msg_in_shm=1, cf=0x55916cbcdf98, callback=0x55916bf9ce74 <publish_message_generic_callback>, privdata=0x7fff9295b470)
at /tmp/nginx-modules/nchan/src/store/memory/memstore.c:3338
#5 0x000055916bfab7a2 in nchan_store_publish_message_to_single_channel_id (channel_id=0x7fb5a1ef2040, msg=0x7fb5a1ef3000, msg_in_shm=1, cf=0x55916cbcdf98, callback=0x55916bf9ce74 <publish_message_generic_callback>,
privdata=0x7fff9295b470) at /tmp/nginx-modules/nchan/src/store/memory/memstore.c:3196
#6 0x000055916bfab9df in nchan_store_publish_message_generic (channel_id=0x7fb5a1ef2040, msg=0x7fb5a1ef3000, msg_in_shm=1, cf=0x55916cbcdf98, callback=0x55916bf9ce74 <publish_message_generic_callback>, privdata=0x7fff9295b470)
at /tmp/nginx-modules/nchan/src/store/memory/memstore.c:3228
#7 0x000055916bf9cd93 in receive_publish_message (sender=2, d=0x7fff9295b500) at /tmp/nginx-modules/nchan/src/store/memory/ipc-handlers.c:431
#8 0x000055916bf9f409 in memstore_ipc_alert_handler (sender=2, code=5, data=0x7fff9295b500) at /tmp/nginx-modules/nchan/src/store/memory/ipc-handlers.c:1062
#9 0x000055916bf9b0af in ipc_read_handler (ev=0x7fb5a14b42c0) at /tmp/nginx-modules/nchan/src/store/memory/ipc.c:457
#10 0x000055916be28d58 in ngx_epoll_process_events (cycle=0x55916cbac9b0, timer=18446744073709551615, flags=1) at src/event/modules/ngx_epoll_module.c:902
#11 0x000055916be148f2 in ngx_process_events_and_timers (cycle=0x55916cbac9b0) at src/event/ngx_event.c:242
---Type <return> to continue, or q <return> to quit---
#12 0x000055916be25c37 in ngx_worker_process_cycle (cycle=0x55916cbac9b0, data=0x5) at src/os/unix/ngx_process_cycle.c:750
#13 0x000055916be21fa7 in ngx_spawn_process (cycle=0x55916cbac9b0, proc=0x55916be25b4a <ngx_worker_process_cycle>, data=0x5, name=0x55916bfd391b "worker process", respawn=5) at src/os/unix/ngx_process.c:199
#14 0x000055916be25699 in ngx_reap_children (cycle=0x55916cbac9b0) at src/os/unix/ngx_process_cycle.c:622
#15 0x000055916be24095 in ngx_master_process_cycle (cycle=0x55916cbac9b0) at src/os/unix/ngx_process_cycle.c:175
#16 0x000055916bddc22e in main (argc=1, argv=0x7fff9295bb98) at src/core/nginx.c:382
Seems I got a backtrace (below). This happens lots of times: there are 40 core dumps and I randomly looked at about 15; 14 of them were like the one below. (Edit: This is maybe a little hard to read. There's an easier-to-read follow-up post just below, incl. debug messages. /Edit)
The tried adding ... why? core dump:
bash-4.4# gdb /usr/sbin/nginx-debug /tmp/cores/core.nginx-debug.38
...
(gdb) bt
#0 0x00007fcc7ffff90d in kill () from /lib/ld-musl-x86_64.so.1
#1 0x00007fffac311900 in ?? ()
#2 0x000055f3fd126169 in chanhead_gc_add (ch=0x55f3fe044700, reason=0x55f3fd17f268 "sub count == 0 after spooler dequeue") at /tmp/nginx-modules/nchan/src/store/memory/memstore.c:1286
#3 0x000055f3fd123180 in memstore_spooler_bulk_dequeue_handler (spl=0x55f3fe044768, type=LONGPOLL, count=1, privdata=0x55f3fe044700) at /tmp/nginx-modules/nchan/src/store/memory/memstore.c:837
#4 0x000055f3fd117d2c in spool_bubbleup_dequeue_handler (spool=0x55f3fe0447d8, sub=0x55f3fe03d840, spl=0x55f3fe044768) at /tmp/nginx-modules/nchan/src/store/spool.c:649
#5 0x000055f3fd1175d0 in spool_sub_dequeue_callback (sub=0x55f3fe03d840, data=0x55f3fe03e990) at /tmp/nginx-modules/nchan/src/store/spool.c:488
#6 0x000055f3fd10851a in longpoll_dequeue (self=0x55f3fe03d840) at /tmp/nginx-modules/nchan/src/subscribers/longpoll.c:170
#7 0x000055f3fd107c78 in sudden_abort_handler (sub=0x55f3fe03d840) at /tmp/nginx-modules/nchan/src/subscribers/longpoll.c:30
#8 0x000055f3fcfceab9 in ngx_http_terminate_request (r=0x55f3fdfdd1f0, rc=499) at src/http/ngx_http_request.c:2530
#9 0x000055f3fcfce341 in ngx_http_finalize_request (r=0x55f3fdfdd1f0, rc=499) at src/http/ngx_http_request.c:2347
#10 0x000055f3fcfcf6d2 in ngx_http_test_reading (r=0x55f3fdfdd1f0) at src/http/ngx_http_request.c:2880
#11 0x000055f3fcfcdfd7 in ngx_http_request_handler (ev=0x7fcc7265d500) at src/http/ngx_http_request.c:2242
#12 0x000055f3fcfa8d58 in ngx_epoll_process_events (cycle=0x55f3fdfd79b0, timer=1000, flags=1) at src/event/modules/ngx_epoll_module.c:902
#13 0x000055f3fcf948f2 in ngx_process_events_and_timers (cycle=0x55f3fdfd79b0) at src/event/ngx_event.c:242
#14 0x000055f3fcfa5c37 in ngx_worker_process_cycle (cycle=0x55f3fdfd79b0, data=0x1) at src/os/unix/ngx_process_cycle.c:750
#15 0x000055f3fcfa1fa7 in ngx_spawn_process (cycle=0x55f3fdfd79b0, proc=0x55f3fcfa5b4a <ngx_worker_process_cycle>, data=0x1, name=0x55f3fd15391b "worker process", respawn=1) at src/os/unix/ngx_process.c:199
#16 0x000055f3fcfa5699 in ngx_reap_children (cycle=0x55f3fdfd79b0) at src/os/unix/ngx_process_cycle.c:622
#17 0x000055f3fcfa4095 in ngx_master_process_cycle (cycle=0x55f3fdfd79b0) at src/os/unix/ngx_process_cycle.c:175
#18 0x000055f3fcf5c22e in main (argc=1, argv=0x7fffac312158) at src/core/nginx.c:382
(gdb) bt f
#0 0x00007fcc7ffff90d in kill () from /lib/ld-musl-x86_64.so.1
No symbol table info available.
#1 0x00007fffac311900 in ?? ()
No symbol table info available.
#2 0x000055f3fd126169 in chanhead_gc_add (ch=0x55f3fe044700, reason=0x55f3fd17f268 "sub count == 0 after spooler dequeue") at /tmp/nginx-modules/nchan/src/store/memory/memstore.c:1286
slot = 1
__func__ = "chanhead_gc_add"
#3 0x000055f3fd123180 in memstore_spooler_bulk_dequeue_handler (spl=0x55f3fe044768, type=LONGPOLL, count=1, privdata=0x55f3fe044700) at /tmp/nginx-modules/nchan/src/store/memory/memstore.c:837
head = 0x55f3fe044700
__func__ = "memstore_spooler_bulk_dequeue_handler"
#4 0x000055f3fd117d2c in spool_bubbleup_dequeue_handler (spool=0x55f3fe0447d8, sub=0x55f3fe03d840, spl=0x55f3fe044768) at /tmp/nginx-modules/nchan/src/store/spool.c:649
h = 0x55f3fd3dae60 <handlers>
#5 0x000055f3fd1175d0 in spool_sub_dequeue_callback (sub=0x55f3fe03d840, data=0x55f3fe03e990) at /tmp/nginx-modules/nchan/src/store/spool.c:488
d = 0x55f3fe03e990
spool = 0x55f3fe0447d8
__func__ = "spool_sub_dequeue_callback"
#6 0x000055f3fd10851a in longpoll_dequeue (self=0x55f3fe03d840) at /tmp/nginx-modules/nchan/src/subscribers/longpoll.c:170
fsub = 0x55f3fe03d840
r = 0x55f3fdfdd1f0
ctx = 0x55f3fdfde730
finalize_now = 1
#7 0x000055f3fd107c78 in sudden_abort_handler (sub=0x55f3fe03d840) at /tmp/nginx-modules/nchan/src/subscribers/longpoll.c:30
No locals.
#8 0x000055f3fcfceab9 in ngx_http_terminate_request (r=0x55f3fdfdd1f0, rc=499) at src/http/ngx_http_request.c:2530
cln = 0x55f3fdfde128
mr = 0x55f3fdfdd1f0
e = 0x55f3fe0063b0
#9 0x000055f3fcfce341 in ngx_http_finalize_request (r=0x55f3fdfdd1f0, rc=499) at src/http/ngx_http_request.c:2347
c = 0x7fcc7291dcb8
pr = 0x7fcc7fff2cda <getsockopt+34>
clcf = 0x7fffac312158
#10 0x000055f3fcfcf6d2 in ngx_http_test_reading (r=0x55f3fdfdd1f0) at src/http/ngx_http_request.c:2880
n = 22003
buf = ""
err = 0
rev = 0x7fcc7265d500
c = 0x7fcc7291dcb8
#11 0x000055f3fcfcdfd7 in ngx_http_request_handler (ev=0x7fcc7265d500) at src/http/ngx_http_request.c:2242
c = 0x7fcc7291dcb8
r = 0x55f3fdfdd1f0
#12 0x000055f3fcfa8d58 in ngx_epoll_process_events (cycle=0x55f3fdfd79b0, timer=1000, flags=1) at src/event/modules/ngx_epoll_module.c:902
events = 1
revents = 8197
instance = 0
i = 0
---Type <return> to continue, or q <return> to quit---
level = 140736082287520
err = 0
rev = 0x7fcc7265d500
wev = 0x3e8
queue = 0x7fcc7239d1c8
c = 0x7fcc7291dcb8
This is how the edited Nchan code looks:
ngx_int_t chanhead_gc_add(memstore_channel_head_t *ch, const char *reason) {
ngx_int_t slot = memstore_slot();
DBG("Chanhead gc add %p %V: %s", ch, &ch->id, reason);
if(!ch->shutting_down) {
assert(ch->foreign_owner_ipc_sub == NULL); //we don't accept still-subscribed chanheads
}
if(ch->slot != ch->owner && ch->shared) {
ngx_atomic_fetch_add(&ch->shared->gc.outside_refcount, -1);
ch->shared = NULL;
}
if(ch->status == WAITING && !(ch->cf && ch->cf->redis.enabled) && !(ngx_exiting || ngx_quit)) {
ERR("tried adding WAITING chanhead %p %V to chanhead_gc. why?", ch, &ch->id);
ERR("kill(getpid(), SIGSEGV) for fun and profit");
kill(getpid(), SIGSEGV); // do it yes yes <——
//don't gc it just yet.
return NGX_OK;
}
...
}
Except for one: the very first core dump was in nginx's src/http/modules/ngx_http_charset_filter_module.c.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055f3fd009c41 in ngx_http_main_request_charset (r=0x55f3fe0486a0, src=0x7fffac311920) at src/http/modules/ngx_http_charset_filter_module.c:394
394 src/http/modules/ngx_http_charset_filter_module.c: No such file or directory.
(gdb) bt
#0 0x000055f3fd009c41 in ngx_http_main_request_charset (r=0x55f3fe0486a0, src=0x7fffac311920) at src/http/modules/ngx_http_charset_filter_module.c:394
#1 0x000055f3fd0096ce in ngx_http_charset_header_filter (r=0x55f3fe0486a0) at src/http/modules/ngx_http_charset_filter_module.c:228
#2 0x000055f3fd00cca3 in ngx_http_userid_filter (r=0x55f3fe0486a0) at src/http/modules/ngx_http_userid_filter_module.c:203
#3 0x000055f3fd00e40d in ngx_http_headers_filter (r=0x55f3fe0486a0) at src/http/modules/ngx_http_headers_filter_module.c:174
#4 0x000055f3fd0ba9be in ngx_http_lua_capture_header_filter (r=0x55f3fe0486a0) at /tmp/nginx-modules/lua-nginx-module/src/ngx_http_lua_capturefilter.c:109
#5 0x000055f3fd010390 in ngx_http_not_modified_header_filter (r=0x55f3fe0486a0) at src/http/modules/ngx_http_not_modified_filter_module.c:61
#6 0x000055f3fcfbf702 in ngx_http_send_header (r=0x55f3fe0486a0) at src/http/ngx_http_core_module.c:1755
#7 0x000055f3fcfea7ff in ngx_http_upstream_send_response (r=0x55f3fe0486a0, u=0x55f3fe049860) at src/http/ngx_http_upstream.c:2899
#8 0x000055f3fcfe97a2 in ngx_http_upstream_process_header (r=0x55f3fe0486a0, u=0x55f3fe049860) at src/http/ngx_http_upstream.c:2427
#9 0x000055f3fcfe7132 in ngx_http_upstream_handler (ev=0x7fcc7265d500) at src/http/ngx_http_upstream.c:1281
#10 0x000055f3fcfa8d58 in ngx_epoll_process_events (cycle=0x55f3fdfd79b0, timer=208, flags=1) at src/event/modules/ngx_epoll_module.c:902
#11 0x000055f3fcf948f2 in ngx_process_events_and_timers (cycle=0x55f3fdfd79b0) at src/event/ngx_event.c:242
#12 0x000055f3fcfa5c37 in ngx_worker_process_cycle (cycle=0x55f3fdfd79b0, data=0x1) at src/os/unix/ngx_process_cycle.c:750
#13 0x000055f3fcfa1fa7 in ngx_spawn_process (cycle=0x55f3fdfd79b0, proc=0x55f3fcfa5b4a <ngx_worker_process_cycle>, data=0x1, name=0x55f3fd15391b "worker process", respawn=-3) at src/os/unix/ngx_process.c:199
#14 0x000055f3fcfa4887 in ngx_start_worker_processes (cycle=0x55f3fdfd79b0, n=8, type=-3) at src/os/unix/ngx_process_cycle.c:359
#15 0x000055f3fcfa3dfd in ngx_master_process_cycle (cycle=0x55f3fdfd79b0) at src/os/unix/ngx_process_cycle.c:131
#16 0x000055f3fcf5c22e in main (argc=1, argv=0x7fffac312158) at src/core/nginx.c:382
(gdb) bt f
#0 0x000055f3fd009c41 in ngx_http_main_request_charset (r=0x55f3fe0486a0, src=0x7fffac311920) at src/http/modules/ngx_http_charset_filter_module.c:394
charset = 40
main_charset = 0x1020
ctx = 0x421
#1 0x000055f3fd0096ce in ngx_http_charset_header_filter (r=0x55f3fe0486a0) at src/http/modules/ngx_http_charset_filter_module.c:228
charset = 5184
source_charset = 41
dst = {len = 1056,
data = 0x7fcc7ffde209 <malloc+941> "H\215C\020\353\rD\211\346H\211\337\350\022\366\377\377\353\217H\203\304\070[]A\\A]A^A_\303SH\211\373\350(\374\377\377H\205\300t)\366@\370\001t#H\203\303\a1\322H\301\353\003H9\323t\024H\203", <incomplete sequence \320>}
src = {len = 140516260139136, data = 0x4fc108b3e0674d00 <error: Cannot access memory at address 0x4fc108b3e0674d00>}
charsets = 0x7fcc80243e00
mcf = 0xffffffff00000000
#2 0x000055f3fd00cca3 in ngx_http_userid_filter (r=0x55f3fe0486a0) at src/http/modules/ngx_http_userid_filter_module.c:203
ctx = 0x7fffac312168
conf = 0x7fffac3119f0
#3 0x000055f3fd00e40d in ngx_http_headers_filter (r=0x55f3fe0486a0) at src/http/modules/ngx_http_headers_filter_module.c:174
value = {len = 18446744073709551615, data = 0x55f3fe046910 ""}
i = 94506411468988
safe_status = 94506427130069
h = 0x65
conf = 0x65
#4 0x000055f3fd0ba9be in ngx_http_lua_capture_header_filter (r=0x55f3fe0486a0) at /tmp/nginx-modules/lua-nginx-module/src/ngx_http_lua_capturefilter.c:109
psr = 0x55f3fe047e48
old_ctx = 0x55f3fd1616b1
ctx = 0x0
psr_data = 0x6
#5 0x000055f3fd010390 in ngx_http_not_modified_header_filter (r=0x55f3fe0486a0) at src/http/modules/ngx_http_not_modified_filter_module.c:61
No locals.
#6 0x000055f3fcfbf702 in ngx_http_send_header (r=0x55f3fe0486a0) at src/http/ngx_http_core_module.c:1755
No locals.
#7 0x000055f3fcfea7ff in ngx_http_upstream_send_response (r=0x55f3fe0486a0, u=0x55f3fe049860) at src/http/ngx_http_upstream.c:2899
n = 94506427127904
rc = 94506427111136
p = 0x4fc108b3e0674d00
c = 0x55f3fe0486a0
clcf = 0x55f3fe049180
#8 0x000055f3fcfe97a2 in ngx_http_upstream_process_header (r=0x55f3fe0486a0, u=0x55f3fe049860) at src/http/ngx_http_upstream.c:2427
n = 94
rc = 0
c = 0x7fcc7291dcb8
#9 0x000055f3fcfe7132 in ngx_http_upstream_handler (ev=0x7fcc7265d500) at src/http/ngx_http_upstream.c:1281
c = 0x7fcc7291dac8
r = 0x55f3fe0486a0
u = 0x55f3fe049860
I'm getting the impression that 1) there's a crash in a worker, not related to Nchan, but this crash somehow messes up Nchan's internal state? Maybe Nchan's memstore.c has some counters or state that don't get updated properly if the worker process suddenly exits?
Then 2) all long-polling requests that call xhrRequest.abort() in the browser trigger the tried adding WAITING chanhead %p %V to chanhead_gc. why? log message (and a core dump, because I added kill(...)). Also, Nchan + long polling stops working.
I ran all the end-to-end tests again for some hours ... and after 2 hours, this happened again (I think this is easier to read than my previous post; plus, it includes debug log messages):
bash-4.4# gdb /usr/sbin/nginx-debug /tmp/cores/core.nginx-debug.22
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055a00edb1c41 in ngx_http_main_request_charset (r=0x55a00f6f33c0, src=0x7ffcece90ce0) at src/http/modules/ngx_http_charset_filter_module.c:394
394 src/http/modules/ngx_http_charset_filter_module.c: No such file or directory.
(gdb)
(gdb) bt
#0 0x000055a00edb1c41 in ngx_http_main_request_charset (r=0x55a00f6f33c0, src=0x7ffcece90ce0) at src/http/modules/ngx_http_charset_filter_module.c:394
#1 0x000055a00edb16ce in ngx_http_charset_header_filter (r=0x55a00f6f33c0) at src/http/modules/ngx_http_charset_filter_module.c:228
#2 0x000055a00edb4ca3 in ngx_http_userid_filter (r=0x55a00f6f33c0) at src/http/modules/ngx_http_userid_filter_module.c:203
#3 0x000055a00edb640d in ngx_http_headers_filter (r=0x55a00f6f33c0) at src/http/modules/ngx_http_headers_filter_module.c:174
#4 0x000055a00ee629be in ngx_http_lua_capture_header_filter (r=0x55a00f6f33c0) at /tmp/nginx-modules/lua-nginx-module/src/ngx_http_lua_capturefilter.c:109
#5 0x000055a00edb8390 in ngx_http_not_modified_header_filter (r=0x55a00f6f33c0) at src/http/modules/ngx_http_not_modified_filter_module.c:61
#6 0x000055a00ed67702 in ngx_http_send_header (r=0x55a00f6f33c0) at src/http/ngx_http_core_module.c:1755
#7 0x000055a00ed927ff in ngx_http_upstream_send_response (r=0x55a00f6f33c0, u=0x55a00f6f4580) at src/http/ngx_http_upstream.c:2899
#8 0x000055a00ed917a2 in ngx_http_upstream_process_header (r=0x55a00f6f33c0, u=0x55a00f6f4580) at src/http/ngx_http_upstream.c:2427
#9 0x000055a00ed8f132 in ngx_http_upstream_handler (ev=0x7f8426644560) at src/http/ngx_http_upstream.c:1281
#10 0x000055a00ed50d58 in ngx_epoll_process_events (cycle=0x55a00f66f9b0, timer=948, flags=1) at src/event/modules/ngx_epoll_module.c:902
#11 0x000055a00ed3c8f2 in ngx_process_events_and_timers (cycle=0x55a00f66f9b0) at src/event/ngx_event.c:242
#12 0x000055a00ed4dc37 in ngx_worker_process_cycle (cycle=0x55a00f66f9b0, data=0x6) at src/os/unix/ngx_process_cycle.c:750
#13 0x000055a00ed49fa7 in ngx_spawn_process (cycle=0x55a00f66f9b0, proc=0x55a00ed4db4a <ngx_worker_process_cycle>, data=0x6, name=0x55a00eefb91b "worker process", respawn=-3) at src/os/unix/ngx_process.c:199
#14 0x000055a00ed4c887 in ngx_start_worker_processes (cycle=0x55a00f66f9b0, n=8, type=-3) at src/os/unix/ngx_process_cycle.c:359
#15 0x000055a00ed4bdfd in ngx_master_process_cycle (cycle=0x55a00f66f9b0) at src/os/unix/ngx_process_cycle.c:131
#16 0x000055a00ed0422e in main (argc=1, argv=0x7ffcece91518) at src/core/nginx.c:382
Here's the debug log for Nginx worker 22 (the one with the core dump above), just before it died:
$ egrep -a '22#22' docker/data/nginx-logs/error.log | tail -n200
...
2018/08/01 03:28:19 [debug] 22#22: *46056 http process request line
2018/08/01 03:28:19 [debug] 22#22: *46056 http request line: "GET /-/pubsub/subscribe/-1930-102?reqNr=1 HTTP/1.1"
2018/08/01 03:28:19 [debug] 22#22: *46056 http uri: "/-/pubsub/subscribe/-1930-102"
2018/08/01 03:28:19 [debug] 22#22: *46056 http args: "reqNr=1"
2018/08/01 03:28:19 [debug] 22#22: *46056 http exten: ""
2018/08/01 03:28:19 [debug] 22#22: *46056 posix_memalign: 000055A00F6F19A0:4096 @16
2018/08/01 03:28:19 [debug] 22#22: *46056 http process request header line
2018/08/01 03:28:19 [debug] 22#22: *46056 http header: "Host: comments-for-e2e-test-embedvo-localhost-8080.localhost"
2018/08/01 03:28:19 [debug] 22#22: *46056 http header: "Connection: keep-alive"
2018/08/01 03:28:19 [debug] 22#22: *46056 http header: "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36"
2018/08/01 03:28:19 [debug] 22#22: *46056 http header: "X-Requested-With: XMLHttpRequest"
2018/08/01 03:28:19 [debug] 22#22: *46056 http header: "Accept: */*"
2018/08/01 03:28:19 [debug] 22#22: *46056 http header: "Referer: http://comments-for-e2e-test-embedvo-localhost-8080.localhost/-/embedded-editor?embeddingUrl=http://e2e-test-embedvo.localhost:8080/emb-cmts-edit-and-vote.html"
2018/08/01 03:28:19 [debug] 22#22: *46056 http header: "Accept-Encoding: gzip, deflate, br"
2018/08/01 03:28:19 [debug] 22#22: *46056 http header: "Accept-Language: en-US,en;q=0.9"
2018/08/01 03:28:19 [debug] 22#22: *46056 http header: "Cookie: esCoE2eTestPassword=public; dwCoBrId=1533094089tlu38o9; dwCoSid=POasvOPmNK3JHj.102.1533094099077.qmlh0t6tbk; XSRF-TOKEN=1533094099077.pwjp7u1bhs"
2018/08/01 03:28:19 [debug] 22#22: *46056 http header done
2018/08/01 03:28:19 [debug] 22#22: *46056 generic phase: 0
2018/08/01 03:28:19 [debug] 22#22: *46056 rewrite phase: 1
2018/08/01 03:28:19 [debug] 22#22: *46056 test location: "/"
2018/08/01 03:28:19 [debug] 22#22: *46056 test location: ".well-known"
2018/08/01 03:28:19 [debug] 22#22: *46056 test location: "-/pubsub/subscribe/"
2018/08/01 03:28:19 [debug] 22#22: *46056 test location: ~ "/-/pubsub/subscribe/(.*)$"
2018/08/01 03:28:19 [debug] 22#22: *46056 using configuration "/-/pubsub/subscribe/(.*)$"
2018/08/01 03:28:19 [debug] 22#22: *46056 http cl:-1 max:10485760
2018/08/01 03:28:19 [debug] 22#22: *46056 rewrite phase: 3
2018/08/01 03:28:19 [debug] 22#22: *46056 post rewrite phase: 4
2018/08/01 03:28:19 [debug] 22#22: *46056 generic phase: 5
2018/08/01 03:28:19 [debug] 22#22: *46056 generic phase: 6
2018/08/01 03:28:19 [debug] 22#22: *46056 http script var: "�"
2018/08/01 03:28:19 [debug] 22#22: *46056 limit_req[0]: -2 3.960
2018/08/01 03:28:19 [debug] 22#22: *46056 http script var: "_"
2018/08/01 03:28:19 [debug] 22#22: *46056 limit_req[1]: 0 0.200
2018/08/01 03:28:19 [debug] 22#22: *46056 generic phase: 7
2018/08/01 03:28:19 [debug] 22#22: *46056 http script var: "�"
2018/08/01 03:28:19 [debug] 22#22: *46056 limit conn: 4E5C3E29 2
2018/08/01 03:28:19 [debug] 22#22: *46056 add cleanup: 000055A00F6F1808
2018/08/01 03:28:19 [debug] 22#22: *46056 http script var: "_"
2018/08/01 03:28:19 [debug] 22#22: *46056 limit conn: 29D6A3E8 2
2018/08/01 03:28:19 [debug] 22#22: *46056 add cleanup: 000055A00F6F1838
2018/08/01 03:28:19 [debug] 22#22: *46056 access phase: 8
2018/08/01 03:28:19 [debug] 22#22: *46056 access phase: 9
2018/08/01 03:28:19 [debug] 22#22: *46056 access phase: 10
2018/08/01 03:28:19 [debug] 22#22: *46056 lua access handler, uri:"/-/pubsub/subscribe/-1930-102" c:1
2018/08/01 03:28:19 [debug] 22#22: *46056 lua creating new thread
2018/08/01 03:28:19 [debug] 22#22: *46056 lua reset ctx
2018/08/01 03:28:19 [debug] 22#22: *46056 http cleanup add: 000055A00F6F1898
2018/08/01 03:28:19 [debug] 22#22: *46056 lua run thread, top:0 c:1
2018/08/01 03:28:19 [debug] 22#22: *46056 lua request header: "Host: comments-for-e2e-test-embedvo-localhost-8080.localhost"
2018/08/01 03:28:19 [debug] 22#22: *46056 lua request header: "Connection: keep-alive"
2018/08/01 03:28:19 [debug] 22#22: *46056 lua request header: "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36"
2018/08/01 03:28:19 [debug] 22#22: *46056 lua request header: "X-Requested-With: XMLHttpRequest"
2018/08/01 03:28:19 [debug] 22#22: *46056 lua request header: "Accept: */*"
2018/08/01 03:28:19 [debug] 22#22: *46056 lua request header: "Referer: http://comments-for-e2e-test-embedvo-localhost-8080.localhost/-/embedded-editor?embeddingUrl=http://e2e-test-embedvo.localhost:8080/emb-cmts-edit-and-vote.html"
2018/08/01 03:28:19 [debug] 22#22: *46056 lua request header: "Accept-Encoding: gzip, deflate, br"
2018/08/01 03:28:19 [debug] 22#22: *46056 lua request header: "Accept-Language: en-US,en;q=0.9"
2018/08/01 03:28:19 [debug] 22#22: *46056 lua request header: "Cookie: esCoE2eTestPassword=public; dwCoBrId=1533094089tlu38o9; dwCoSid=POasvOPmNK3JHj.102.1533094099077.qmlh0t6tbk; XSRF-TOKEN=1533094099077.pwjp7u1bhs"
2018/08/01 03:28:19 [debug] 22#22: *46056 lua resume returned 0
2018/08/01 03:28:19 [debug] 22#22: *46056 lua light thread ended normally
2018/08/01 03:28:19 [debug] 22#22: *46056 lua deleting light thread
2018/08/01 03:28:19 [debug] 22#22: *46056 post access phase: 11
2018/08/01 03:28:19 [debug] 22#22: *46056 generic phase: 12
2018/08/01 03:28:19 [debug] 22#22: *46056 generic phase: 13
2018/08/01 03:28:19 [debug] 22#22: *46056 http script capture: "-1930-102"
2018/08/01 03:28:19 [debug] 22#22: *46056 http cleanup add: 000055A00F6F1908
2018/08/01 03:28:19 [debug] 22#22: *46056 http script copy: "/-/pubsub/authorize-subscriber/"
2018/08/01 03:28:19 [debug] 22#22: *46056 http script capture: "-1930-102"
2018/08/01 03:28:19 [debug] 22#22: *46056 http finalize request: -4, "/-/pubsub/subscribe/-1930-102?reqNr=1" a:1, c:2
2018/08/01 03:28:19 [debug] 22#22: *46056 http request count:2 blk:0
2018/08/01 03:28:19 [debug] 22#22: *46056 http run request: "/-/pubsub/subscribe/-1930-102?reqNr=1"
2018/08/01 03:28:19 [debug] 22#22: *46056 http request empty handler
2018/08/01 03:28:19 [debug] 22#22: *46056 http run request: "/-/pubsub/subscribe/-1930-102?reqNr=1"
2018/08/01 03:28:19 [debug] 22#22: *46056 http test reading
2018/08/01 03:28:19 [info] 22#22: *46056 client prematurely closed connection, client: 172.18.0.1, server: _, request: "GET /-/pubsub/subscribe/-1930-102?reqNr=1 HTTP/1.1", host: "comments-for-e2e-test-embedvo-localhost-8080.localhost", referrer: "http://comments-for-e2e-test-embedvo-localhost-8080.localhost/-/embedded-editor?embeddingUrl=http://e2e-test-embedvo.localhost:8080/emb-cmts-edit-and-vote.html"
2018/08/01 03:28:19 [debug] 22#22: *46056 http finalize request: 499, "/-/pubsub/subscribe/-1930-102?reqNr=1" a:1, c:1
2018/08/01 03:28:19 [debug] 22#22: *46056 http terminate request count:1
2018/08/01 03:28:19 [debug] 22#22: *46056 http finalize request: 0, "/-/pubsub/subscribe/-1930-102?reqNr=1" a:1, c:1
2018/08/01 03:28:19 [debug] 22#22: *46056 http terminate request count:1
2018/08/01 03:28:19 [debug] 22#22: *46056 http terminate cleanup count:1 blk:0
2018/08/01 03:28:19 [debug] 22#22: *46056 lua request cleanup: forcible=0
2018/08/01 03:28:19 [debug] 22#22: *46056 http terminate cleanup count:1 blk:0
2018/08/01 03:28:19 [debug] 22#22: *46056 http posted request: "/-/pubsub/subscribe/-1930-102?reqNr=1"
2018/08/01 03:28:19 [debug] 22#22: *46056 http terminate handler count:1
2018/08/01 03:28:19 [debug] 22#22: *46056 http request count:1 blk:0
2018/08/01 03:28:19 [debug] 22#22: *46056 http close request
2018/08/01 03:28:19 [debug] 22#22: *46056 lua log handler, uri:"/-/pubsub/subscribe/-1930-102" c:0
2018/08/01 03:28:19 [debug] 22#22: *46056 http log handler
2018/08/01 03:28:19 [debug] 22#22: *46056 run cleanup: 000055A00F6F1838
2018/08/01 03:28:19 [debug] 22#22: *46056 run cleanup: 000055A00F6F1808
2018/08/01 03:28:19 [debug] 22#22: *46056 free: 000055A00F6F0980, unused: 4
2018/08/01 03:28:19 [debug] 22#22: *46056 free: 000055A00F6F19A0, unused: 2076
2018/08/01 03:28:19 [debug] 22#22: *46056 close http connection: 28
2018/08/01 03:28:19 [debug] 22#22: *46056 reusable connection: 0
2018/08/01 03:28:19 [debug] 22#22: *46056 free: 000055A00F6F0560
2018/08/01 03:28:19 [debug] 22#22: *46056 free: 000055A00F6DE400, unused: 136
(and then apparently it died)
The xhrRequest.abort() (browser side) now triggers the tried adding ... why? "bad" code path, and Nchan long polling stops working. Core dump for worker 17:
bash-4.4# gdb /usr/sbin/nginx-debug /tmp/cores/core.nginx-debug.17
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f8433fe690d in kill () from /lib/ld-musl-x86_64.so.1
(gdb) bt
#0 0x00007f8433fe690d in kill () from /lib/ld-musl-x86_64.so.1 <—— that's kill(getpid(), SIGSEGV)
#1 0x00007ffcece90ce0 in ?? ()
#2 0x000055a00eece169 in chanhead_gc_add (ch=0x55a00f6e0c00, reason=0x55a00ef27268 "sub count == 0 after spooler dequeue") at /tmp/nginx-modules/nchan/src/store/memory/memstore.c:1286
#3 0x000055a00eecb180 in memstore_spooler_bulk_dequeue_handler (spl=0x55a00f6e0c68, type=LONGPOLL, count=1, privdata=0x55a00f6e0c00) at /tmp/nginx-modules/nchan/src/store/memory/memstore.c:837
#4 0x000055a00eebfd2c in spool_bubbleup_dequeue_handler (spool=0x55a00f6e0cd8, sub=0x55a00f677560, spl=0x55a00f6e0c68) at /tmp/nginx-modules/nchan/src/store/spool.c:649
#5 0x000055a00eebf5d0 in spool_sub_dequeue_callback (sub=0x55a00f677560, data=0x55a00f6d65b0) at /tmp/nginx-modules/nchan/src/store/spool.c:488
#6 0x000055a00eeb051a in longpoll_dequeue (self=0x55a00f677560) at /tmp/nginx-modules/nchan/src/subscribers/longpoll.c:170
#7 0x000055a00eeafc78 in sudden_abort_handler (sub=0x55a00f677560) at /tmp/nginx-modules/nchan/src/subscribers/longpoll.c:30
#8 0x000055a00ed76ab9 in ngx_http_terminate_request (r=0x55a00f6daa70, rc=499) at src/http/ngx_http_request.c:2530
#9 0x000055a00ed76341 in ngx_http_finalize_request (r=0x55a00f6daa70, rc=499) at src/http/ngx_http_request.c:2347
#10 0x000055a00ed776d2 in ngx_http_test_reading (r=0x55a00f6daa70) at src/http/ngx_http_request.c:2880
#11 0x000055a00ed75fd7 in ngx_http_request_handler (ev=0x7f8426644500) at src/http/ngx_http_request.c:2242
#12 0x000055a00ed50d58 in ngx_epoll_process_events (cycle=0x55a00f66f9b0, timer=996, flags=1) at src/event/modules/ngx_epoll_module.c:902
#13 0x000055a00ed3c8f2 in ngx_process_events_and_timers (cycle=0x55a00f66f9b0) at src/event/ngx_event.c:242
#14 0x000055a00ed4dc37 in ngx_worker_process_cycle (cycle=0x55a00f66f9b0, data=0x1) at src/os/unix/ngx_process_cycle.c:750
#15 0x000055a00ed49fa7 in ngx_spawn_process (cycle=0x55a00f66f9b0, proc=0x55a00ed4db4a <ngx_worker_process_cycle>, data=0x1, name=0x55a00eefb91b "worker process", respawn=-3) at src/os/unix/ngx_process.c:199
#16 0x000055a00ed4c887 in ngx_start_worker_processes (cycle=0x55a00f66f9b0, n=8, type=-3) at src/os/unix/ngx_process_cycle.c:359
#17 0x000055a00ed4bdfd in ngx_master_process_cycle (cycle=0x55a00f66f9b0) at src/os/unix/ngx_process_cycle.c:131
#18 0x000055a00ed0422e in main (argc=1, argv=0x7ffcece91518) at src/core/nginx.c:382
The last debug log messages for worker 17:
$ egrep -a '17#17' docker/data/nginx-logs/error.log | tail -n200
...
2018/08/01 03:28:25 [debug] 17#17: *46014 http process request line
2018/08/01 03:28:25 [debug] 17#17: *46014 http request line: "GET /-/pubsub/subscribe/-1930-102?reqNr=1 HTTP/1.1"
2018/08/01 03:28:25 [debug] 17#17: *46014 http uri: "/-/pubsub/subscribe/-1930-102"
2018/08/01 03:28:25 [debug] 17#17: *46014 http args: "reqNr=1"
2018/08/01 03:28:25 [debug] 17#17: *46014 http exten: ""
2018/08/01 03:28:25 [debug] 17#17: *46014 posix_memalign: 000055A00F675960:4096 @16
2018/08/01 03:28:25 [debug] 17#17: *46014 http process request header line
2018/08/01 03:28:25 [debug] 17#17: *46014 http header: "Host: comments-for-e2e-test-embedvo-localhost-8080.localhost"
2018/08/01 03:28:25 [debug] 17#17: *46014 http header: "Connection: keep-alive"
2018/08/01 03:28:25 [debug] 17#17: *46014 http header: "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36"
2018/08/01 03:28:25 [debug] 17#17: *46014 http header: "X-Requested-With: XMLHttpRequest"
2018/08/01 03:28:25 [debug] 17#17: *46014 http header: "Accept: */*"
2018/08/01 03:28:25 [debug] 17#17: *46014 http header: "Referer: http://comments-for-e2e-test-embedvo-localhost-8080.localhost/-/embedded-comments?embeddingUrl=http://e2e-test-embedvo.localhost:8080/emb-cmts-edit-and-vote.html"
2018/08/01 03:28:25 [debug] 17#17: *46014 http header: "Accept-Encoding: gzip, deflate, br"
2018/08/01 03:28:25 [debug] 17#17: *46014 http header: "Accept-Language: en-US,en;q=0.9"
2018/08/01 03:28:25 [debug] 17#17: *46014 http header: "Cookie: esCoE2eTestPassword=public; dwCoBrId=1533094089tlu38o9; dwCoSid=POasvOPmNK3JHj.102.1533094099077.qmlh0t6tbk; XSRF-TOKEN=1533094099077.pwjp7u1bhs"
2018/08/01 03:28:25 [debug] 17#17: *46014 http header done
2018/08/01 03:28:25 [debug] 17#17: *46014 generic phase: 0
2018/08/01 03:28:25 [debug] 17#17: *46014 rewrite phase: 1
2018/08/01 03:28:25 [debug] 17#17: *46014 test location: "/"
2018/08/01 03:28:25 [debug] 17#17: *46014 test location: ".well-known"
2018/08/01 03:28:25 [debug] 17#17: *46014 test location: "-/pubsub/subscribe/"
2018/08/01 03:28:25 [debug] 17#17: *46014 test location: ~ "/-/pubsub/subscribe/(.*)$"
2018/08/01 03:28:25 [debug] 17#17: *46014 using configuration "/-/pubsub/subscribe/(.*)$"
2018/08/01 03:28:25 [debug] 17#17: *46014 http cl:-1 max:10485760
2018/08/01 03:28:25 [debug] 17#17: *46014 rewrite phase: 3
2018/08/01 03:28:25 [debug] 17#17: *46014 post rewrite phase: 4
2018/08/01 03:28:25 [debug] 17#17: *46014 generic phase: 5
2018/08/01 03:28:25 [debug] 17#17: *46014 generic phase: 6
2018/08/01 03:28:25 [debug] 17#17: *46014 http script var: "�"
2018/08/01 03:28:25 [debug] 17#17: *46014 limit_req[0]: -2 8.040
2018/08/01 03:28:25 [debug] 17#17: *46014 http script var: "_"
2018/08/01 03:28:25 [debug] 17#17: *46014 limit_req[1]: 0 0.000
2018/08/01 03:28:25 [debug] 17#17: *46014 generic phase: 7
2018/08/01 03:28:25 [debug] 17#17: *46014 http script var: "�"
2018/08/01 03:28:25 [debug] 17#17: *46014 limit conn: 4E5C3E29 5
2018/08/01 03:28:25 [debug] 17#17: *46014 add cleanup: 000055A00F6DB8A8
2018/08/01 03:28:25 [debug] 17#17: *46014 http script var: "_"
2018/08/01 03:28:25 [debug] 17#17: *46014 limit conn: 29D6A3E8 5
2018/08/01 03:28:25 [debug] 17#17: *46014 add cleanup: 000055A00F6DB8D8
2018/08/01 03:28:25 [debug] 17#17: *46014 access phase: 8
2018/08/01 03:28:25 [debug] 17#17: *46014 access phase: 9
2018/08/01 03:28:25 [debug] 17#17: *46014 access phase: 10
2018/08/01 03:28:25 [debug] 17#17: *46014 lua access handler, uri:"/-/pubsub/subscribe/-1930-102" c:1
2018/08/01 03:28:25 [debug] 17#17: *46014 lua creating new thread
2018/08/01 03:28:25 [debug] 17#17: *46014 lua reset ctx
2018/08/01 03:28:25 [debug] 17#17: *46014 http cleanup add: 000055A00F6DB938
2018/08/01 03:28:25 [debug] 17#17: *46014 lua run thread, top:0 c:1
2018/08/01 03:28:25 [debug] 17#17: *46014 lua request header: "Host: comments-for-e2e-test-embedvo-localhost-8080.localhost"
2018/08/01 03:28:25 [debug] 17#17: *46014 lua request header: "Connection: keep-alive"
2018/08/01 03:28:25 [debug] 17#17: *46014 lua request header: "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36"
2018/08/01 03:28:25 [debug] 17#17: *46014 lua request header: "X-Requested-With: XMLHttpRequest"
2018/08/01 03:28:25 [debug] 17#17: *46014 lua request header: "Accept: */*"
2018/08/01 03:28:25 [debug] 17#17: *46014 lua request header: "Referer: http://comments-for-e2e-test-embedvo-localhost-8080.localhost/-/embedded-comments?embeddingUrl=http://e2e-test-embedvo.localhost:8080/emb-cmts-edit-and-vote.html"
2018/08/01 03:28:25 [debug] 17#17: *46014 lua request header: "Accept-Encoding: gzip, deflate, br"
2018/08/01 03:28:25 [debug] 17#17: *46014 lua request header: "Accept-Language: en-US,en;q=0.9"
2018/08/01 03:28:25 [debug] 17#17: *46014 lua request header: "Cookie: esCoE2eTestPassword=public; dwCoBrId=1533094089tlu38o9; dwCoSid=POasvOPmNK3JHj.102.1533094099077.qmlh0t6tbk; XSRF-TOKEN=1533094099077.pwjp7u1bhs"
2018/08/01 03:28:25 [debug] 17#17: *46014 lua resume returned 0
2018/08/01 03:28:25 [debug] 17#17: *46014 lua light thread ended normally
2018/08/01 03:28:25 [debug] 17#17: *46014 lua deleting light thread
2018/08/01 03:28:25 [debug] 17#17: *46014 post access phase: 11
2018/08/01 03:28:25 [debug] 17#17: *46014 generic phase: 12
2018/08/01 03:28:25 [debug] 17#17: *46014 generic phase: 13
2018/08/01 03:28:25 [debug] 17#17: *46014 http script capture: "-1930-102"
2018/08/01 03:28:25 [debug] 17#17: *46014 http cleanup add: 000055A00F6DB9A8
2018/08/01 03:28:25 [debug] 17#17: *46014 http script copy: "/-/pubsub/authorize-subscriber/"
2018/08/01 03:28:25 [debug] 17#17: *46014 http script capture: "-1930-102"
2018/08/01 03:28:25 [debug] 17#17: *46014 http finalize request: -4, "/-/pubsub/subscribe/-1930-102?reqNr=1" a:1, c:2
2018/08/01 03:28:25 [debug] 17#17: *46014 http request count:2 blk:0
2018/08/01 03:28:25 [debug] 17#17: *46014 http run request: "/-/pubsub/subscribe/-1930-102?reqNr=1"
2018/08/01 03:28:25 [debug] 17#17: *46014 http request empty handler
2018/08/01 03:28:26 [debug] 17#17: *46014 http run request: "/-/pubsub/subscribe/-1930-102?reqNr=1"
2018/08/01 03:28:26 [debug] 17#17: *46014 http test reading
2018/08/01 03:28:26 [info] 17#17: *46014 client prematurely closed connection, client: 172.18.0.1, server: _, request: "GET /-/pubsub/subscribe/-1930-102?reqNr=1 HTTP/1.1", host: "comments-for-e2e-test-embedvo-localhost-8080.localhost", referrer: "http://comments-for-e2e-test-embedvo-localhost-8080.localhost/-/embedded-comments?embeddingUrl=http://e2e-test-embedvo.localhost:8080/emb-cmts-edit-and-vote.html"
2018/08/01 03:28:26 [debug] 17#17: *46014 http finalize request: 499, "/-/pubsub/subscribe/-1930-102?reqNr=1" a:1, c:1
2018/08/01 03:28:26 [debug] 17#17: *46014 http terminate request count:1
2018/08/01 03:28:26 [error] 17#17: MEMSTORE:01: tried adding WAITING chanhead 000055A00F6E0C00 /-1930-102 to chanhead_gc. why?
2018/08/01 03:28:26 [error] 17#17: MEMSTORE:01: kill(getpid(), SIGSEGV) for fun and profit
***
I'm thinking, if I want to work around this quickly for now, I could switch to Redis instead of using memstore.c? Maybe then Nchan's internal state would still be okay after a worker crash? Hmm.
And maybe there's some problem with how I use Lua that triggers the first worker crash. Not sure if that's my "fault", or if there's a bug somewhere in nginx-lua-module, or if I'm using incompatible versions somehow :- /
What do you think :- )
And b.t.w. @caseyf: also in my case, Nchan apparently wasn't the cause of the worker crash. Instead, the worker crash "messed up" Nchan, so those error messages you posted in the original post started happening. (?)
@kajmagnus that's some really nice work tracking this down. Yes, Nchan does not handle worker crashes well: because the channel keyspace is sharded between workers, if one dies, the state associated with channels in that worker's shard is lost. Subscribers for the affected channels on other workers will get kicked off after about 5 seconds. Until then, the affected channels will fail to forward messages to all Nginx workers and, consequently, to all subscribers.
The quickest workaround is to run Nginx with 1 worker, but this has its obvious limitations. Using Redis won't help, since internally it works on top of the memstore. With just 1 worker, a crash will still cause gradual shared-memory leakage, but all subscribers will be disconnected after the crash, so none are left unable to receive messages.
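Conceptually, the sharding works roughly like this (an illustration only, not Nchan's actual hashing or data structures): each channel id hashes to an owner worker slot, and that owner holds the channel's state.
#include <stddef.h>
#include <stdint.h>

/* Illustration only -- not Nchan's real memstore code. */
static int channel_owner_slot(const char *chan_id, size_t len, int nworkers) {
  uint32_t h = 2166136261u;                /* FNV-1a hash of the channel id */
  for (size_t i = 0; i < len; i++) {
    h = (h ^ (uint8_t)chan_id[i]) * 16777619u;
  }
  return (int)(h % (uint32_t)nworkers);    /* the worker that owns this channel */
}

/* If the owner worker dies, the channel state in its slot is lost.
 * Subscribers held by *other* workers still reference that owner, so they
 * get kicked off (after ~5 seconds), and messages for the channel stop
 * being relayed. */
With a single worker there is only one slot, so there's no cross-worker channel state to lose, which is why that's the quickest workaround.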
The proper way, of course, is to track down the error in the Lua module or Openresty that's causing this. I've had a lot of experience digging through the Lua module's internals, so if you're interested, shoot me an email at support@nchan.io, and I can help you with that.
@slact Thanks for the info :- ) I'll first post the backtrace, plus the LuaJIT and Nginx etc. versions, in an issue at lua-nginx-module, and see what the people over there say. Maybe they'll just reply "incompatible versions": https://github.com/openresty/lua-nginx-module/issues/1361
@slact:
Subscribers for the affected channels on other workers will get kicked off after about 5 seconds. Until then, the affected channels will fail to forward messages to all Nginx workers and, consequently, to all subscribers
Is the intention that all should be fine again, after a short while? ("after about 5 seconds")
What happens, at least in my case, is that Nchan long polling stays permanently broken until an Nginx restart. This means that a worker crash breaks the whole Nginx server (from a pubsub point of view) ... I'm thinking maybe you didn't know about that?
Here's yesterday's end-to-end test ... and the problem persists today (the same Nginx server instance is still up and running):
yesterday:
2018/08/02 09:00:48 [info] 18#18: *25184 client 172.18.0.1 closed keepalive connection
2018/08/02 09:00:48 [info] 23#23: *25179 client prematurely closed connection, client: 172.18.0.1, server: _, request: "GET /-/pubsub/subscribe/-2065-104?reqNr=7 HTTP/1.1", host: "e2e-test-e1-o0-retry.localhost", referrer: "http://e2e-test-e1-o0-retry.localhost/-2/maria"
2018/08/02 09:00:48 [error] 23#23: MEMSTORE:02: tried adding WAITING chanhead 00005576B72D1DC0 /-2065-104 to chanhead_gc. why?
2018/08/02 09:00:48 [info] 14#14: *25187 client prematurely closed connection, client: 172.18.0.1, server: _, request: "GET /-/pubsub/subscribe/-2065-107?reqNr=8 HTTP/1.1", host: "e2e-test-e1-o0-retry.localhost", referrer: "http://e2e-test-e1-o0-retry.localhost/-1/message-title"
2018/08/02 09:00:48 [info] 16#16: *25186 client prematurely closed connection, client: 172.18.0.1, server: _, request: "GET /-/pubsub/subscribe/-2065-107?reqNr=9 HTTP/1.1", host: "e2e-test-e1-o0-retry.localhost", referrer: "http://e2e-test-e1-o0-retry.localhost/-1/message-title"
2018/08/02 09:00:48 [error] 14#14: MEMSTORE:01: tried adding WAITING chanhead 00005576B723B020 /-2065-107 to chanhead_gc. why?
2018/08/02 09:00:48 [info] 18#18: *25178 client prematurely closed connection, client: 172.18.0.1, server: _, request: "GET /-/pubsub/subscribe/-2065-104?reqNr=8 HTTP/1.1", host: "e2e-test-e1-o0-retry.localhost", referrer: "http://e2e-test-e1-o0-retry.localhost/-1/message-title"
2018/08/02 09:00:48 [error] 18#18: MEMSTORE:05: tried adding WAITING chanhead 00005576B72BA780 /-2065-104 to chanhead_gc. why?
:
:
:
today:
:
2018/08/03 03:14:23 [info] 18#18: *25477 client prematurely closed connection, client: 172.18.0.1, server: _, request: "GET /-/pubsub/subscribe/1-103?reqNr=1 HTTP/1.1", host: "localhost", referrer: "http://localhost/-5/chat-channel"
2018/08/03 03:14:23 [error] 18#18: MEMSTORE:05: tried adding WAITING chanhead 00005576B729DB00 /1-103 to chanhead_gc. why?
2018/08/03 03:14:23 [info] 23#23: *25486 client 172.18.0.1 closed keepalive connection
2018/08/03 03:14:24 [info] 14#14: *25389 client prematurely closed connection, client: 172.18.0.1, server: _, request: "GET /-/pubsub/subscribe/1-100?reqNr=4 HTTP/1.1", host: "localhost", referrer: "http://localhost/-5/chat-channel"
2018/08/03 03:14:24 [error] 14#14: MEMSTORE:01: tried adding WAITING chanhead 00005576B729F0A0 /1-100 to chanhead_gc. why?
Anyway, so the problem is (I think) that Nginx can segfault if the browser calls xhrRequest.abort() and Lua later runs ngx_http_lua_capture_header_filter(ngx_http_request_t *r) and calls return ngx_http_lua_next_header_filter(r) (the same as ngx_http_top_header_filter(r)). Maybe I can just avoid that code from running at all, by disabling the Lua log phase in the pub-sub location only, like so: log_by_lua_block { return }. I'll give that a try :- )
... Hmm, actually isn't it weird that Nginx apparently tries to reply to an aborted request? The backtrace, with my notes, and a sketch of the header-filter chain after it:
ngx_http_main_request_charset
ngx_http_charset_header_filter
ngx_http_userid_filter
ngx_http_headers_filter
ngx_http_lua_capture_header_filter <— maybe isn't Luas fault? ... Instead ...
ngx_http_not_modified_header_filter <— set by Nchan I would think
ngx_http_send_header
ngx_http_upstream_send_response <— ... why does Nginx do this, although req aborted :-/
ngx_http_upstream_process_header
ngx_http_upstream_handler
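As I understand it, the filters above are chained like in this sketch (the standard Nginx header-filter pattern from the development docs; not code copied from any of the modules listed): each filter module saves the previous top filter when it initializes and installs its own, so anything that calls ngx_http_send_header() runs the whole chain, Lua capture filter included.
#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>

/* Each header-filter module keeps a pointer to whatever filter was on top
 * before it, and puts itself on top -- so the filters form a chain. */
static ngx_http_output_header_filter_pt ngx_http_next_header_filter;

static ngx_int_t
my_header_filter(ngx_http_request_t *r)
{
    /* ... inspect or modify r->headers_out here ... */
    return ngx_http_next_header_filter(r);   /* continue down the chain */
}

static ngx_int_t
my_filter_init(ngx_conf_t *cf)
{
    ngx_http_next_header_filter = ngx_http_top_header_filter;  /* save old top */
    ngx_http_top_header_filter  = my_header_filter;            /* install ours */
    return NGX_OK;
}
So I suppose that's why all these filters show up in the backtrace even for an Nchan location: the upstream response handling calls ngx_http_send_header(), and the whole chain runs, apparently even for a request the client already aborted.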
@kajmagnus did you ever figure out a solution to this?
I've been having trouble with nginx becoming unresponsive. It seems that a worker or workers die and "interprocess alerts in transit" grows forever. I have to run kill -9 in a loop until nginx eventually stops. This has been happening for a long time, but I recently added some new traffic and the problem is worse.
I'm running the current nchan and nginx 1.13.12
The following message appears in the error log (the long notifications.... strings are channel IDs)