Closed — abregar closed this issue 7 years ago
@abregar Can you move to the master version? (A new version will be released soon.) The shmem is not unlocked after a worker crash. Even if it were, it could be in an inconsistent state. It is better to restart the nginx server.
I have just been trying to verify myself that the memory is not unlocked after raising SIGSEGV at that particular spot. Is there any improvement regarding this particular issue in master (e.g. checking if the message is null or something)? I can give it a try if there is.
There were some changes regarding memory allocation in the message creation process, especially in the last commit, which could end up using an uncleaned memory area.
OK, thanks for the hint.
But according to my core dump, the message allocation happens at a different point in this case.
It is the channel_deleted message on shmem cleanup, which looks like it was allocated at
ngx_http_push_stream_convert_char_to_msg_on_shared (line 262 of utils).
Should this one also use calloc?
Another question: how can one monitor the shared memory usage?
Is there a possible edge case where creating the channel and subscriber succeeds, but there is no space left in shmem for the channel_deleted message, resulting in a null msg allocation? Such a channel would segfault at cleanup. AFAICS nginx should emit the crit log line 'ngx_slab_alloc() failed: no memory', but grepping the logs I can't see any.
@abregar the code you are using has changed, so it will be difficult to match the core dump with the current code; this is one of the reasons I asked you to use the master version. The other is that it has important fixes.
A problem allocating memory shouldn't cause a problem on cleanup; if this is happening, it is a bug. (One of these situations was fixed in the commit I mentioned.)
About monitoring shared memory usage: as far as I know there is no way to do that. For instance, when we ask to store 2 bytes in shm, nginx may fill up an 8-byte bucket. So to get a proper value of memory usage, the nginx core functions would have to provide it, but they don't.
Please move to the master code, and if you still get core dumps I will be happy to debug and provide a fix if the problem is in the module.
I am experiencing the same issue on one of my servers. I am running nginx 1.9.11, using the push-stream-module on commit 6b52c9360ceea4d34af88117d488894423fc5ab8, which is currently 3 commits behind master. Here is the full backtrace:
Reading symbols from ../sbin/nginx...done.
[New LWP 11566]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `nginx: worker process '.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000048a293 in ngx_http_push_stream_get_formatted_message (message=0x0, channel=0x0, r=0x26cb050) at ../packages/nginx-push-stream-module/src/ngx_http_push_stream_module_utils.c:1359
1359 return message->formatted_messages + pslcf->message_template_index - 1;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-10.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 nss-pam-ldapd-0.8.13-8.el7.x86_64 nss-softokn-freebl-3.16.2.3-13.el7_1.x86_64 openssl-libs-1.0.1e-51.el7_2.2.x86_64 pcre-8.32-15.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) backtrace full
#0 0x000000000048a293 in ngx_http_push_stream_get_formatted_message (message=0x0, channel=0x0, r=0x26cb050) at ../packages/nginx-push-stream-module/src/ngx_http_push_stream_module_utils.c:1359
pslcf = 0x26ae898
#1 ngx_http_push_stream_send_response_message (r=0x26cb050, channel=channel@entry=0x7f6d6a8ed000, msg=0x0, send_callback=send_callback@entry=1, send_separator=send_separator@entry=0)
at ../packages/nginx-push-stream-module/src/ngx_http_push_stream_module_utils.c:579
pslcf = <optimized out>
ctx = 0x26cbe30
use_jsonp = 0
rc = 0
#2 0x000000000048dbb1 in ngx_http_push_stream_delete_channels_data (data=data@entry=0x7f6d6a8dd000) at ../packages/nginx-push-stream-module/src/ngx_http_push_stream_module_utils.c:126
subscription = <optimized out>
subscriber = 0x26cbf18
mcf = 0x266aeb0
channel = 0x7f6d6a8ed000
worker = 0x7f6d6a8ec0c0
cur_worker = 0x7f6d6a8ec0c0
cur = <optimized out>
q = 0x7f6d6a8ed028
#3 0x000000000048e6fd in ngx_http_push_stream_memory_cleanup () at ../packages/nginx-push-stream-module/src/ngx_http_push_stream_module_utils.c:1104
data = 0x7f6d6a8dd000
global_data = 0x7f6d6f9a8000
q = 0x7f6d6a8eb1e8
#4 ngx_http_push_stream_memory_cleanup_timer_wake_handler (ev=<optimized out>) at ../packages/nginx-push-stream-module/src/ngx_http_push_stream_module_utils.c:1312
No locals.
#5 0x0000000000429bb5 in ngx_event_expire_timers () at src/event/ngx_event_timer.c:94
ev = 0x6dd620 <ngx_http_push_stream_memory_cleanup_event>
root = 0x26696c0
sentinel = 0x6cea80 <ngx_event_timer_sentinel>
#6 0x0000000000429828 in ngx_process_events_and_timers (cycle=cycle@entry=0x265e950) at src/event/ngx_event.c:256
flags = 1
timer = <optimized out>
delta = 381
#7 0x000000000042f429 in ngx_worker_process_cycle (cycle=0x265e950, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:753
worker = <optimized out>
#8 0x000000000042dc76 in ngx_spawn_process (cycle=cycle@entry=0x265e950, proc=proc@entry=0x42f3a8 <ngx_worker_process_cycle>, data=data@entry=0x0, name=name@entry=0x49550d "worker process", respawn=respawn@entry=-3)
at src/os/unix/ngx_process.c:198
on = 1
pid = 0
s = 0
#9 0x000000000042e7db in ngx_start_worker_processes (cycle=cycle@entry=0x265e950, n=16, type=type@entry=-3) at src/os/unix/ngx_process_cycle.c:358
i = 0
ch = {command = 1, pid = 0, slot = 0, fd = 0}
#10 0x000000000042fb25 in ngx_master_process_cycle (cycle=cycle@entry=0x265e950) at src/os/unix/ngx_process_cycle.c:130
title = 0x26e2534 "master process ../sbin/nginx"
p = <optimized out>
size = <optimized out>
i = <optimized out>
n = <optimized out>
sigio = <optimized out>
set = {__val = {0 <repeats 16 times>}}
itv = {it_interval = {tv_sec = 40771296, tv_usec = 0}, it_value = {tv_sec = 0, tv_usec = 0}}
live = <optimized out>
delay = <optimized out>
ls = <optimized out>
ccf = 0x2660478
#11 0x000000000041168d in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:367
---Type <return> to continue, or q <return> to quit---
b = <optimized out>
log = 0x6cc400 <ngx_log>
i = <optimized out>
cycle = 0x265e950
init_cycle = {conf_ctx = 0x0, pool = 0x265e450, log = 0x6cc400 <ngx_log>, new_log = {log_level = 0, file = 0x0, connection = 0, disk_full_time = 0, handler = 0x0, data = 0x0, writer = 0x0, wdata = 0x0, action = 0x0,
next = 0x0}, log_use_stderr = 0, files = 0x0, free_connections = 0x0, free_connection_n = 0, modules = 0x0, modules_n = 0, modules_used = 0, reusable_connections_queue = {prev = 0x0, next = 0x0}, listening = {elts = 0x0,
nelts = 0, size = 0, nalloc = 0, pool = 0x0}, paths = {elts = 0x0, nelts = 0, size = 0, nalloc = 0, pool = 0x0}, config_dump = {elts = 0x0, nelts = 0, size = 0, nalloc = 0, pool = 0x0}, open_files = {last = 0x0, part = {
elts = 0x0, nelts = 0, next = 0x0}, size = 0, nalloc = 0, pool = 0x0}, shared_memory = {last = 0x0, part = {elts = 0x0, nelts = 0, next = 0x0}, size = 0, nalloc = 0, pool = 0x0}, connection_n = 0, files_n = 0,
connections = 0x0, read_events = 0x0, write_events = 0x0, old_cycle = 0x0, conf_file = {len = 29, data = 0x265e4a0 "../conf/nginx.conf"}, conf_param = {len = 0, data = 0x0}, conf_prefix = {len = 19,
data = 0x265e4a0 "../conf/nginx.conf"}, prefix = {len = 14, data = 0x491402 "../"}, lock_file = {len = 0, data = 0x0}, hostname = {len = 0, data = 0x0}}
cd = <optimized out>
ccf = <optimized out>
(gdb) q
@kjad please update your server at least to 4aa2abdc420abf14a6c02e630763442bb1ea4e56; this should solve the segmentation fault you had.
@wandenberg Thank you for the follow up. Per your recommendation, I re-compiled using the latest commit b2056f63c27117952afc82d7a295acf0054edc28, but I am still seeing the same error. Still running nginx 1.9.11. It seems to happen approximately once a week on low traffic days. I can provide another backtrace if you request, but it looks the same as the previous post.
@kjad please create a new coredump and make it available to me together with your binary.
I was busy with other stuff and unable to try the latest master build, but I see @kjad followed up with the most recent build and the same issue still occurs. (@kjad were you able to produce a coredump?)
I would just like to consider the chance that 'Out of socket memory' is the issue here, since I see it in dmesg when the crash happens. Reading the kernel TCP sources, this message can occur at only a single point, in the tcp_out_of_resources() function. And there are two cases for that method to return true:
@wandenberg I see you have added some logging in the shm_alloc_error branch - is this the branch that also tries to solve this issue?
@wandenberg I cannot provide the coredump file and binary, but I can run any gdb/debugging commands you might need. Give me a list of commands, and I will provide the output.
@abregar and @kjad can you try the shm_alloc_error branch and provide a backtrace and, if possible, a coredump when the error happens? This branch has some log messages that will help track the problem, and also a slight change to an allocation function call that may fix the issue.
@wandenberg I will switch our servers to the shm_alloc_error branch next week and provide the logs when the error happens. Thanks again.
I was not able to mess with prod, but finally got to look into this a bit deeper. I was able to model my use case in the lab and reliably reproduce the issue. It goes like this:
```
push stream module: unable to allocate root structure message in shared memory
*7625 push stream module: unable to allocate memory to channel deleted message,
```
@abregar can you send me the script you used to reproduce the problem? Will make the debug and fix easier.
Sure. Run sub_stress.sh and pub_stress.sh until nginx starts complain on shm allocation problem. Then run delete.sh in piexoto.sh.zip
Is there any particular reason that channel->channel_deleted_message is only allocated in shm when the delete occurs (which will fail if there is no space left)? Wouldn't a single static message instance be better?
Hi @abregar and @kjad can you test the code on this branch https://github.com/wandenberg/nginx-push-stream-module/tree/issue228 ?
@abregar the channel_deleted_message is allocated in shm only when the delete occurs for two reasons: 1) it is not a widely used feature, so this way we save memory (and time) for applications that do not use it; 2) you can specify a custom delete-channel text to be sent in the message.
(Sorry for the long delay, I'm returning the development/test of this module after a long period of inactivity)
This happens on one of my sites (nginx 1.2.8, luajit-2.0.4, lua-nginx-module-0.9.16, nginx-push-stream-module-0.5.1):
```
#0  0x000000000050b54e in ngx_http_push_stream_get_formatted_message (r=0xe043e10, channel=0x7f43ff0f0500, message=0x0)
    at ../nginx-push-stream-module-0.5.1/src/ngx_http_push_stream_module_utils.c:1303
        pslcf = 0x23e8f80
#1  0x000000000050912c in ngx_http_push_stream_send_response_message (r=0xe043e10, channel=0x7f43ff0f0500, msg=0x0, send_callback=1, send_separator=0)
#2  0x0000000000507dfe in ngx_http_push_stream_delete_channels_data (data=0x7f43fd9af000)
#3  0x000000000050a99f in ngx_http_push_stream_memory_cleanup () at ../nginx-push-stream-module-0.5.1/src/ngx_http_push_stream_module_utils.c:1048
...
(gdb) p *message
Cannot access memory at address 0x0
```
At the time of the core dump, dmesg also says:
```
nginx-push[5763]: segfault at 68 ip 000000000050b54e sp 00007fffde6b6270 error 4 in nginx-push[400000+149000]
Out of socket memory
Out of socket memory
...
```
and the error log is interesting:
```
2016/02/29 01:21:55 [alert] 5762#0: worker process 5763 exited on signal 11 (core dumped)
2016/02/29 09:01:59 [alert] 5762#0: worker process 10670 exited on signal 9
2016/02/29 09:01:59 [alert] 5762#0: shared memory zone "push_stream_module_global" was locked by 10670
2016/02/29 09:02:00 [alert] 5762#0: worker process 5765 exited on signal 9
2016/02/29 09:02:00 [alert] 5762#0: worker process 5772 exited on signal 9
2016/02/29 09:02:00 [alert] 5762#0: worker process 5774 exited on signal 9
```
so I would like to raise two questions:
thank you very much in advance