wandenberg / nginx-push-stream-module

A pure stream http push technology for your Nginx setup. Comet made easy and really scalable.

0.5.1 segfaults on ngx_http_push_stream_get_formatted_message #228

Closed: abregar closed this issue 7 years ago

abregar commented 8 years ago

This happens on one of my sites (nginx 1.2.8, luajit-2.0.4, lua-nginx-module-0.9.16, nginx-push-stream-module-0.5.1):

```
#0  0x000000000050b54e in ngx_http_push_stream_get_formatted_message (r=0xe043e10, channel=0x7f43ff0f0500, message=0x0)
    at ../nginx-push-stream-module-0.5.1/src/ngx_http_push_stream_module_utils.c:1303
        pslcf = 0x23e8f80
#1  0x000000000050912c in ngx_http_push_stream_send_response_message (r=0xe043e10, channel=0x7f43ff0f0500, msg=0x0, send_callback=1, send_separator=0)
    at ../nginx-push-stream-module-0.5.1/src/ngx_http_push_stream_module_utils.c:538
        str = 0x400
        pslcf = 0x23e8f80
        ctx = 0xe4068c8
        use_jsonp = 0
        rc = 0
#2  0x0000000000507dfe in ngx_http_push_stream_delete_channels_data (data=0x7f43fd9af000)
    at ../nginx-push-stream-module-0.5.1/src/ngx_http_push_stream_module_utils.c:123
        subscription = 0xca2f9f8
        subscriber = 0xe406a20
        channel = 0x7f43ff0f0500
        worker = 0x7f440252fa80
        cur_worker = 0x7f440252fa80
        cur = 0xca2fa08
        q = 0x7f43ff0f0528
#3  0x000000000050a99f in ngx_http_push_stream_memory_cleanup () at ../nginx-push-stream-module-0.5.1/src/ngx_http_push_stream_module_utils.c:1048
        data = 0x7f43fd9af000
        global_data = 0x7f44070c6000
        q = 0x7f43fd9bd1e8
...
(gdb) p *message
Cannot access memory at address 0x0
```

When the crash happens, dmesg also says:

```
nginx-push[5763]: segfault at 68 ip 000000000050b54e sp 00007fffde6b6270 error 4 in nginx-push[400000+149000]
Out of socket memory
Out of socket memory
...
```

and the error log is interesting:

```
2016/02/29 01:21:55 [alert] 5762#0: worker process 5763 exited on signal 11 (core dumped)
2016/02/29 09:01:59 [alert] 5762#0: worker process 10670 exited on signal 9
2016/02/29 09:01:59 [alert] 5762#0: shared memory zone "push_stream_module_global" was locked by 10670
2016/02/29 09:02:00 [alert] 5762#0: worker process 5765 exited on signal 9
2016/02/29 09:02:00 [alert] 5762#0: worker process 5772 exited on signal 9
2016/02/29 09:02:00 [alert] 5762#0: worker process 5774 exited on signal 9
```

So I would like to raise two questions:

Thank you very much in advance.

wandenberg commented 8 years ago

@abregar Can you move to the master version? (A new version will be released soon.) The shmem is not unlocked after a worker crash. Even if it were, it could be in an inconsistent state. It is better to restart the nginx server.

abregar commented 8 years ago

I have just been trying to verify for myself that the shared memory is not unlocked after the SIGSEGV is raised at that particular spot. Is there any particular improvement regarding this issue in master (e.g. checking if the message is null or something)? I can give it a try if there is.

wandenberg commented 8 years ago

There were some changes regarding memory allocation in the message creation process, especially in the last commit, since that path could use an uncleaned memory area.

abregar commented 8 years ago

OK, thanks for the hint. But according to my core dump, the message allocation happens at a different point in this case. It is the channel_deleted message used during shmem cleanup, which looks like it is allocated at ngx_http_push_stream_convert_char_to_msg_on_shared (l.262 of utils). Should this one also use calloc?
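For reference, a minimal sketch of what a zero-initialized shared-memory allocation with an explicit NULL check could look like. This is not the module's actual code; it only assumes the nginx core API (ngx_slab_calloc_locked, ngx_shmtx_lock/unlock), and the helper name is made up for illustration:

```c
#include <ngx_config.h>
#include <ngx_core.h>

/* Illustrative sketch only: allocate a zero-filled block from a shared slab
 * pool (the calloc-style variant) and fail loudly instead of returning an
 * unchecked pointer to the caller. */
static void *
shm_calloc_checked(ngx_slab_pool_t *shpool, size_t size)
{
    void *p;

    ngx_shmtx_lock(&shpool->mutex);
    p = ngx_slab_calloc_locked(shpool, size);   /* zero-filled, like calloc */
    ngx_shmtx_unlock(&shpool->mutex);

    if (p == NULL) {
        ngx_log_error(NGX_LOG_ERR, ngx_cycle->log, 0,
                      "push stream module: unable to allocate %uz bytes "
                      "in shared memory", size);
    }

    return p;
}
```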

Another question: how can one monitor the shared memory usage?

abregar commented 8 years ago

Is it possible there is an edge case where creating the channel and subscriber succeeds, but there is no space left in shmem for the channel_deleted message, resulting in a null message allocation? Such a channel would then segfault at cleanup. AFAICS nginx should log a crit line 'ngx_slab_alloc() failed: no memory', but grepping the logs I can't see any.
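If that edge case is real, a defensive guard in the cleanup path would avoid the crash regardless of why the allocation failed. A minimal sketch of the idea, assuming the nginx and module headers and the module's ngx_http_push_stream_msg_t type; the helper name is illustrative, not the module's actual function:

```c
#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>

/* Sketch of a guard for the cleanup path: before formatting/sending the
 * channel-deleted notification, skip messages that were never allocated
 * (e.g. because shm was exhausted), instead of dereferencing NULL. */
static ngx_int_t
send_deleted_notification_guarded(ngx_http_request_t *r,
                                  ngx_http_push_stream_msg_t *msg)
{
    if (msg == NULL) {
        ngx_log_error(NGX_LOG_WARN, r->connection->log, 0,
                      "push stream module: channel deleted message was not "
                      "allocated, skipping notification");
        return NGX_DECLINED;
    }

    /* ... the original formatting/sending logic would run here ... */
    return NGX_OK;
}
```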

wandenberg commented 8 years ago

@abregar the code you are using has changed, and it will be difficult to match the core dump with the current code; this is one of the reasons I asked you to use the master version. The other is that it has important fixes.

A problem allocating memory shouldn't cause a problem on cleanup; if this is happening, it is a bug. (One of these situations was fixed in the commit I mentioned.)

About monitoring shared memory usage: as far as I know there is no way to do that. For instance, when we ask to store 2 bytes in shm, nginx may fill up an 8-byte bucket. So to have a proper value for memory usage the nginx core functions would have to provide it, but they don't.
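To make the bucket rounding concrete: nginx's slab allocator serves small requests from power-of-two slots, so the bytes actually consumed can be noticeably larger than the bytes requested. A rough, standalone illustration of that rounding (the 8-byte minimum slot is an assumption based on typical 64-bit builds, and large requests are really served as whole pages, which this toy model ignores):

```c
#include <stdio.h>
#include <stddef.h>

/* Toy model of slab rounding: a request is served from the smallest
 * power-of-two slot that fits it, so "requested" != "consumed". */
static size_t slab_slot_size(size_t requested)
{
    size_t slot = 8;            /* assumed minimum slot size */

    while (slot < requested) {
        slot <<= 1;             /* next power of two */
    }
    return slot;
}

int main(void)
{
    size_t sizes[] = { 2, 9, 33, 100 };
    size_t i;

    for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        printf("request %3zu bytes -> slot %3zu bytes\n",
               sizes[i], slab_slot_size(sizes[i]));
    }
    return 0;
}
```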

Please move on to the master code, and if you still have core dumps I will be happy to debug and provide a fix if the problem is in the module.

thisiskj commented 8 years ago

I am experiencing the same issue on one of my servers. I am running nginx 1.9.11, using the push-stream-module on commit 6b52c9360ceea4d34af88117d488894423fc5ab8, which is currently 3 commits behind master. Here is the full backtrace:

```
Reading symbols from ../sbin/nginx...done.
[New LWP 11566]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `nginx: worker process   '.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000048a293 in ngx_http_push_stream_get_formatted_message (message=0x0, channel=0x0, r=0x26cb050) at ../packages/nginx-push-stream-module/src/ngx_http_push_stream_module_utils.c:1359
1359         return message->formatted_messages + pslcf->message_template_index - 1;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-10.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 nss-pam-ldapd-0.8.13-8.el7.x86_64 nss-softokn-freebl-3.16.2.3-13.el7_1.x86_64 openssl-libs-1.0.1e-51.el7_2.2.x86_64 pcre-8.32-15.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64

(gdb) backtrace full
#0  0x000000000048a293 in ngx_http_push_stream_get_formatted_message (message=0x0, channel=0x0, r=0x26cb050) at ../packages/nginx-push-stream-module/src/ngx_http_push_stream_module_utils.c:1359
        pslcf = 0x26ae898
#1  ngx_http_push_stream_send_response_message (r=0x26cb050, channel=channel@entry=0x7f6d6a8ed000, msg=0x0, send_callback=send_callback@entry=1, send_separator=send_separator@entry=0)
    at ../packages/nginx-push-stream-module/src/ngx_http_push_stream_module_utils.c:579
        pslcf = <optimized out>
        ctx = 0x26cbe30
        use_jsonp = 0
        rc = 0
#2  0x000000000048dbb1 in ngx_http_push_stream_delete_channels_data (data=data@entry=0x7f6d6a8dd000) at ../packages/nginx-push-stream-module/src/ngx_http_push_stream_module_utils.c:126
        subscription = <optimized out>
        subscriber = 0x26cbf18
        mcf = 0x266aeb0
        channel = 0x7f6d6a8ed000
        worker = 0x7f6d6a8ec0c0
        cur_worker = 0x7f6d6a8ec0c0
        cur = <optimized out>
        q = 0x7f6d6a8ed028
#3  0x000000000048e6fd in ngx_http_push_stream_memory_cleanup () at ../packages/nginx-push-stream-module/src/ngx_http_push_stream_module_utils.c:1104
        data = 0x7f6d6a8dd000
        global_data = 0x7f6d6f9a8000
        q = 0x7f6d6a8eb1e8
#4  ngx_http_push_stream_memory_cleanup_timer_wake_handler (ev=<optimized out>) at ../packages/nginx-push-stream-module/src/ngx_http_push_stream_module_utils.c:1312
No locals.
#5  0x0000000000429bb5 in ngx_event_expire_timers () at src/event/ngx_event_timer.c:94
        ev = 0x6dd620 <ngx_http_push_stream_memory_cleanup_event>
        root = 0x26696c0
        sentinel = 0x6cea80 <ngx_event_timer_sentinel>
#6  0x0000000000429828 in ngx_process_events_and_timers (cycle=cycle@entry=0x265e950) at src/event/ngx_event.c:256
        flags = 1
        timer = <optimized out>
        delta = 381
#7  0x000000000042f429 in ngx_worker_process_cycle (cycle=0x265e950, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:753
        worker = <optimized out>
#8  0x000000000042dc76 in ngx_spawn_process (cycle=cycle@entry=0x265e950, proc=proc@entry=0x42f3a8 <ngx_worker_process_cycle>, data=data@entry=0x0, name=name@entry=0x49550d "worker process", respawn=respawn@entry=-3)
    at src/os/unix/ngx_process.c:198
        on = 1
        pid = 0
        s = 0
#9  0x000000000042e7db in ngx_start_worker_processes (cycle=cycle@entry=0x265e950, n=16, type=type@entry=-3) at src/os/unix/ngx_process_cycle.c:358
        i = 0
        ch = {command = 1, pid = 0, slot = 0, fd = 0}
#10 0x000000000042fb25 in ngx_master_process_cycle (cycle=cycle@entry=0x265e950) at src/os/unix/ngx_process_cycle.c:130
        title = 0x26e2534 "master process ../sbin/nginx"
        p = <optimized out>
        size = <optimized out>
        i = <optimized out>
        n = <optimized out>
        sigio = <optimized out>
        set = {__val = {0 <repeats 16 times>}}
        itv = {it_interval = {tv_sec = 40771296, tv_usec = 0}, it_value = {tv_sec = 0, tv_usec = 0}}
        live = <optimized out>
        delay = <optimized out>
        ls = <optimized out>
        ccf = 0x2660478
#11 0x000000000041168d in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:367
---Type <return> to continue, or q <return> to quit--- 
        b = <optimized out>
        log = 0x6cc400 <ngx_log>
        i = <optimized out>
        cycle = 0x265e950
        init_cycle = {conf_ctx = 0x0, pool = 0x265e450, log = 0x6cc400 <ngx_log>, new_log = {log_level = 0, file = 0x0, connection = 0, disk_full_time = 0, handler = 0x0, data = 0x0, writer = 0x0, wdata = 0x0, action = 0x0, 
            next = 0x0}, log_use_stderr = 0, files = 0x0, free_connections = 0x0, free_connection_n = 0, modules = 0x0, modules_n = 0, modules_used = 0, reusable_connections_queue = {prev = 0x0, next = 0x0}, listening = {elts = 0x0, 
            nelts = 0, size = 0, nalloc = 0, pool = 0x0}, paths = {elts = 0x0, nelts = 0, size = 0, nalloc = 0, pool = 0x0}, config_dump = {elts = 0x0, nelts = 0, size = 0, nalloc = 0, pool = 0x0}, open_files = {last = 0x0, part = {
              elts = 0x0, nelts = 0, next = 0x0}, size = 0, nalloc = 0, pool = 0x0}, shared_memory = {last = 0x0, part = {elts = 0x0, nelts = 0, next = 0x0}, size = 0, nalloc = 0, pool = 0x0}, connection_n = 0, files_n = 0, 
          connections = 0x0, read_events = 0x0, write_events = 0x0, old_cycle = 0x0, conf_file = {len = 29, data = 0x265e4a0 "../conf/nginx.conf"}, conf_param = {len = 0, data = 0x0}, conf_prefix = {len = 19, 
            data = 0x265e4a0 "../conf/nginx.conf"}, prefix = {len = 14, data = 0x491402 "../"}, lock_file = {len = 0, data = 0x0}, hostname = {len = 0, data = 0x0}}
        cd = <optimized out>
        ccf = <optimized out>
(gdb) q
```

wandenberg commented 8 years ago

@kjad please update your server to at least 4aa2abdc420abf14a6c02e630763442bb1ea4e56; this should solve the segmentation fault that you had.

thisiskj commented 8 years ago

@wandenberg Thank you for the follow-up. Per your recommendation, I re-compiled using the latest commit b2056f63c27117952afc82d7a295acf0054edc28, but I am still seeing the same error. Still running nginx 1.9.11. It seems to happen approximately once a week, on low-traffic days. I can provide another backtrace if you request it, but it looks the same as the previous post.

wandenberg commented 8 years ago

@kjad please create a new coredump and make it available to me together with your binary.

abregar commented 8 years ago

I was busy with other things and unable to try the latest master build, but I see @kjad followed up with the most recent build and the same issue still occurs. (@kjad, were you able to produce a coredump?)

I would just like to consider the chance that 'Out of socket memory' is the real issue here, since I see that in dmesg when the crash happens. Reading the kernel sources on TCP, this message can occur at only a single point, in the 'tcp_out_of_resources' function. And there are two cases for that function to return true:

@wandenberg I see you have added some logging to the shm_alloc_error branch. Is this the branch that also tries to address this issue?

thisiskj commented 8 years ago

@wandenberg I cannot provide the coredump file and binary, but I can run any gdb/debugging commands you might need. Give me a list of commands, and I will provide the output.

wandenberg commented 8 years ago

@abregar and @kjad can you try the shm_alloc_error branch and provide a backtrace and, if possible, a coredump when the error happens? This branch has some log messages that will help to track the problem, and also a slight change to an allocation function call that may fix the issue.

thisiskj commented 8 years ago

@wandenberg I will switch our servers to use the shm_alloc_error branch next week and provide the logs when the error happens. Thanks again.

abregar commented 8 years ago

I was not able to mess with prod, but I finally got to look into this a bit more deeply. I was able to model my use case in a lab environment and reliably reproduce the issue. It ends with:

```
push stream module: unable to allocate root structure message in shared memory
*7625 push stream module: unable to allocate memory to channel deleted message,
```

see shm_additional_log.zip

wandenberg commented 8 years ago

@abregar can you send me the script you used to reproduce the problem? It will make debugging and fixing it easier.

abregar commented 8 years ago

Sure. Run sub_stress.sh and pub_stress.sh until nginx starts complaining about shm allocation problems, then run delete.sh. The scripts are in piexoto.sh.zip.

Is there any particular reason that channel->channel_deleted_message is only allocated in shm when the delete occurs (which will fail if there is no space left)? Wouldn't it benefit from a single static message instance?

wandenberg commented 7 years ago

Hi @abregar and @kjad, can you test the code on the branch https://github.com/wandenberg/nginx-push-stream-module/tree/issue228 ? @abregar the channel_deleted_message is allocated in shm only when the delete occurs, for two reasons: 1) it is not a widely used feature, so this way we save memory (and time) for applications that do not use it; 2) you can specify a custom channel-deleted text to be sent in the message.
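For reference, a minimal sketch of that lazy-allocation idea with the failure path handled explicitly. It assumes the nginx and module headers are available; the helper name, the copy step, and the channel id field used in the log line are illustrative assumptions, not the module's actual code:

```c
/* Sketch: the channel-deleted message is created in shared memory only when
 * a channel is actually deleted, and callers must tolerate a failed
 * allocation instead of dereferencing NULL later in the cleanup path. */
static ngx_int_t
ensure_channel_deleted_message(ngx_slab_pool_t *shpool,
                               ngx_http_push_stream_channel_t *channel,
                               ngx_str_t *configured_text)
{
    if (channel->channel_deleted_message != NULL) {
        return NGX_OK;                              /* already allocated */
    }

    channel->channel_deleted_message =
        ngx_slab_calloc(shpool, sizeof(ngx_http_push_stream_msg_t));

    if (channel->channel_deleted_message == NULL) {
        ngx_log_error(NGX_LOG_ERR, ngx_cycle->log, 0,
                      "push stream module: unable to allocate channel "
                      "deleted message for channel \"%V\"", &channel->id);
        return NGX_ERROR;
    }

    /* copy configured_text into the new message here (omitted) */
    return NGX_OK;
}
```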

(Sorry for the long delay; I'm returning to the development and testing of this module after a long period of inactivity.)