wandenberg / nginx-push-stream-module

A pure stream http push technology for your Nginx setup. Comet made easy and really scalable.

(heisenbug?) messages in trash not cleared #212

Closed ghost closed 8 years ago

ghost commented 8 years ago

Hi, I am using your module with the following configuration:

push_stream_shared_memory_size  512M;
push_stream_channel_deleted_message_text  "Channel deleted";
push_stream_channel_inactivity_time  30s;
push_stream_message_ttl 5m;
nginx version: nginx/1.8.0
built with OpenSSL 0.9.8k 25 Mar 2009
TLS SNI support enabled
configure arguments: --prefix=/data/soft/nginx --with-http_ssl_module --with-http_realip_module
--with-http_dav_module --with-http_flv_module --with-http_stub_status_module
--with-http_sub_module --with-http_image_filter_module --with-http_secure_link_module
--add-module=ngx_cache_purge-2.3 --add-module=ngx_devel_kit-v0.2.19
--add-module=set-misc-nginx-module-v0.29 --add-module=ngx_http_pinba_module-11ecb3e6f7
--add-module=nginx-push-stream-module-0.5.1
--add-module=echo-nginx-module-v0.58 --add-module=memc-nginx-module-v0.16
--add-module=nginx-let-module-v0.0.4 --with-file-aio

It works well, but sometimes (not in 100% of cases), after sending a HUP signal to nginx, the trash stops being cleared. The trash queue grows until it hits the memory limit set by the configuration directive, after which I see these messages:

2015/10/20 00:01:01 [crit] 18124#0: ngx_slab_alloc() failed: no memory
2015/10/20 00:01:01 [error] 18124#0: *7026696805 push stream module: unable to allocate message in shared memory, client: 172.16.176.132, server: example.com, request: "POST /stream/user/pub?id=EXAMPLE-messages-275659b6f7d6a683170673b07120144a HTTP/1.0", host: "example.com"
2015/10/20 00:01:01 [crit] 18127#0: ngx_slab_alloc() failed: no memory
2015/10/20 00:01:01 [error] 18127#0: *7026696859 push stream module: unable to allocate message in shared memory, client: 172.16.176.132, server: example.com, request: "POST /stream/user/pub?id=EXAMPLE-messages-275659b6f7d6a683170673b07120144a HTTP/1.0", host: "example.com"
2015/10/20 00:01:01 [crit] 18136#0: ngx_slab_alloc() failed: no memory
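One way to observe whether the trash queue is draining is the module's statistics endpoint. This is only a sketch: the `push_stream_channels_statistics` directive is provided by nginx-push-stream-module, but the location path is arbitrary, and whether the summarized output includes trash counters (e.g. `messages_in_trash`) may depend on the module version.

```nginx
# Hypothetical monitoring location; the directive is real, the path is an example.
location /channels-stats {
    push_stream_channels_statistics;
    # With no "id" argument the handler returns summarized statistics;
    # with ?id=<channel> it returns stats for that channel.
    push_stream_channels_path  $arg_id;
}
```

Polling this endpoint before and after a HUP would show whether the trash counters keep growing after a reload.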

My assumption is that in some cases the trash cannot be cleared.

As a temporary fix, I wrote the following patch (force clear):

--- src/ngx_http_push_stream_module_utils.c.orig    2015-10-20 15:24:46.458476715 +0600
+++ src/ngx_http_push_stream_module_utils.c 2015-10-20 15:33:14.100090301 +0600
@@ -1105,7 +1105,7 @@
         if (ngx_shmtx_trylock(&data->cleanup_mutex)) {
             ngx_http_push_stream_collect_deleted_channels_data(data);
             ngx_http_push_stream_collect_expired_messages_and_empty_channels_data(data, 0);
-            ngx_http_push_stream_free_memory_of_expired_messages_and_channels_data(data, 0);
+            ngx_http_push_stream_free_memory_of_expired_messages_and_channels_data(data, 1);
             ngx_shmtx_unlock(&data->cleanup_mutex);
         }
     }

So, my questions:

Thanks in advance!

wandenberg commented 8 years ago

Hi @theairkit

According to your description, it seems that during the reload process (HUP) at least one of the nginx workers failed to clean up its references in the shared memory, and this could be caused by any of the modules you are using. All workers must exit with status "0" after the HUP signal to keep things working properly.

Try enabling debug mode and generating core dumps, so it is possible to check which module is causing the worker to die incorrectly.
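A minimal sketch of enabling core dumps for crashed workers, using standard nginx directives (the size and directory here are examples; the directory must exist and be writable by the worker user):

```nginx
# In the main (top-level) context of nginx.conf:
worker_rlimit_core  500M;        # allow workers to write core files
working_directory   /tmp/cores;  # where core files will be written
```

The system-level core limit (`ulimit -c`) for the user starting nginx may also need to be raised for cores to actually be produced.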

Answering your questions:

ghost commented 8 years ago

@wandenberg thanks for the answer and explanation!

So, it seems I've found the reason for this behavior: an old, old cron job that kills nginx workers which have been in the 'shutdown' state for more than 30 seconds...

Right now I have disabled it.

Sorry for this 'false' issue, and thanks again!

seletskiy commented 8 years ago

@wandenberg: Actually, the problem is a little bit more complex. We have a fast release cycle, and it's common for us to reload nginx dozens of times per day. With such frequent reloads, workers in the shutdown state (which are still serving slow clients) keep accumulating throughout the day, and we are forced to kill them.

So, we are caught between the devil and the deep blue sea: (a) either have slow workers hanging around; or (b) have memory leaks due to breaking push-stream-module's cleanup routines by improperly killing workers.

I guess it would be nice to be able to manually clean up trashed messages?

wandenberg commented 8 years ago

@seletskiy as I said before, changing the behavior of the cleanup routines can cause undesired and unknown side effects and should be done very carefully.

Which value did you set for push_stream_subscriber_connection_ttl? The default value is "never", letting connections live forever. Try setting a value compatible with your reload frequency, let's say 15 minutes. With that, all subscribers will be disconnected after that time, and your workers will be able to complete the reload process in at most that time, no matter whether a client is slow or not.
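The suggestion above as a config fragment (the 15-minute value is only an example; pick one compatible with your reload frequency):

```nginx
# Disconnect subscribers after 15 minutes, so workers in the
# "shutting down" state after a reload can exit within that window.
push_stream_subscriber_connection_ttl  15m;
```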

Just out of curiosity, do you really need to reload an nginx instance used for data streaming that frequently? Reloads are necessary only when the configuration has changed. I would suggest splitting your setup into two services, one for streaming and one for the configuration that needs to be reloaded frequently, if possible.
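One way to realize this split, sketched with standard nginx command-line flags (the prefixes and config file names are examples, not part of the original setup), is to run two independent nginx masters, so that reloading the frequently-changing one never touches the streaming workers:

```shell
# Two independent nginx instances (paths are examples):
nginx -p /data/soft/nginx-streaming -c conf/streaming.conf   # rarely reloaded
nginx -p /data/soft/nginx-app       -c conf/app.conf         # reloaded per release

# Reload only the app instance; streaming workers are untouched:
nginx -p /data/soft/nginx-app -c conf/app.conf -s reload
```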