wandenberg / nginx-push-stream-module

A pure stream http push technology for your Nginx setup. Comet made easy and really scalable.

Infinite loop in ngx_http_push_stream_collect_expired_messages_and_empty_channels() #53

Closed · janisdz closed this issue 11 years ago

janisdz commented 12 years ago

nginx with your module was running flawlessly for a month, handling more than 10,000 channels and hundreds of thousands of subscribers. Recently I noticed that one of my 2 nginx workers was stuck at 100% CPU. Some debugging via gdb revealed an infinite loop in ngx_http_push_stream_rbtree_insert(). The second worker was stuck trying to acquire the shared memory lock. This rendered the whole server unavailable.

Here is the stacktrace:

    (gdb) bt
    #0  0x0000000000471791 in ngx_http_push_stream_rbtree_insert ()
    #1  0x0000000000413520 in ngx_rbtree_insert ()
    #2  0x000000000047ad71 in ngx_http_push_stream_collect_expired_messages_and_empty_channels ()
    #3  0x000000000047adbd in ngx_http_push_stream_collect_expired_messages_and_empty_channels ()
    #4  0x000000000047adbd in ngx_http_push_stream_collect_expired_messages_and_empty_channels ()
    #5  0x000000000047adbd in ngx_http_push_stream_collect_expired_messages_and_empty_channels ()
    #6  0x000000000047adbd in ngx_http_push_stream_collect_expired_messages_and_empty_channels ()
    #7  0x000000000047adbd in ngx_http_push_stream_collect_expired_messages_and_empty_channels ()
    #8  0x000000000047adbd in ngx_http_push_stream_collect_expired_messages_and_empty_channels ()
    #9  0x000000000047ab75 in ngx_http_push_stream_collect_expired_messages_and_empty_channels ()
    #10 0x000000000047ab75 in ngx_http_push_stream_collect_expired_messages_and_empty_channels ()
    #11 0x000000000047ab75 in ngx_http_push_stream_collect_expired_messages_and_empty_channels ()
    #12 0x000000000047ae6c in ngx_http_push_stream_memory_cleanup_timer_wake_handler ()
    #13 0x0000000000420e8c in ngx_event_expire_timers ()
    #14 0x0000000000420c7f in ngx_process_events_and_timers ()
    #15 0x0000000000426e78 in ngx_worker_process_cycle ()
    #16 0x00000000004253b7 in ngx_spawn_process ()
    #17 0x0000000000427a47 in ngx_master_process_cycle ()
    #18 0x000000000040b87b in main ()

After doing some disassembly, it turned out the code is stuck inside ngx_rbtree_generic_insert():

    ......
    static void
    ngx_rbtree_generic_insert(ngx_rbtree_node_t *temp, ngx_rbtree_node_t *node,
        ngx_rbtree_node_t *sentinel,
        int (*compare) (const ngx_rbtree_node_t *left, const ngx_rbtree_node_t *right))
    {
        for (;;) {
            if (node->key < temp->key) {
                if (temp->left == sentinel) {  /* <---- This never becomes true because (temp->left == temp) */
                    temp->left = node;
                    break;
                }
                temp = temp->left;
            } else if (node->key > temp->key) {
                if (temp->right == sentinel) {
    ......
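For illustration only, here is a small standalone sketch (not the module's code; the demo_node_t struct and its values are made up) of why such a self-referential pointer stalls the descent: once temp->left points back at temp, the loop can never reach the sentinel. An iteration cap is added just so the demo terminates.

    /*
     * Standalone illustration, not the module's code: a corrupted node
     * whose ->left points back at itself keeps the descent loop from
     * ever reaching the sentinel.
     */
    #include <stdio.h>

    typedef struct demo_node_s  demo_node_t;

    struct demo_node_s {
        unsigned long  key;
        demo_node_t   *left;
        demo_node_t   *right;
    };

    int
    main(void)
    {
        demo_node_t  sentinel = { 0, NULL, NULL };
        demo_node_t  root     = { 100, NULL, NULL };
        demo_node_t  node     = { 50, &sentinel, &sentinel };  /* node being inserted */
        demo_node_t *temp     = &root;
        int          i;

        root.left  = &root;        /* simulated corruption: left child points at itself */
        root.right = &sentinel;

        /* Same descent logic as the generic insert: go left while the new
         * key is smaller, attach the node once the sentinel is reached. */
        for (i = 0; i < 10; i++) {
            if (node.key < temp->key) {
                if (temp->left == &sentinel) {   /* never true: temp->left == temp */
                    temp->left = &node;
                    printf("inserted after %d steps\n", i + 1);
                    return 0;
                }
                temp = temp->left;               /* stays on the very same node */
            }
        }

        printf("gave up after %d steps: the loop is stuck on one node\n", i);
        return 1;
    }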

I attribute it to some problem with locking the shared memory that leads to a memory stomp, so I'll try running nginx with only one worker and see if this happens again. This looks like a really nasty and hard-to-reproduce bug.
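For context, the usual nginx pattern for protecting a shared-memory structure like this rbtree is to hold the slab pool mutex across the whole modification. Below is a hedged sketch of that pattern: ngx_shmtx_lock/ngx_shmtx_unlock, ngx_slab_pool_t and ngx_rbtree_insert are nginx core APIs, while the wrapper function itself is purely illustrative and not taken from the module.

    /*
     * Sketch of the standard nginx shared-memory locking pattern: the slab
     * pool mutex is held for the entire tree update, so two workers cannot
     * interleave and leave node pointers half-written. The wrapper function
     * is illustrative only.
     */
    #include <ngx_config.h>
    #include <ngx_core.h>

    static void
    example_insert_node_locked(ngx_slab_pool_t *shpool, ngx_rbtree_t *tree,
        ngx_rbtree_node_t *node)
    {
        ngx_shmtx_lock(&shpool->mutex);

        /* The whole read-modify-write of the shared tree happens under the lock. */
        ngx_rbtree_insert(tree, node);

        ngx_shmtx_unlock(&shpool->mutex);
    }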

wandenberg commented 12 years ago

Hi,

did you cut the output of bt? I don't see any infinite loop in the part you showed. What I see are some recursive calls to ngx_http_push_stream_collect_expired_messages_and_empty_channels, which is normal. As for the code you point to as part of the problem, it doesn't have a direct relation with the function ngx_http_push_stream_collect_expired_messages_and_empty_channels, and it is just searching for a leaf of the rbtree to attach a new node to, so at some point it will find a node where ->left or ->right equals the sentinel.

If you are handling hundreds of thousands of subscribers, it should be better to increase the number of workers, not decrease it, right?

If the problem happens again, please try to generate a core dump and send it to me.

janisdz commented 12 years ago

Ok, some more details: this is a release build, so some functions are inlined/optimized and do not appear on the stack. ngx_http_push_stream_rbtree_insert() calls ngx_rbtree_generic_insert(). There is a for loop in ngx_rbtree_generic_insert(), and a fragment of it is shown in the code above (the indentation was stripped on submit). It did not break out of the for loop because, for some unknown reason, in my case the variable temp->left was equal to temp itself.

There is no need to increase the worker count, as the resources taken by nginx are negligible compared to the other stuff running on the machine.

wandenberg commented 11 years ago

Hi,

I did a refactor of this and some other functions that go through all channels to execute some job, so that they no longer do it with the tree structure but use a queue instead, avoiding some strange situations with race conditions. You can check it here: https://github.com/wandenberg/nginx-push-stream-module/commit/152326abcba6becea1d56f3592bcb85fbe42c3bc
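Roughly, the idea of that refactor can be sketched as follows (this is not the commit's actual code; example_channel_t, its field names and the cleanup function are illustrative, and only the ngx_queue_* macros are nginx core): each channel also carries an ngx_queue_t link, and the periodic cleanup walks that flat list instead of recursing through the shared rbtree.

    /*
     * Rough sketch of the refactor's idea, not the commit's actual code:
     * channels are also linked into a flat ngx_queue_t, so the periodic
     * cleanup walks a plain list instead of recursing over the shared
     * rbtree. The struct and field names below are illustrative only.
     */
    #include <ngx_config.h>
    #include <ngx_core.h>

    typedef struct {
        ngx_rbtree_node_t  node;         /* channel id stays indexed in the rbtree */
        ngx_queue_t        queue;        /* link used for linear traversal */
        ngx_uint_t         subscribers;
        time_t             expires;
    } example_channel_t;

    static void
    example_collect_expired_channels(ngx_queue_t *channels, time_t now)
    {
        ngx_queue_t        *q, *next;
        example_channel_t  *channel;

        for (q = ngx_queue_head(channels);
             q != ngx_queue_sentinel(channels);
             q = next)
        {
            /* Save the next link first, since the current element may be removed. */
            next = ngx_queue_next(q);
            channel = ngx_queue_data(q, example_channel_t, queue);

            if (channel->subscribers == 0 && channel->expires < now) {
                ngx_queue_remove(q);
                /* ... free the channel back to the shared slab pool ... */
            }
        }
    }

Removing an element here only touches its own queue links, so a traversal cannot be sent into a cycle by a partially updated tree.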