wandenberg / nginx-push-stream-module

A pure stream http push technology for your Nginx setup. Comet made easy and really scalable.

Sporadic errors while allocating memory for payload #218

Closed rmihael closed 8 years ago

rmihael commented 8 years ago

Sometimes we see errors like these in NGINX logs:

2015/12/11 10:31:13 [emerg] 1656#0: *83643 malloc(4160122685316607682) failed (12: Cannot allocate memory), client: 199.172.221.99, server: , request: "GET /push/ws/c4ea1a2fecd7f8b96acf5e3b5/c4ea1a2fecd7f8b96acf5e3b5u565d9eab744376352e13cb35?_=1449829322760&tag=&time=&eventid= HTTP/1.1", host: "push.nimble.com"
2015/12/11 10:31:13 [error] 1656#0: *83643 push stream module: unable to allocate memory for payload, client: 199.172.221.99, server: , request: "GET /push/ws/c4ea1a2fecd7f8b96acf5e3b5/c4ea1a2fecd7f8b96acf5e3b5u565d9eab744376352e13cb35?_=1449829322760&tag=&time=&eventid= HTTP/1.1", host: "push.nimble.com"
2015/12/11 10:55:07 [emerg] 1656#0: *85718 malloc(13441697687894719624) failed (12: Cannot allocate memory), client: 124.104.174.156, server: , request: "GET /push/ws/c54bda86d647536b36ff48ff9/c54bda86d647536b36ff48ff9u559bb48c5424db4d5ed01395?_=1449830828585&tag=&time=&eventid= HTTP/1.1", host: "push.nimble.com"
2015/12/11 10:55:07 [error] 1656#0: *85718 push stream module: unable to allocate memory for payload, client: 124.104.174.156, server: , request: "GET /push/ws/c54bda86d647536b36ff48ff9/c54bda86d647536b36ff48ff9u559bb48c5424db4d5ed01395?_=1449830828585&tag=&time=&eventid= HTTP/1.1", host: "push.nimble.com"

This correlates with failures to deliver messages to clients that we sometimes observe.

Here is some additional information:

The NGINX server is built as follows:

# nginx -V
nginx version: nginx/1.8.0
built by gcc 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC)
built with OpenSSL 1.0.1k-fips 8 Jan 2015
TLS SNI support enabled
configure arguments: --add-module=../nginx-push-stream-module-0.5.1/ --with-http_ssl_module --with-http_gzip_static_module --with-http_stub_status_module --prefix=/usr

Our memory configuration looks like this:

push_stream_shared_memory_size 512M;
push_stream_message_ttl 600; # 10 minutes

This is what we have from the channel stats:

{"hostname": "....", "time": "2015-12-11T12:37:56", "channels": "1432", "wildcard_channels": "0", "published_messages": "71503", "stored_messages": "721", "messages_in_trash": "19", "channels_in_trash": "17", "subscribers": "654", "uptime": "61976", "by_worker": [
{"pid": "1654", "subscribers": "145", "uptime": "14994"},
{"pid": "1655", "subscribers": "132", "uptime": "14994"},
{"pid": "1656", "subscribers": "251", "uptime": "14994"},
{"pid": "1657", "subscribers": "126", "uptime": "14994"}
]}
wandenberg commented 8 years ago

@rmihael this message is not related to shared memory; it is an allocation in regular (heap) memory. How much memory does your server have? Do you have swap? How big are the messages being posted? To help investigate the message size, try replacing the line

ngx_log_error(NGX_LOG_ERR, r->connection->log, 0, "push stream module: unable to allocate memory for payload");

with

ngx_log_error(NGX_LOG_ERR, r->connection->log, 0, "push stream module: unable to allocate memory for payload for %d bytes", ctx->frame->payload_len);

rmihael commented 8 years ago

Server memory seems to be in good shape:

# free -m
             total       used       free     shared    buffers     cached
Mem:          1655       1470        184          0        329        566
-/+ buffers/cache:        573       1081
Swap:         1919          6       1913

I don't think we are sending any messages bigger than 20 KB. I'll recompile nginx with the change you proposed and get back with accurate measurements.

rmihael commented 8 years ago

Actually, I think it may be useful to have the full payload that failed allocation. Can you suggest a log line that would dump it? Sorry for asking, but it's been a long time since I last used C for anything.

wandenberg commented 8 years ago

At that point the module does not have the payload yet; it is allocating the memory into which the payload will be read from the socket.
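
Roughly, the flow looks like this (a simplified standalone sketch of the usual pattern, not the module's actual code; read_payload and the error text are only illustrative):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* The declared length comes from the frame header; the buffer is
 * allocated from that value before any payload bytes are read. */
static unsigned char *read_payload(uint64_t declared_len) {
    unsigned char *buf = malloc(declared_len);   /* this is the malloc() that fails */
    if (buf == NULL) {
        fprintf(stderr, "unable to allocate memory for payload (%llu bytes)\n",
                (unsigned long long) declared_len);
        return NULL;
    }
    /* only after the allocation succeeds is the payload read from the socket */
    return buf;
}

So if the declared length is garbage, the allocation fails before there is any payload to dump.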

rmihael commented 8 years ago

This is what I got in the logs after changing the line as you suggested:

2015/12/16 10:56:14 [error] 5312#0: *2450 push stream module: unable to allocate memory for payload for -2146963871 bytes

It's quite strange; I don't think we have any messages with a size anywhere near the signed integer range.

rmihael commented 8 years ago

Here are more logs from the last few hours:

2015/12/16 10:56:14 [error] 5312#0: *2450 push stream module: unable to allocate memory for payload for -2146963871 bytes
2015/12/16 11:01:17 [error] 5310#0: *3495 push stream module: unable to allocate memory for payload for 201314684 bytes
2015/12/16 11:25:07 [error] 5310#0: *6125 push stream module: unable to allocate memory for payload for -2145296493 bytes
2015/12/16 11:28:00 [error] 5311#0: *6407 push stream module: unable to allocate memory for payload for 1653362954 bytes
2015/12/16 11:37:25 [error] 5310#0: *7405 push stream module: unable to allocate memory for payload for -2090981240 bytes
2015/12/16 11:37:39 [error] 5313#0: *6489 push stream module: unable to allocate memory for payload for -1628155000 bytes
2015/12/16 12:02:50 [error] 5310#0: *9403 push stream module: unable to allocate memory for payload for 1773878779 bytes
2015/12/16 12:17:30 [error] 5311#0: *11165 push stream module: unable to allocate memory for payload for 839370131 bytes

For completeness' sake, I'll add logging of the payload size on our application side.

rmihael commented 8 years ago

I profiled payload sizes on the application side. There are some bigger messages of ~64 KB, but fewer than 10 of those per hour. The vast majority of messages are under 1 KB. Nothing anywhere near megabytes.

wandenberg commented 8 years ago

@rmihael I'm guessing this is related to the partial (fragmented) frames that the WebSocket RFC describes but that the module does not support yet. I will prepare some log messages and ask you to add them on your server to debug. Can you do that for me?
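
For reference, this is roughly how an RFC 6455 frame header declares its payload length (a standalone sketch under that assumption, not the module's code). If a read starts mid-frame, for example on the continuation bytes of a fragmented frame treated as a fresh header, arbitrary payload bytes end up in the 64-bit extended length field, which would explain the huge (and, when printed with %d, "negative") sizes in the logs above:

#include <stddef.h>
#include <stdint.h>

/* hdr points at the start of what is assumed to be a frame header,
 * n is how many header bytes are available */
static uint64_t ws_declared_payload_len(const unsigned char *hdr, size_t n) {
    if (n < 2) {
        return 0;
    }
    uint64_t len = hdr[1] & 0x7F;                 /* low 7 bits of the second byte */
    if (len == 126 && n >= 4) {                   /* 16-bit extended length */
        len = ((uint64_t) hdr[2] << 8) | hdr[3];
    } else if (len == 127 && n >= 10) {           /* 64-bit extended length */
        len = 0;
        for (int i = 0; i < 8; i++) {
            len = (len << 8) | hdr[2 + i];        /* garbage bytes => garbage length */
        }
    }
    return len;
}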

rmihael commented 8 years ago

Definitely. Looking forward to these messages.

wandenberg commented 8 years ago

@rmihael can you try the code on the payload_check branch? It will log the received size and the one it chose to use.

wandenberg commented 8 years ago

@rmihael Did you have an opportunity to test this?

rmihael commented 8 years ago

We deployed the payload_check branch today but haven't seen relevant logs yet. I'll update this issue as soon as we have something.

rmihael commented 8 years ago

After deploying the payload_check branch we have had zero allocation errors for three full days; usually we get ~10 per day. Either something in the branch fixed the problem or something broke the error logging.

wandenberg commented 8 years ago

Great.

Just to be sure: you are using 876508f83676122ffbd4253ae11ae8ef8968601b, right? If so, "roll back" to baa1e9adda4c3114838369375c9416b02f5e4aa7. If the errors return, that means the problem was fixed by the "fragmented frames" support added in the last commit.

rmihael commented 8 years ago

Strange: we used baa1e9a, but still see no allocation errors. There are some errors, but they seem irrelevant:

2016/01/11 17:18:43 [error] 1112#0: *3304 push stream module: Don't have at least one of the mandatory headers: Connection, Upgrade, Sec-WebSocket-Key and Sec-WebSocket-Version, client: 158.96.4.13, server: , request: "GET /push/ws/c50a921c032cc4e4b8c92f9d0/c50a921c032cc4e4b8c92f9d0u562f8f297443765254f5dad1?_=1452532713640&tag=&time=&eventid= HTTP/1.1", host: "push.nimble.com"
2016/01/11 23:56:40 [error] 1110#0: *12460 push stream module: Don't have at least one of the mandatory headers: Connection, Upgrade, Sec-WebSocket-Key and Sec-WebSocket-Version, client: 158.96.4.13, server: , request: "GET /push/ws/c50a921c032cc4e4b8c92f9d0/c50a921c032cc4e4b8c92f9d0u562f8f297443765254f5dad1?_=1452556599993&tag=&time=&eventid= HTTP/1.1", host: "push.nimble.com"
2016/01/12 14:39:49 [error] 1111#0: *28350 push stream module: Don't have at least one of the mandatory headers: Connection, Upgrade, Sec-WebSocket-Key and Sec-WebSocket-Version, client: 162.218.214.1, server: , request: "GET /push/ws/c565862ab7eb0256e34a5c58e/c565862ab7eb0256e34a5c58eu568d65282007d00bc67aaf4d?_=1452609588384&tag=&time=&eventid= HTTP/1.1", host: "push.nimble.com"
2016/01/12 14:47:34 [error] 1113#0: *28382 push stream module: Don't have at least one of the mandatory headers: Connection, Upgrade, Sec-WebSocket-Key and Sec-WebSocket-Version, client: 162.218.214.1, server: , request: "GET /push/ws/c565862ab7eb0256e34a5c58e/c565862ab7eb0256e34a5c58eu568d65282007d00bc67aaf4d?_=1452610053759&tag=&time=&eventid= HTTP/1.1", host: "push.nimble.com"
2016/01/12 16:29:03 [error] 1113#0: *31824 push stream module: Don't have at least one of the mandatory headers: Connection, Upgrade, Sec-WebSocket-Key and Sec-WebSocket-Version, client: 158.96.4.13, server: , request: "GET /push/ws/c50a921c032cc4e4b8c92f9d0/c50a921c032cc4e4b8c92f9d0u562f8f297443765254f5dad1?_=1452616142105&tag=&time=&eventid= HTTP/1.1", host: "push.nimble.com"
2016/01/12 16:30:14 [error] 1113#0: *31861 push stream module: Don't have at least one of the mandatory headers: Connection, Upgrade, Sec-WebSocket-Key and Sec-WebSocket-Version, client: 158.96.4.13, server: , request: "GET /push/ws/c50a921c032cc4e4b8c92f9d0/c50a921c032cc4e4b8c92f9d0u562f8f297443765254f5dad1?_=1452616211531&tag=&time=&eventid= HTTP/1.1", host: "push.nimble.com"

rmihael commented 8 years ago

baa1e9a continues to work without "unable to allocate memory for payload" errors, but today we got a bunch of errors related to shared memory. They persisted until we restarted the nginx process.

2016/01/19 23:04:01 [crit] 1112#0: ngx_slab_alloc() failed: no memory
2016/01/19 23:04:01 [error] 1112#0: *216475 push stream module: unable to allocate message in shared memory, client: 111.222.113.114, server: , request: "POST /push/pub?id=c4d1be88768aa51a15f1beb0eu5386568bf93a5f2f261efb93 HTTP/1.1", host: "push.host.com"

wandenberg commented 8 years ago

Hi @rmihael, the lack of memory can mean that the configured shared memory is too small for the needs of your application, or it can come from a misinterpretation of the payload size that results in a big, but non-negative, number. We can make the module always log the payload size and check whether everything is in the expected range. If it is, it may be necessary to adjust your configuration. If it is not, it may be that just moving back to 876508f really does fix the wrong parsing of the payload size. What do you think? Can we do this test?
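
Something along these lines would do for the unconditional logging (just a sketch; where exactly it goes in the publish path, and the variable holding the message length, are assumptions):

ngx_log_error(NGX_LOG_INFO, r->connection->log, 0, "push stream module: publishing message of %uz bytes", len);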

rmihael commented 8 years ago

During the past few days we saw the following errors:

2016/01/21 18:42:31 [crit] 20706#0: ngx_slab_alloc() failed: no memory
2016/01/21 18:42:31 [alert] 20706#0: ngx_slab_free(): pointer to wrong chunk
2016/01/21 18:42:31 [error] 20706#0: *51819 push stream module: unable to allocate message in shared memory, client: 10.83.10.24, server: , request: "POST /push/pub?id=c4c2118ad54397f271b000000u4c21187154397f270a000001 HTTP/1.1", host: "host.com"
2016/01/23 00:15:50 [crit] 20707#0: ngx_slab_alloc() failed: no memory
2016/01/23 00:15:50 [alert] 20707#0: ngx_slab_free(): outside of pool
2016/01/23 00:15:50 [alert] 20707#0: ngx_slab_free(): pointer to wrong chunk
2016/01/23 00:15:50 [alert] 20707#0: ngx_slab_free(): pointer to wrong page
2016/01/23 00:15:50 [error] 20707#0: *91021 push stream module: unable to allocate message in shared memory, client: 10.30.9.176, server: , request: "POST /push/pub?id=c4c2118ad54397f271b000000u4c21187154397f270a000001 HTTP/1.1", host: "host.com"
2016/01/27 13:46:04 [crit] 20706#0: ngx_slab_alloc() failed: no memory
2016/01/27 13:46:04 [error] 20706#0: *200800 push stream module: unable to allocate message in shared memory, client: 10.204.142.145, server: , request: "POST /push/pub?id=c4c65bc2106d795162e000038u4c65bc2106d795162e000036 HTTP/1.1", host: "host.com"
2016/01/27 13:46:04 [error] 20706#0: *200801 push stream module: unable to allocate message in shared memory, client: 10.204.142.145, server: , request: "POST /push/pub?id=c4c65bc2106d795162e000038u4c65bc2106d795162e000036 HTTP/1.1", host: "host.com"
2016/01/27 13:47:32 [alert] 20707#0: ngx_slab_free(): page is already free

That's happening on baa1e9a.

I would be happy to track it down. We can deploy a version with more verbose logging if you provide one.

wandenberg commented 8 years ago

Hi @rmihael, can you test commit 8d70918c14a8a74fcebe0f21b3a7019aff905ade? I fixed a potential error point in the message template apply process and added some log messages to help us investigate the problem. What value have you set for the shared memory size? My guess is that it is not compatible with the publish message rate of your application. Do you have a Hangouts or Skype user?

rmihael commented 8 years ago

Hi @wandenberg, sorry for not responding sooner; I was away for a few days.

Today we deployed 8d70918. No errors so far and no relevant log messages. The shared memory size is set to 64M. We can increase it, but our message traffic is fairly low (peaking at ~100 per 5 minutes), so I don't think we are really hitting this limit. Payload sizes are small: all messages are below 40K and 90% are below 2K.

I'm mykhailo.korbakov on Skype and available for a chat during European day and evening.