tatsuhiro-t / spdylay

The experimental SPDY protocol version 2, 3 and 3.1 implementation in C
http://tatsuhiro-t.github.io/spdylay/
MIT License
603 stars 102 forks source link

shrpx errors cause the server to hang #79

Closed n0wa11 closed 10 years ago

n0wa11 commented 10 years ago

I run spdylay 1.0.0 on Ubuntu 12.04, with spdy/3 turned off.

shrpx runs in front of haproxy. The errors occur on the busiest server -- actually the loan is not very high (<0.5) and bandwidth < 10Mbps, when the errors occur. So I haven't been able to find a way to reproduce the errors.

At first, there were the following connection errors.

[^[[1;33mWARN^[[0m] [UPSTREAM:0x7fe12c611240] Failed to send control frame type=2, error_code=-510:Stream was already closed or invalid
       ^[[1;30m(shrpx_spdy_upstream.cc:273)^[[0m
[^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7fe1389e42f0] dconn_ is NULL
       ^[[1;30m(shrpx_downstream.cc:303)^[[0m
[^[[1;33mWARN^[[0m] [UPSTREAM:0x7fe12c611240] Failed to send control frame type=2, error_code=-510:Stream was already closed or invalid
       ^[[1;30m(shrpx_spdy_upstream.cc:273)^[[0m
[^[[1;33mWARN^[[0m] [UPSTREAM:0x7fe12c3d8450] Failed to send control frame type=2, error_code=-512:The transmission is not allowed for this stream
       ^[[1;30m(shrpx_spdy_upstream.cc:273)^[[0m
[^[[1;33mWARN^[[0m] [UPSTREAM:0x7fe12c3d8450] Failed to send control frame type=2, error_code=-512:The transmission is not allowed for this stream
       ^[[1;30m(shrpx_spdy_upstream.cc:273)^[[0m
[^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7fe13c88e630] dconn_ is NULL
       ^[[1;30m(shrpx_downstream.cc:303)^[[0m
[^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7fe13cc578c0] dconn_ is NULL
       ^[[1;30m(shrpx_downstream.cc:303)^[[0m
[^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7fe13cf01540] dconn_ is NULL
       ^[[1;30m(shrpx_downstream.cc:303)^[[0m
[^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7fe14ce09060] dconn_ is NULL
       ^[[1;30m(shrpx_downstream.cc:303)^[[0m
[^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7fe13cf78600] dconn_ is NULL
       ^[[1;30m(shrpx_downstream.cc:303)^[[0m
[^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7fe13c70cf30] dconn_ is NULL
       ^[[1;30m(shrpx_downstream.cc:303)^[[0m
[^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7fe13cbe61d0] dconn_ is NULL
       ^[[1;30m(shrpx_downstream.cc:303)^[[0m
[^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7fe13c1bf4b0] dconn_ is NULL
       ^[[1;30m(shrpx_downstream.cc:303)^[[0m
[^[[1;33mWARN^[[0m] [UPSTREAM:0x7fe144950cf0] Failed to send control frame type=2, error_code=-512:The transmission is not allowed for this stream
       ^[[1;30m(shrpx_spdy_upstream.cc:273)^[[0m
[^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7fe1446efe60] dconn_ is NULL
       ^[[1;30m(shrpx_downstream.cc:303)^[[0m
[^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7fe13c8b28f0] dconn_ is NULL
       ^[[1;30m(shrpx_downstream.cc:303)^[[0m
[^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7fe1440a1c60] dconn_ is NULL

Then, the CPU usage goes up to 100% and stays there. So the server just stops responding to the requests.

[^[[1;31mERROR^[[0m] Accepting incoming connection failed
       ^[[1;30m(shrpx.cc:123)^[[0m
[^[[1;31mERROR^[[0m] Accepting incoming connection failed
       ^[[1;30m(shrpx.cc:123)^[[0m
[^[[1;31mERROR^[[0m] Accepting incoming connection failed
       ^[[1;30m(shrpx.cc:123)^[[0m
[^[[1;31mERROR^[[0m] Accepting incoming connection failed
       ^[[1;30m(shrpx.cc:123)^[[0m
[^[[1;31mERROR^[[0m] Accepting incoming connection failed
       ^[[1;30m(shrpx.cc:123)^[[0m

Once I restart shrpx, shrpx goes back to normal. I am getting the errors once or twice every day now.

Do you have any ideas of the cause with the information provided?

tatsuhiro-t commented 10 years ago

I experienced this error once, although I could not reproduce it so far. 8f73303 may fix this issue.

n0wa11 commented 10 years ago

Thanks. Let me observe for a few days.

berstend commented 10 years ago

Hi, I think I'm encountering the same or a very similar issue.

The problem also only occurs on instances with a bit of load (~30Mbit/s, ~120 unique connections) although my CPU usage doesn't go up to 100% (it only increases by 3-4%).

Relevant shrpx logging excerpts:

[WARN] [UPSTREAM:0x7f6e70174840] Failed to send control frame type=2, error_code=-510:Stream was already closed or invalid
       (shrpx_spdy_upstream.cc:273)
[WARN] [UPSTREAM:0x7f6e70174840] Failed to send control frame type=2, error_code=-510:Stream was already closed or invalid
       (shrpx_spdy_upstream.cc:273)
[WARN] [UPSTREAM:0x7f6e70174840] Failed to send control frame type=2, error_code=-510:Stream was already closed or invalid
       (shrpx_spdy_upstream.cc:273)
[WARN] [UPSTREAM:0x7f6e70174840] Failed to send control frame type=2, error_code=-510:Stream was already closed or invalid
       (shrpx_spdy_upstream.cc:273)
[INFO] [UPSTREAM:0x7f6e70174840] Stream stream_id=659 is being closed
       (shrpx_spdy_upstream.cc:99)
[INFO] [DCONN:0x7f6e70149b50] Detaching from DOWNSTREAM:0x7f6e700bb240
       (shrpx_http_downstream_connection.cc:308)
[INFO] [CLIENT_HANDLER:0x7f6e700e3350] Pooling downstream connection DCONN:0x7f6e70149b50
       (shrpx_client_handler.cc:259)
[INFO] [DOWNSTREAM:0x7f6e700bb240] Deleting
       (shrpx_downstream.cc:67)
[INFO] [DOWNSTREAM:0x7f6e700bb240] Deleted
       (shrpx_downstream.cc:77)
[INFO] [UPSTREAM:0x7f6e70174840] Stream stream_id=577 is being closed
       (shrpx_spdy_upstream.cc:99)
[INFO] [DOWNSTREAM:0x7f6e94019800] HTTP response completed
       (shrpx_spdy_upstream.cc:875)
[INFO] [UPSTREAM:0x7f6e940229a0] Stream stream_id=393 is being closed
       (shrpx_spdy_upstream.cc:99)
[INFO] [DCONN:0x7f6e9403cd90] Detaching from DOWNSTREAM:0x7f6e94019800
       (shrpx_http_downstream_connection.cc:308)
[INFO] [CLIENT_HANDLER:0x7f6e94002490] Pooling downstream connection DCONN:0x7f6e9403cd90
       (shrpx_client_handler.cc:259)
[INFO] [DOWNSTREAM:0x7f6e94019800] Deleting
       (shrpx_downstream.cc:67)
[INFO] [DOWNSTREAM:0x7f6e94019800] Deleted
       (shrpx_downstream.cc:77)
[ERROR] Accepting incoming connection failed
       (shrpx.cc:123)
[ERROR] Accepting incoming connection failed
       (shrpx.cc:123)
[ERROR] Accepting incoming connection failed
       (shrpx.cc:123)
[ERROR] Accepting incoming connection failed
       (shrpx.cc:123)
[ERROR] Accepting incoming connection failed
       (shrpx.cc:123)

If the client is Google Chrome a lot of requests are terminated with an ERR_SPDY_PROTOCOL_ERROR error.

Here is the relevant output from chrome://net-internals:

4680557: URL_REQUEST
http://www.wieistmeineip.de/
Start Time: 2013-08-16 20:11:25.220

t=1376676685220 [st= 0] +REQUEST_ALIVE  [dt=41]
t=1376676685220 [st= 0]    URL_REQUEST_BLOCKED_ON_DELEGATE  [dt=2]
                           --> delegate = "Adblock Plus-Erweiterung"
t=1376676685222 [st= 2]    URL_REQUEST_START_JOB  [dt=0]
                           --> load_flags = 144834817 (ENABLE_LOAD_TIMING | MAIN_FRAME | MAYBE_USER_GESTURE | REPORT_RAW_HEADERS | VALIDATE_CACHE | VERIFY_EV_CERT)
                           --> method = "GET"
                           --> priority = 4
                           --> url = "http://www.wieistmeineip.de/"
t=1376676685222 [st= 2]   +URL_REQUEST_START_JOB  [dt=39]
                           --> load_flags = 144834817 (ENABLE_LOAD_TIMING | MAIN_FRAME | MAYBE_USER_GESTURE | REPORT_RAW_HEADERS | VALIDATE_CACHE | VERIFY_EV_CERT)
                           --> method = "GET"
                           --> priority = 4
                           --> url = "http://www.wieistmeineip.de/"
t=1376676685222 [st= 2]      HTTP_CACHE_GET_BACKEND  [dt=0]
t=1376676685222 [st= 2]      HTTP_CACHE_DOOM_ENTRY  [dt=1]
                             --> net_error = -2 (ERR_FAILED)
t=1376676685223 [st= 3]      HTTP_CACHE_CREATE_ENTRY  [dt=0]
t=1376676685223 [st= 3]      HTTP_CACHE_ADD_TO_ENTRY  [dt=0]
t=1376676685223 [st= 3]     +HTTP_STREAM_REQUEST  [dt=0]
t=1376676685223 [st= 3]        HTTP_STREAM_REQUEST_BOUND_TO_JOB
                               --> source_dependency = 4680588 (HTTP_STREAM_JOB)
t=1376676685223 [st= 3]     -HTTP_STREAM_REQUEST
t=1376676685223 [st= 3]     +HTTP_TRANSACTION_SEND_REQUEST  [dt=1]
t=1376676685223 [st= 3]        HTTP_TRANSACTION_SPDY_SEND_REQUEST_HEADERS
                               --> :host: www.wieistmeineip.de
                                   :method: GET
                                   :path: /
                                   :scheme: http
                                   :version: HTTP/1.1
                                   accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
                                   accept-encoding: gzip,deflate,sdch
                                   accept-language: de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4
                                   cache-control: max-age=0
                                   cookie: [3629 bytes were stripped]
                                   pragma: no-cache
                                   user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36
t=1376676685224 [st= 4]     -HTTP_TRANSACTION_SEND_REQUEST
t=1376676685224 [st= 4]     +HTTP_TRANSACTION_READ_HEADERS  [dt=37]
t=1376676685261 [st=41]        SPDY_STREAM_ERROR
                               --> description = "SPDY stream closed with status: 6"
                               --> status = -337
                               --> stream_id = 247
t=1376676685261 [st=41]     -HTTP_TRANSACTION_READ_HEADERS
                             --> net_error = -337 (ERR_SPDY_PROTOCOL_ERROR)
t=1376676685261 [st=41]   -URL_REQUEST_START_JOB
                           --> net_error = -337 (ERR_SPDY_PROTOCOL_ERROR)
t=1376676685261 [st=41] -REQUEST_ALIVE
                         --> net_error = -337 (ERR_SPDY_PROTOCOL_ERROR)

I've tried building a stresstesting tool using spdycat but wasn't able to clinically reproduce the error.

I'm using the master version from 11.08.2013 so commit 8f73303 unfortunately doesn't fix the issue.

According to the log file shrpx had about 5113 filedescriptors which is high but shouldn't hit any limits on the host machine.

Thanks!

tatsuhiro-t commented 10 years ago

From the logs, it looks like response was sent multiple times to the same stream, which causes "Failed to send control frame type=2, ...". The commit 60fbb30 fixes this possible behavior. Could you apply this change and see how the instance performs? More than 5000 file descriptors are bit high to me, but shrpx tends to use lots of fd between backend server for multiple HTTP/1.1 connections, so it could reach that number in a small burst.

n0wa11 commented 10 years ago

After applying 8f73303, the problem still occurred.

Now, I am observing 60fbb30.

@berstend, what is bothering me most is that the error is causing a positive feedback loop, which eventually consumes all CPU, RAM and has the log file filling up the disk.

tatsuhiro-t commented 10 years ago

Does the log message look like the same as before applying the patch? I have several tests about this, and the patch certainly fixes them, but there may be other execution path to trigger this issue.

tatsuhiro-t commented 10 years ago

I understand that CPU utilization goes up with this issue, I'm not sure why RAM is filed up.

The commit 096dce7 makes log level of message "dconn_ is NULL" to INFO so that log traffic is decreased for them if you log WARNING or higher level message only.

n0wa11 commented 10 years ago

Now that the debug level was ERROR, I couldn't seem to locate the initial errors, when the latest episode started, after 8f73303 and before 60fbb30.

tatsuhiro-t commented 10 years ago

And do you still see this issue happening?

n0wa11 commented 10 years ago

I applied the 60fbb30 commit today. Please let me observe for a few more days.

n0wa11 commented 10 years ago

I think the problem is still there. The CPU goes to 100% right away.

[^[[1;33mWARN^[[0m] [UPSTREAM:0x7f0c3007aea0] Failed to send control frame type=2, error_code=-512:The transmission is not allowed for this stream ^[[1;30m(shrpx_spdyupstream.cc:273)^[[0m [^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7f0c1c1f8d90] dconn is NULL ^[[1;30m(shrpxdownstream.cc:303)^[[0m [^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7f0c1c24c090] dconn is NULL ^[[1;30m(shrpxdownstream.cc:303)^[[0m [^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7f0c1811fab0] dconn is NULL ^[[1;30m(shrpxdownstream.cc:303)^[[0m [^[[1;33mWARN^[[0m] [DOWNSTREAM:0x7f0c1807ce70] dconn is NULL ^[[1;30m(shrpx_downstream.cc:303)^[[0m [^[[1;33mWARN^[[0m] [UPSTREAM:0x7f0c1806e9b0] Failed to send control frame type=2, error_code=-512:The transmission is not allowed

tatsuhiro-t commented 10 years ago

I reproduced similar situation, not the completely same because there was no "dconn_ is NULL" in my log. I made couple of commits for the fix. Please apply all changes.

n0wa11 commented 10 years ago

Thanks. Applied.

n0wa11 commented 10 years ago

Although the original problem hasn't come back, the 3 commits after 096dce7 introduced a new problem.

With Version 28 and 29, through Chrome's secure proxy, I cannot access websites like twitter or facebook. The error looks as if the http handshake failed.

If I use stunnel as the client, there are not such problem.

tatsuhiro-t commented 10 years ago

e7542ea should fix it

n0wa11 commented 10 years ago

I haven't had problems in the past few days. I guess the problem is fixed.

Thank you very much.

irrenhaus commented 10 years ago

@tatsuhiro-t I'm afraid this problem is not yet fixed for me (using the latest master commit 6efaa1c).

I'm using shrpx with -s and --no-via options. Additionally there are multiple threads via -n 8. System is a Ubuntu 12.04.02 LTS with latest security updates & kernel.

After some time one of the threads eventually will become unresponsive with a 100% CPU load. Some time after that the next one will follow and so one until all SHRPX threads are unresponsive with 100% CPU load.

This happens randomly and not predictable. Interestingly: If I restart the process when at least one thread is at 100% CPU most of the time one thread will again instantly use 100% CPU after the restart. If I let it go for some time and then do a restart SHRPX runs as normal. Until the next time :)

Any ideas? The logs don't say anything to me, I had multiple lines stating "dconn_ is NULL":

[INFO] [DOWNSTREAM:0x7f52ec35a1e0] HTTP response completed (shrpx_spdy_upstream.cc:891) [INFO] [UPSTREAM:0x7f52ec397310] Stream stream_id=2437 is being closed (shrpx_spdy_upstream.cc:99) [INFO] [DCONN:0x7f52ec0c53e0] Detaching from DOWNSTREAM:0x7f52ec35a1e0 (shrpx_http_downstream_connection.cc:309) [INFO] [CLIENT_HANDLER:0x7f52ec0ebff0] Pooling downstream connection DCONN:0x7f52ec0c53e0 (shrpx_client_handler.cc:265) [INFO] [DOWNSTREAM:0x7f52ec35a1e0] Deleting (shrpx_downstream.cc:67) [INFO] [DOWNSTREAM:0x7f52ec35a1e0] Deleted (shrpxdownstream.cc:77) [INFO] [DOWNSTREAM:0x7f52e445b190] dconn is NULL (shrpx_downstream.cc:303) [INFO] [UPSTREAM:0x7f52e451d340] RST_STREAM stream_id=1603 (shrpx_spdy_upstream.cc:654) [INFO] [DOWNSTREAM:0x7f52ec6c4c70] HTTP response header completed (shrpx_spdy_upstream.cc:800)

But these messages (dconn_ is null) only appear rarely and there are no suspicious (well, to me) messages around them.

Strangely enough it seemed this first happened after updating from revision e7542ea to revision 5eac2d9. Before that revision e7542ea was running quite well. But since then I have these errors regardless of the version (which, for me, means that this only is coincidentally linked to the update).

I just have to idea where to look next. Any tips?

Thank you very much

tatsuhiro-t commented 10 years ago

Do you mean you did not have the problem with e7542ea ?

tatsuhiro-t commented 10 years ago

Ah, maybe I was wrong, you said "regardless of the version".

irrenhaus commented 10 years ago

The problem occured the first time with revision 5eac2d9 but is still present after a downgrade to version e7542ea so I don't think it is related to the version, just wanted to give all information I have ;).

Just as a further notice: In the meantime the traffic of my system increased quite a bit (meaning over the last 2-3 weeks). Traffic increased from ~5 MB/s to ~10MB/s handled by shrpx. So maybe this is something related to bad formatted packets or something like that?

If I can do anything to provide you with more information I would be happy to help. Maybe it helps if I try to attach a gdb to the running shrpx instance and try to find out where the process hangs to eat up 100% CPU?

tatsuhiro-t commented 10 years ago

Thank you for the detailed information. I think increased traffic surely maxes out CPU usage, because shrpx does not rate limit on write (read side is rated limited). I committed the change 26f14a5 which rate limits write side. Please apply it and see what it can do.

If it does not work, yeah, gdb is quite helpful. Also checking log and too many requests from same client or too big downloads may help.

irrenhaus commented 10 years ago

I'm sorry but it didn't make any difference. If the traffic were so high that the CPU maxes out I would have thought that all CPU cores would be used equally?

Maybe I should have mention that one a thread goes up to 100% CPU usage it never goes back down - even if I let it run for a day in this state.

Edit: Also on another server I have a throughput equally to the problematic server and there the CPU is at ~8%-10% per core.

I will look further into debugging with gdb, sadly my skills for debugging threaded processes are very low and I have to do it in a live system so this could take some time.

Thanks for your help!

tatsuhiro-t commented 10 years ago

Has the another server not exhibited this issue? It may be libevent bug. Libevent in Ubuntu 12.04.02 LTS is 2.0.16. There are numerous bug fixes since then. Could you build shrpx with the latest libevent 2.0.21-stable? http://libevent.org/

irrenhaus commented 10 years ago

I'm now using a self-compile d libevent-2.0.21-stable. Looks good until now. I will let this run for a few days and see if the error occures again.

Thank you very much for your help!