lpereira / lwan

Experimental, scalable, high performance HTTP server
https://lwan.ws
GNU General Public License v2.0
5.94k stars 548 forks source link

Unexpected socket closing #340

Closed pontscho closed 2 years ago

pontscho commented 2 years ago

Hey,

I wrote a service that open connections to an other service and serve data with Lwan. I experienced that Lwan - because of the internal socket cache - closes sockets wasn't opened by Lwan. For example:

  1. making a call on Lwan backend, it a open socket - for example - 22 and this socket was closed by end of HTTP call
  2. making a new call on Lwan backend starting a communication thread and it make a permanent connection, it open socket 22 and and leave the original HTTP connection
  3. after ~10seconds Lwan close this independent socket because of the timeout queue.

I made a small patch but I'm not sure this way is good enough for solving this issue but after a longer test it worked for me fine.

I'm really curious about your opinion, thanks Pontscho

--- a/src/lib/lwan-tq.c
+++ b/src/lib/lwan-tq.c
@@ -93,7 +93,10 @@ void timeout_queue_expire(struct timeout_queue *tq,
         conn->coro = NULL;
     }

-    close(lwan_connection_get_fd(tq->lwan, conn));
+    if ((conn->flags & CONN_CLOSED) == 0) {
+        close(lwan_connection_get_fd(tq->lwan, conn));
+    }
+    conn->flags |= CONN_CLOSED;
 }

--- a/src/lib/lwan.h
+++ b/src/lib/lwan.h
@@ -292,6 +292,8 @@ enum lwan_connection_flags {

     /* Is this a TLS connection? */
     CONN_TLS = 1 << 12,
+
+    CONN_CLOSED = 1 << 13,
 };
lpereira commented 2 years ago

Lwan doesn't have a cache for connections (it pre-allocates the connection structs but it's not really a cache per se).

What happens in this case is that the operating system always gives you back a file descriptor with the lowest available value first, and in the case that you make a request with Connection: close, and make another request, the OS is very likely to give you the same file descriptor from the connection that just closed to the new connection you just made.

But it's going to be a new connection from the OS perspective and from Lwan's perspective. You can look in the spawn_coro() function in lwan-thread.c to see this being initialized. You can follow what calls this function to understand the flow of how new connections are accepted and initialized.

Have you seen any problem related to this? Like, Lwan trying to close the same connection twice, or something? If you saw this then this is definitely a bug; otherwise, this is working as designed.

pontscho commented 2 years ago

Lwan doesn't have a cache for connections (it pre-allocates the connection structs but it's not really a cache per se).

Yep, I saw that when I "rev. eng'd" the lwan_connection_get_fd() function.

Have you seen any problem related to this? Like, Lwan trying to close the same connection twice, or something?

Definitely. Flow is that:

This last move has to effects:

That few lines above is just a band-aid, not a real solution but helps to understand the problem that cause a strange hazard. It was fun to find.

If you saw this then this is definitely a bug

I think it is a bug, but I wasn't sure about it isn't planned.

lpereira commented 2 years ago

I never saw this, and I run Lwan under strace frequently!

It's really not supposed to work this way: when connections are closed due to a timeout, the timeout is removed from the wheel, the connection is removed from timeout queue, the coroutine is freed, and the connection is finally closed. This all happens before epoll_wait() is called, so the I/O loop should never pick up events for a connection that has been closed (and thus, the move_to_last() call, or even resuming a coroutine associated with a closed connection happening doesn't make any sense to me)

I'm really puzzled

Are you seeing this on which system? Linux, macOS or a BSD system?

On Thu, Apr 7, 2022, at 11:54 PM, pontscho wrote:

Lwan doesn't have a cache for connections (it pre-allocates the connection structs but it's not really a cache per se).

Yep, I saw that when I "rev. eng'd" the lwan_connection_get_fd() function.

Have you seen any problem related to this? Like, Lwan trying to close the same connection twice, or something?

Definitely. Flow is that:

  • when HTTP call happened, spawn_coro() calls timeout_queue_insert()

  • after that call or a while (I don't remember my debug logs) timeout_queue_expire() was called from somewhere

  • after a while Lwan reinsert this entry into the tq with timeout_queue_move_to_last() (in thread_io_loop()) after this socket was expire in previous call (resume_coro()) This last move has to effects:

  • Lwan closes foreign sockets unexpectedly,

  • cause some performance degradation, because timeout list is getting longer and longer (I saw this in my debug logs, Lwan try to close those invalid sockets in tq forever). That few lines above is just a band-aid, not a real solution but helps to understand the problem that cause a strange hazard. It was fun to find.

If you saw this then this is definitely a bug

I think it is a bug, but I wasn't sure about it isn't planned.

— Reply to this email directly, view it on GitHub https://github.com/lpereira/lwan/issues/340#issuecomment-1092503083, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAADVGMYJY4YSRQO22VJ4UDVD7JZLANCNFSM5SWPRWFQ. You are receiving this because you commented.Message ID: @.***>

pontscho commented 2 years ago

Sorry for the late answer, I had some time just now for working on this issue.

I know how sockets are working on tcp, that's why I was surprised those unexpected socket closing. Let me explain steps for reproducing this issue with current master (939dbf6356174af94cc8dff096d809480ad7c219):

1) first of all, apply this patch please:

diff --git a/src/lib/lwan-tq.c b/src/lib/lwan-tq.c
index 096e56b1..76d19fa9 100644
--- a/src/lib/lwan-tq.c
+++ b/src/lib/lwan-tq.c
@@ -66,6 +66,7 @@ void timeout_queue_move_to_last(struct timeout_queue *tq,
      * served.  In practice, if this is called, it's a keep-alive connection. */
     conn->time_to_expire = tq->current_time + tq->move_to_last_bump;

+    lwan_status_info(" ! LWAN MOVE TO LAST: %d", lwan_connection_get_fd(tq->lwan, conn));
     timeout_queue_remove(tq, conn);
     timeout_queue_insert(tq, conn);
 }
@@ -93,6 +94,7 @@ void timeout_queue_expire(struct timeout_queue *tq,
         conn->coro = NULL;
     }

+    lwan_status_info(" ! LWAN CLOSE: %d", lwan_connection_get_fd(tq->lwan, conn));
     close(lwan_connection_get_fd(tq->lwan, conn));
 }

2) start Lwan, for example that hello http server from samples directory

3) make a GET call first: curl -s -X GET http://127.0.0.1:8080 > /dev/null, you have to see something like this:

1606947 lwan-request.c:1488 log_request() 127.0.0.1 [Tue, 19 Apr 2022 13:51:09 GMT] ae3e96ebc6245d8b "GET / HTTP/1.1" 200 text/plain (r:0.003ms p:0.528ms)
1606947 lwan-tq.c:69 timeout_queue_move_to_last()  ! LWAN MOVE TO LAST: 40
1606947 lwan-tq.c:97 timeout_queue_expire()  ! LWAN CLOSE: 40

I think this is normal, but...

4) make a PUT or POST call: curl -s -X PUT http://127.0.0.1:8080 > /dev/null and you have to see something like this:

1606947 lwan-request.c:1488 log_request() 127.0.0.1 [Tue, 19 Apr 2022 13:51:39 GMT] 1be8c15addce20a4 "PUT / HTTP/1.1" 400 text/html (r:0.007ms p:1.203ms)
1606947 lwan-tq.c:97 timeout_queue_expire()  ! LWAN CLOSE: 40
1606947 lwan-tq.c:69 timeout_queue_move_to_last()  ! LWAN MOVE TO LAST: 40
 ... after ~20 seconds...
1606947 lwan-tq.c:97 timeout_queue_expire()  ! LWAN CLOSE: 40

Lwan close socket 40 twice, and if I open a socket and it get a socket was used by Lwan previously, Lwan will close that socket whatever who open it.

lpereira commented 2 years ago

Thanks for the steps to reproduce this. I'll take a look over the weekend and see if I can figure something out.

On Tue, Apr 19, 2022, at 6:57 AM, pontscho wrote:

Sorry for the late answer, I had some time just now for working on this issue.

I know how sockets are working on tcp, that's why I was surprised those unexpected socket closing. Let me explain steps for reproducing this issue with current master (939dbf6):

  1. first of all, apply this patch please: `diff --git a/src/lib/lwan-tq.c b/src/lib/lwan-tq.c index 096e56b1..76d19fa9 100644 --- a/src/lib/lwan-tq.c +++ b/src/lib/lwan-tq.c @@ -66,6 +66,7 @@ void timeout_queue_move_to_last(struct timeout_queue *tq,
    • served. In practice, if this is called, it's a keep-alive connection. */ conn->time_to_expire = tq->current_time + tq->move_to_last_bump;
  • lwan_status_info(" ! LWAN MOVE TO LAST: %d", lwan_connection_get_fd(tq->lwan, conn)); timeout_queue_remove(tq, conn); timeout_queue_insert(tq, conn); } @@ -93,6 +94,7 @@ void timeout_queue_expire(struct timeout_queue *tq, conn->coro = NULL; }

  • lwan_status_info(" ! LWAN CLOSE: %d", lwan_connection_get_fd(tq->lwan, conn)); close(lwan_connection_get_fd(tq->lwan, conn)); }

`

  1. start your hello sample http server
  2. make a GET call first: curl -s -X GET http://127.0.0.1:8080 > /dev/null, you have to see something like this: 1606947 lwan-request.c:1488 log_request() 127.0.0.1 [Tue, 19 Apr 2022 13:51:09 GMT] ae3e96ebc6245d8b "GET / HTTP/1.1" 200 text/plain (r:0.003ms p:0.528ms) 1606947 lwan-tq.c:69 timeout_queue_move_to_last() ! LWAN MOVE TO LAST: 40 1606947 lwan-tq.c:97 timeout_queue_expire() ! LWAN CLOSE: 40

This steps are okay. 4) make a PUT call like this: curl -s -X PUT http://127.0.0.1:8080 > /dev/null and you have to see something like this:

1606947 lwan-request.c:1488 log_request() 127.0.0.1 [Tue, 19 Apr 2022 13:51:39 GMT] 1be8c15addce20a4 "PUT / HTTP/1.1" 400 text/html (r:0.007ms p:1.203ms) 1606947 lwan-tq.c:97 timeout_queue_expire() ! LWAN CLOSE: 40 1606947 lwan-tq.c:69 timeout_queue_move_to_last() ! LWAN MOVE TO LAST: 40 ... after ~20 seconds... 1606947 lwan-tq.c:97 timeout_queue_expire() ! LWAN CLOSE: 40 Lwan close socket 40 twice, and if I open a socket and it get a socket was used by Lwan previously, Lwan will close that socket.

— Reply to this email directly, view it on GitHub https://github.com/lpereira/lwan/issues/340#issuecomment-1102689598, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAADVGPCLSSFURIQYRSGGTTVF23UDANCNFSM5SWPRWFQ. You are receiving this because you commented.Message ID: @.***>

lpereira commented 2 years ago

I had some time tonight and was able to reproduce and fix the issue. This was actually a long-standing problem that should be fixed now. Thanks for reporting and the steps to reproduce! Please let me know if it works properly now.

On Tue, Apr 19, 2022, at 7:33 PM, L. Pereira wrote:

Thanks for the steps to reproduce this. I'll take a look over the weekend and see if I can figure something out.

On Tue, Apr 19, 2022, at 6:57 AM, pontscho wrote:

Sorry for the late answer, I had some time just now for working on this issue.

I know how sockets are working on tcp, that's why I was surprised those unexpected socket closing. Let me explain steps for reproducing this issue with current master (939dbf6):

  1. first of all, apply this patch please: `diff --git a/src/lib/lwan-tq.c b/src/lib/lwan-tq.c index 096e56b1..76d19fa9 100644 --- a/src/lib/lwan-tq.c +++ b/src/lib/lwan-tq.c @@ -66,6 +66,7 @@ void timeout_queue_move_to_last(struct timeout_queue *tq,
    • served. In practice, if this is called, it's a keep-alive connection. */ conn->time_to_expire = tq->current_time + tq->move_to_last_bump;
  • lwan_status_info(" ! LWAN MOVE TO LAST: %d", lwan_connection_get_fd(tq->lwan, conn)); timeout_queue_remove(tq, conn); timeout_queue_insert(tq, conn); } @@ -93,6 +94,7 @@ void timeout_queue_expire(struct timeout_queue *tq, conn->coro = NULL; }

  • lwan_status_info(" ! LWAN CLOSE: %d", lwan_connection_get_fd(tq->lwan, conn)); close(lwan_connection_get_fd(tq->lwan, conn)); }

`

  1. start your hello sample http server
  2. make a GET call first: curl -s -X GET http://127.0.0.1:8080 > /dev/null, you have to see something like this: 1606947 lwan-request.c:1488 log_request() 127.0.0.1 [Tue, 19 Apr 2022 13:51:09 GMT] ae3e96ebc6245d8b "GET / HTTP/1.1" 200 text/plain (r:0.003ms p:0.528ms) 1606947 lwan-tq.c:69 timeout_queue_move_to_last() ! LWAN MOVE TO LAST: 40 1606947 lwan-tq.c:97 timeout_queue_expire() ! LWAN CLOSE: 40

This steps are okay. 4) make a PUT call like this: curl -s -X PUT http://127.0.0.1:8080 > /dev/null and you have to see something like this:

1606947 lwan-request.c:1488 log_request() 127.0.0.1 [Tue, 19 Apr 2022 13:51:39 GMT] 1be8c15addce20a4 "PUT / HTTP/1.1" 400 text/html (r:0.007ms p:1.203ms) 1606947 lwan-tq.c:97 timeout_queue_expire() ! LWAN CLOSE: 40 1606947 lwan-tq.c:69 timeout_queue_move_to_last() ! LWAN MOVE TO LAST: 40 ... after ~20 seconds... 1606947 lwan-tq.c:97 timeout_queue_expire() ! LWAN CLOSE: 40 Lwan close socket 40 twice, and if I open a socket and it get a socket was used by Lwan previously, Lwan will close that socket.

— Reply to this email directly, view it on GitHub https://github.com/lpereira/lwan/issues/340#issuecomment-1102689598, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAADVGPCLSSFURIQYRSGGTTVF23UDANCNFSM5SWPRWFQ. You are receiving this because you commented.Message ID: @.***>

pontscho commented 2 years ago

Hey, I think this bug is fixed, thank you very much. And other hand, after this patch the CPU usage of Lwan decreased on macos.

lpereira commented 2 years ago

That's a nice surprise! I rarely test Lwan on macOS these days as my Mac gave up the ghost, but it's good to know.