tempesta-tech / tempesta

All-in-one solution for high performance web content delivery and advanced protection against DDoS and web attacks
https://tempesta-tech.com/
GNU General Public License v2.0

Improve the architecture that supports the correct order of HTTP responses #687

Open keshonok opened 7 years ago

keshonok commented 7 years ago

The current architecture extensively uses locking of seq_queue in client connections and fwd_queue in server connections to enforce the correct order of responses to HTTP requests. The locks are held for a relatively long time, and due to the multi-CPU nature of Tempesta's operation there's significant lock contention under high load.

As the current architecture and its behavioural patterns become better understood, it should be improved with the goal of decreasing lock contention and improving overall performance.
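
For illustration, here is a minimal sketch of the response-ordering hot spot with hypothetical, simplified types and names (the real logic lives around tfw_http_resp_fwd() in http.c); this is not the actual Tempesta code, only the locking pattern described above:

```c
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/types.h>

struct sketch_req {
	struct list_head seq_list;	/* link in the client's seq_queue */
	bool		 resp_ready;	/* paired response fully received */
};

struct sketch_cli_conn {
	spinlock_t	 seq_qlock;	/* serializes response ordering */
	struct list_head seq_queue;	/* requests in client order */
};

/*
 * Whichever CPU receives a response must take seq_qlock and walk the
 * whole seq_queue, so all CPUs serving one client serialize here.
 */
static void sketch_resp_fwd(struct sketch_cli_conn *cli, struct sketch_req *req)
{
	struct sketch_req *r, *tmp;

	spin_lock_bh(&cli->seq_qlock);
	req->resp_ready = true;
	/* Forward responses strictly in request order; stop at a hole. */
	list_for_each_entry_safe(r, tmp, &cli->seq_queue, seq_list) {
		if (!r->resp_ready)
			break;
		list_del_init(&r->seq_list);
		/* ...push the prepared response to the client socket... */
	}
	spin_unlock_bh(&cli->seq_qlock);
}
```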

krizhanovsky commented 7 years ago

The background of the issue is pull request https://github.com/tempesta-tech/tempesta/pull/660, in particular the comments https://github.com/tempesta-tech/tempesta/pull/660#discussion_r101899999 (tfw_http_req_fwd()) and https://github.com/tempesta-tech/tempesta/pull/660#discussion_r101902213 (tfw_http_resp_fwd()). Backoff and per-CPU lock data structures could be used, just like in Linux MCS locking, for the first case, and maybe we can employ our work queue to access TfwConn on a per-CPU basis in the second case. Also see the comments added in https://github.com/tempesta-tech/tempesta/pull/660/commits/6478c707cfb65916d53fada60ca17955f968c84c (contention on fwd_qlock and seq_qlock).

Another issue is the multiple list operations, e.g. https://github.com/tempesta-tech/tempesta/pull/851#discussion_r158995682 .

Probably https://github.com/tempesta-tech/tempesta/pull/851#discussion_r158998002 is also a good candidate to be done in the context of this issue.

Linked with #940 and #941. The latter describes a couple of possible optimizations in https://github.com/tempesta-tech/tempesta/pull/884#discussion_r168962872 (points 2 and 3).

Also, it seems that the only application of nip_queue/nip_list is to unmark non-idempotent requests in a server send queue when a client sends more requests after a non-idempotent request, so probably the list can be eliminated.
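
If that really is the only use, a rough sketch of replacing the list with a pointer to the last enqueued request could look like the following (hypothetical structures and names, not the actual Tempesta code; the semantics are simplified to "a following request unmarks the previously queued non-idempotent one"):

```c
#include <linux/list.h>
#include <linux/bitops.h>
#include <linux/types.h>

enum { SKETCH_REQ_NIP_BIT = 0 };	/* request is non-idempotent */

struct sketch_req {
	struct list_head fwd_list;
	unsigned long	 flags;
};

struct sketch_srv_conn {
	struct list_head  fwd_queue;
	struct sketch_req *last_req;	/* tail of fwd_queue, may be NULL */
};

static void sketch_add_req(struct sketch_srv_conn *srv, struct sketch_req *req,
			   bool nonidempotent)
{
	/* A later request unmarks the previously queued non-idempotent one. */
	if (srv->last_req)
		clear_bit(SKETCH_REQ_NIP_BIT, &srv->last_req->flags);
	if (nonidempotent)
		set_bit(SKETCH_REQ_NIP_BIT, &req->flags);
	list_add_tail(&req->fwd_list, &srv->fwd_queue);
	srv->last_req = req;
}
```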

Linked with #1065 (HTTP requests fair scheduling).

http.c has become very complex (it's the second largest source file after the parser), so it'd be good to split the logic into HTTP server, client, and common pieces.

vankoven commented 5 years ago

Currently the request processing routine looks like this (a condensed sketch of the locking part follows the list):

  1. NET_RX_SOFTIRQ
  2. parse the request and detach it from the client connection;
  3. make all required modifications to the request before forwarding;
  4. assign the request to the forward_queue of some server connection;
  5. the forward_queue lock is acquired;
  6. forward all unsent requests on that server connection;
  7. for each request to be forwarded: push the server connection socket to the workqueue with the SS_Send command;
  8. the forward_queue lock is released;
  9. NET_TX_SOFTIRQ
  10. pop tasks from the work queue and complete them: send already prepared buffers to the sockets.
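
For reference, steps 5-8 above in a condensed form (hypothetical, simplified names; only an illustration of where the lock is held, not the actual Tempesta API):

```c
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/types.h>

enum sketch_ss_cmd { SKETCH_SS_SEND };

struct sketch_req {
	struct list_head fwd_list;
	bool		 sent;
};

struct sketch_srv_conn {
	spinlock_t	 fwd_qlock;
	struct list_head fwd_queue;
};

/* Hypothetical stand-in for the lock-free work queue push (step 7). */
static void sketch_wq_push(struct sketch_srv_conn *srv, struct sketch_req *req,
			   enum sketch_ss_cmd cmd)
{
	/* ...enqueue {srv, req, cmd} for the TX softirq... */
}

static void sketch_fwd_unsent(struct sketch_srv_conn *srv)
{
	struct sketch_req *req;

	spin_lock_bh(&srv->fwd_qlock);				/* step 5 */
	list_for_each_entry(req, &srv->fwd_queue, fwd_list) {	/* step 6 */
		if (req->sent)
			continue;
		sketch_wq_push(srv, req, SKETCH_SS_SEND);	/* step 7 */
		req->sent = true;
	}
	spin_unlock_bh(&srv->fwd_qlock);			/* step 8 */
}
```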

In this processing pipeline two queues are used: forward_queue and workqueue. The first doesn't have a distinct consumer and producer and is protected by spinlocks; the second has a distinct producer and consumer and is implemented as lock-free. Probably we can revise the algorithm and get rid of the locking:

  1. NET_RX_SOFTIRQ
  2. parse the request and detach it from the client connection;
  3. make all required modifications to the request before forwarding;
  4. assign the request to the forward_queue of some server connection, as a lock-free operation;
  5. push the server connection socket to the workqueue with the SS_Send command;
  6. NET_TX_SOFTIRQ
  7. pop tasks from the work queue and complete them: get the server connection from the socket;
  8. forward all unsent requests on that server connection;
  9. for each request to be forwarded: send already prepared buffers to the socket; don't pop requests from the forward_queue.

Here the producer and the consumer of the forward_queue are split, and no locks for the forward_queue are required. Roughly the same algorithm should be implemented for response processing: push the client socket into the work queue after the response is adjusted and ready to be sent.
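
One possible shape of that split, sketched with hypothetical names: it assumes a lock-free MPSC ingress list (Linux llist) fed by the RX side, plus a fwd_queue that only the TX side ever touches, so neither side takes a lock. This only illustrates the producer/consumer separation, not the actual implementation:

```c
#include <linux/llist.h>
#include <linux/list.h>

struct sketch_req {
	struct llist_node ingress;	/* filled by RX, drained by TX */
	struct list_head  fwd_list;	/* TX-private forwarding order */
};

struct sketch_srv_conn {
	struct llist_head ingress_q;	/* lock-free producer side */
	struct list_head  fwd_queue;	/* consumer-only, no lock needed */
};

/* RX softirq (steps 4-5 of the proposal): no fwd_queue lock is taken. */
static void sketch_rx_assign(struct sketch_srv_conn *srv, struct sketch_req *req)
{
	llist_add(&req->ingress, &srv->ingress_q);
	/* ...push the server connection socket to the work queue (SS_Send)... */
}

/* TX softirq (steps 7-9 of the proposal): the single consumer per connection. */
static void sketch_tx_send(struct sketch_srv_conn *srv)
{
	struct llist_node *in = llist_del_all(&srv->ingress_q);
	struct sketch_req *req, *tmp;

	/* llist is LIFO: restore arrival order before appending. */
	in = llist_reverse_order(in);
	llist_for_each_entry_safe(req, tmp, in, ingress)
		list_add_tail(&req->fwd_list, &srv->fwd_queue);

	list_for_each_entry(req, &srv->fwd_queue, fwd_list) {
		/*
		 * ...send already prepared buffers to the socket; requests
		 * stay in fwd_queue until their responses are received...
		 */
	}
}
```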

There is a difference between how the work queue operates now and how it should operate in my proposal. Now a task in the work queue is a single message send operation; in my proposal it is iterating over the list and sending requests one by one. It's not wise to add a new task for the same connection: progress on a previous task will make later tasks empty, and we still need to track the budget.

Thus the work queue should be reimplemented as a round-robin queue (a rough sketch follows the PROs/CONs below). Pushing into queue:

PROs:

CONs:
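
A rough sketch of such a round-robin work queue over connections, with hypothetical names and a made-up budget constant; the essential points of the proposal are "a connection is enqueued at most once" and "a connection with an exhausted budget goes back to the tail":

```c
#include <linux/llist.h>
#include <linux/bitops.h>

#define SKETCH_TX_BUDGET 64	/* requests sent per connection per round */

enum { SKETCH_CONN_QUEUED = 0 };

struct sketch_conn {
	struct llist_node wq_node;
	unsigned long	  flags;
};

struct sketch_wq {
	struct llist_head head;
};

/* Producer side: enqueue the connection only if it is not queued yet. */
static void sketch_wq_push_conn(struct sketch_wq *wq, struct sketch_conn *c)
{
	if (test_and_set_bit(SKETCH_CONN_QUEUED, &c->flags))
		return;		/* already scheduled, its task will see the work */
	llist_add(&c->wq_node, &wq->head);
}

/* Hypothetical sender: walks the connection's fwd_queue, returns sent count. */
static int sketch_conn_xmit(struct sketch_conn *c, int budget)
{
	/* ...send up to @budget unsent requests from c's fwd_queue... */
	return 0;
}

/* Consumer side (TX softirq): round robin over the queued connections. */
static void sketch_wq_run(struct sketch_wq *wq)
{
	struct llist_node *list = llist_del_all(&wq->head);
	struct sketch_conn *c, *tmp;

	llist_for_each_entry_safe(c, tmp, list, wq_node) {
		int sent;

		/* Clear before sending: work arriving during the walk re-queues. */
		clear_bit(SKETCH_CONN_QUEUED, &c->flags);
		sent = sketch_conn_xmit(c, SKETCH_TX_BUDGET);
		if (sent == SKETCH_TX_BUDGET)
			sketch_wq_push_conn(wq, c);	/* back of the line */
	}
}
```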

Maybe such an approach was already discussed when message pipelining was introduced, and there are some caveats I don't know about...

krizhanovsky commented 5 years ago

@ikoveshnikov good point! I appreciate the further movement towards the lock-free RB transport and doing as much work as possible on the local CPU. Also, the proposal to move from single-message work items for the RB to full connection processing looks quite good. We've never discussed such an opportunity.

Some time ago we discussed that http.c is becoming too sophisticated and that it makes sense to split it into client and server parts, or TX and RX parts in your proposal. So probably we can end up with simpler and faster HTTP processing.

vankoven commented 5 years ago

Some time ago we've discussed that http.c becomes too sophisticated and it makes sense to split it to client and server parts

Totally agree here.

aleksostapenko commented 5 years ago

In the context of the last proposition, the problem with the -EBUSY error (one of the subjects of issue #1003) may go away: if the connections themselves are in the work queue, we will be able to add requests there up to the limit of the queue itself (1000 by default), and the -EBUSY problem (when we have no time to drain the work queue) will simply transform into exhaustion of the fwd_queue limit. So we'll be able to avoid rescheduling requests on the -EBUSY error (the current solution for issue #1003).
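
In other words, the back-pressure check moves to the fwd_queue itself; a minimal sketch with hypothetical names and a hypothetical limit:

```c
#include <linux/atomic.h>
#include <linux/errno.h>

#define SKETCH_FWD_QUEUE_LIMIT 1000	/* hypothetical per-connection limit */

struct sketch_srv_conn {
	atomic_t qsize;		/* current fwd_queue length */
};

/*
 * Instead of handling -EBUSY from the work queue push, fail (and let the
 * caller reschedule the request to another connection) when the fwd_queue
 * itself is full.
 */
static int sketch_fwd_queue_add(struct sketch_srv_conn *srv)
{
	if (atomic_inc_return(&srv->qsize) > SKETCH_FWD_QUEUE_LIMIT) {
		atomic_dec(&srv->qsize);
		return -ENOSPC;
	}
	/* ...append the request and enqueue the connection if needed... */
	return 0;
}
```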

Also, there are several points which should be taken into account regarding the fwd_queue itself. We still need to make the fwd_queue lock-free, and to ease this task the proposed approach implies dividing the push, send (read), and pop operations (which currently can be processed in one thread) into three threads, i.e. separate producer, sender (reader), and consumer contexts appear. However, some additional complexities need to be considered in this scheme:

  1. In the sender thread, apparently, it will also be necessary to evict (consume) requests from the fwd_queue (timeout eviction, dropped requests, and we can still run into SOCK_DEAD as well) - see the sketch after this list;
  2. The re-sending procedure during connection repair also seems to belong to the sender thread's context, so msg_sent resetting may occur in this context too;
  3. The base re-scheduling procedure (during connection closing), which is essentially pop and push operations, probably should be redesigned in this approach.
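
A sketch of point 1, assuming a single-consumer fwd_queue and hypothetical names: the sender's walk over the queue is also the natural place for timeout and drop eviction:

```c
#include <linux/list.h>
#include <linux/jiffies.h>
#include <linux/types.h>

struct sketch_req {
	struct list_head fwd_list;
	unsigned long	 jtxtstamp;	/* when the request was queued */
	bool		 dropped;	/* the client gave up on the request */
};

struct sketch_srv_conn {
	struct list_head fwd_queue;
	unsigned long	 fwd_timeout;	/* eviction timeout, in jiffies */
};

/* Sender context: the only consumer, so eviction needs no extra locking. */
static void sketch_tx_walk(struct sketch_srv_conn *srv)
{
	struct sketch_req *req, *tmp;

	list_for_each_entry_safe(req, tmp, &srv->fwd_queue, fwd_list) {
		if (req->dropped ||
		    time_after(jiffies, req->jtxtstamp + srv->fwd_timeout)) {
			list_del_init(&req->fwd_list);
			/* ...send an error response and/or free the request... */
			continue;
		}
		/* ...send the request if it has not been sent yet... */
	}
}
```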

krizhanovsky commented 5 years ago

#1175 and #1180 are perfect examples of how complicated our HTTP connection processing is. The bugs have very complex scenarios and required a lot of time to find and fix. Moreover, we have had a lot of other synchronization problems on HTTP connections and queues. All in all, the current architecture is too complex to develop and support and has clear performance issues. A better solution is required. I'm making the issue crucial because of the constant queue of bugs.

Probably an easier solution than https://github.com/tempesta-tech/tempesta/issues/687#issuecomment-449298649 is possible:

  1. assign each HTTP connection to a CPU, just like we do for TCP sockets;
  2. distribute work among the connections through the lock-free RB queue transport.

Each CPU can read a TfwConn object belonging to a remote CPU, but needs to 'send a message' to that CPU if it wants to send an HTTP message through the connection's queue, close it, or whatever else. Probably the same work queues as we use for TCP sockets can be reused - it seems we don't need them any more if we have per-CPU HTTP connections on the higher layer.
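
A minimal sketch of the ownership rule, with hypothetical names: read anywhere, mutate only on the home CPU, otherwise post a message to the owner's per-CPU lock-free queue:

```c
#include <linux/llist.h>
#include <linux/percpu.h>
#include <linux/smp.h>

struct sketch_conn {
	int		  cpu;		/* home CPU, fixed at connection setup */
	struct llist_node msg_node;
};

/* One message queue per CPU, drained from that CPU's TX softirq. */
static DEFINE_PER_CPU(struct llist_head, sketch_conn_msgs);

static void sketch_conn_send(struct sketch_conn *c)
{
	if (c->cpu == smp_processor_id()) {
		/* Local connection: work on its queues directly, no locks. */
		return;
	}
	/* Remote connection: hand the work over to the owning CPU. */
	llist_add(&c->msg_node, per_cpu_ptr(&sketch_conn_msgs, c->cpu));
	/* ...kick the remote CPU's softirq if it is idle... */
}
```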

Probably some CPU/connection scheduler is required for intentionally opened connections (listen-created connections are distributed by RSS/RPS). The scheduler algorithm is TBD.

krizhanovsky commented 5 years ago

One more optimization proposal, from https://github.com/tempesta-tech/tempesta/issues/941#issuecomment-371494707, about more efficient handling of responses for dead client connections:

A client sends us several requests and the client connection is immediately terminated. All of the requests are in fwd_queue's and are being sent to servers, re-forwarded and so on until all of them receive responses. Then tfw_http_resp_fwd() finds the empty seq_queue of the client connection and calls ss_close_sync() for each of the requests. All in all, we do a lot of unnecessary work.

It's OK to leave the dead requests in fwd_queue's to minimize lock contention, but it's easy and cheap to do the following (a sketch follows the list):

  1. set a flag for the requests to make them evicted from the fwd_queue right when they're about to be sent to a server or rescheduled;
  2. we don't need to call ss_close_sync() in tfw_http_resp_fwd(), since all connection_error and connection_drop callbacks follow the ss_do_close() call in sock.c, so tfw_http_conn_cli_drop() is always called on a closed socket;
  3. tfw_http_conn_cli_drop() should call tfw_connection_put() just like tfw_sock_srv_connect_drop() and tfw_sock_srv_connect_failover() do. When tfw_http_resp_fwd() calls tfw_http_conn_msg_free() for the last request referencing the client connection, the connection is freed.
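
A sketch of how the three points above could fit together, with hypothetical names (the real entry points are tfw_http_conn_cli_drop(), tfw_http_resp_fwd() and tfw_http_conn_msg_free()):

```c
#include <linux/list.h>
#include <linux/bitops.h>
#include <linux/types.h>

enum { SKETCH_REQ_CLI_DEAD = 0 };	/* client connection already closed */

struct sketch_req {
	struct list_head seq_list;	/* link in the client's seq_queue */
	struct list_head fwd_list;	/* link in a server's fwd_queue */
	unsigned long	 flags;
};

struct sketch_cli_conn {
	struct list_head seq_queue;
};

/*
 * Points 1 and 3: only mark the queued requests and drop the connection
 * reference; the requests stay in their fwd_queue's.
 */
static void sketch_cli_drop(struct sketch_cli_conn *cli)
{
	struct sketch_req *req;

	list_for_each_entry(req, &cli->seq_queue, seq_list)
		set_bit(SKETCH_REQ_CLI_DEAD, &req->flags);
	/* ...drop the connection reference (point 3); freeing the last
	 * queued request then frees the connection... */
}

/*
 * Point 1 (consumer side): forwarding/rescheduling evicts marked requests
 * lazily, and per point 2 the response path needs no ss_close_sync().
 */
static bool sketch_req_evictable(struct sketch_req *req)
{
	return test_bit(SKETCH_REQ_CLI_DEAD, &req->flags);
}
```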