keshonok opened this issue 7 years ago
The background of this issue is pull request https://github.com/tempesta-tech/tempesta/pull/660, and in particular the comments https://github.com/tempesta-tech/tempesta/pull/660#discussion_r101899999 (`tfw_http_req_fwd()`) and https://github.com/tempesta-tech/tempesta/pull/660#discussion_r101902213 (`tfw_http_resp_fwd()`). Backoff and per-CPU lock data structures could be used, just like in Linux MCS locking, for the first case, and maybe we can employ our work queue to access `TfwConn` on a per-CPU basis in the second case. Also see the comments added in https://github.com/tempesta-tech/tempesta/pull/660/commits/6478c707cfb65916d53fada60ca17955f968c84c (contention on `fwd_qlock` and `seq_qlock`).
The other issue is multiple list operations, e.g. https://github.com/tempesta-tech/tempesta/pull/851#discussion_r158995682 .
Probably https://github.com/tempesta-tech/tempesta/pull/851#discussion_r158998002 is also a good candidate to be done in the context of this issue.
Linked with #940 and #941. The latter describes a couple of possible optimizations in https://github.com/tempesta-tech/tempesta/pull/884#discussion_r168962872 (points 2 and 3).
Also it seems that the only application of `nip_queue`/`nip_list` is to unmark non-idempotent requests in a server send queue if a client sends more request(s) after a non-idempotent request, so the list can probably be eliminated.
Linked with #1065 (HTTP requests fair scheduling).
`http.c` is becoming very complex (it's the second largest source file after the parser), so it'd be good to split the logic into HTTP server, client, and common pieces.
Currently the request processing routine ends with an `SS_Send` command. In this processing pipeline two queues are used: `forward_queue` and the work queue. The first doesn't have a distinct consumer and producer and is protected by spinlocks; the second has a distinct producer and consumer and is implemented lock-free. Probably we can revise the algorithm and get rid of the locking.

In the proposed routine, which also ends with an `SS_Send` command, the producer for the `forward_queue` and its consumer are split, so no locks for the `forward_queue` are required. Roughly the same algorithm should be implemented for response processing: push the client socket into the work queue after the response is adjusted and ready to be sent.
There is a difference between how the work queue performs now and how it should perform in my proposal. Now a task in the work queue is a single message send operation; in my proposal it is an iteration over the list, sending requests one by one. It's not wise to add a new task with the same connection: progress on a previous task will make later tasks empty, and we still need to track the budget.
Thus the work queue should be reimplemented as a round-robin queue. Pushing into the queue:

1. Send up to `per_connection_budget` requests from the `forward_queue`, and increment `total_budget` by the number of forwarded requests;
2. If `total_budget` is not 0, proceed to the next task (step 1).

PROs:

CONs:
Maybe such an approach was discussed when message pipelining was introduced, and there are some caveats I don't know about...
@ikoveshnikov good point! I appreciate further movement to the lock-free RB transport and doing as much work as possible on the local CPU. Also the proposal to move from single-message work items for the RB to full connection processing looks quite good. We've never discussed such an opportunity.
Some time ago we discussed that `http.c` is becoming too sophisticated and that it makes sense to split it into client and server parts, or TX and RX parts as in your proposal. So probably we can end up with simpler and faster HTTP processing.
> Some time ago we discussed that `http.c` is becoming too sophisticated and that it makes sense to split it into client and server parts

Totally agree here.
In the context of the last proposition, the problem with the `-EBUSY` error (which is one of the subjects of issue #1003) may go away: if the connections themselves are in the work queue, we will be able to add requests there up to the limit of the queue itself (1000 by default), and the `-EBUSY` problem (when we have no time to unload the work queue) will simply transform into exhaustion of the `fwd_queue` limit; so we'll be able to avoid rescheduling requests on the `-EBUSY` error (the current solution for issue #1003).
Also, there are several points which should be taken into account regarding the `fwd_queue` itself. We still need to make the `fwd_queue` lock-free, and in order to ease this task the proposed approach implies dividing the `push`, `send` (`read`) and `pop` operations (which currently can be processed in one thread) between three threads, i.e. separate `producer`, `sender` (`reader`) and `consumer` contexts appear. However, some additional complexities need to be considered in this scheme:

- in the `sender` thread, apparently, it will also be necessary to evict (`consume`) requests from the `fwd_queue` (timeout eviction, dropped requests, and we can still run into `SOCK_DEAD` as well);
- that eviction happens in the `sender` thread, so `msg_sent` resetting may occur in this context too;
- the `pop` and `push` operations probably should be redesigned in this approach.

Probably an easier solution than https://github.com/tempesta-tech/tempesta/issues/687#issuecomment-449298649 is possible:
Each CPU can read a `TfwConn` object from a remote CPU, but needs to 'send a message' to that CPU if it wants to send an HTTP message through its queue, close it, or whatever else. Probably the same work queues as we use for TCP sockets can be reused; it seems we don't need them any more if we have per-CPU HTTP connections at the higher layer.
Probably some CPU/connection scheduler is required for intentionally opened connections (listen-created connections are distributed by RSS/RPS). The scheduler algorithm is TBD.
One more optimization proposal from https://github.com/tempesta-tech/tempesta/issues/941#issuecomment-371494707 about more efficient handling of responses for dead client connections:

A client sends us several requests and the client connection is immediately terminated. All of the requests are in `fwd_queue`s and are being sent to servers, reforwarded and so on until all of them receive responses. Then `tfw_http_resp_fwd()` finds the empty `seq_queue` of the client connection and calls `ss_close_sync()` for each of the requests. All in all, we do a lot of unnecessary work.
It's OK to leave the dead requests in `fwd_queue`s to minimize lock contention, but it's easy and cheap to:

- avoid calling `ss_close_sync()` in `tfw_http_resp_fwd()`, since all `connection_error` and `connection_drop` callbacks follow the `ss_do_close()` call in sock.c, so `tfw_http_conn_cli_drop()` is always called on a closed socket;
- make `tfw_http_conn_cli_drop()` call `tfw_connection_put()` just like `tfw_sock_srv_connect_drop()` and `tfw_sock_srv_connect_failover()` do. When `tfw_http_resp_fwd()` calls `tfw_http_conn_msg_free()` for the last request referencing the client connection, the connection is freed.
The current architecture extensively uses the locking of `seq_queue` in client connections and `fwd_queue` in server connections to enforce the correct order of responses to HTTP requests. The locks are held for a relatively long time, and due to the multi-CPU nature of Tempesta's operation there is significant lock contention in high-load situations. As the current architecture and its behavioural patterns become better understood, it should be improved with the goal of decreasing the lock contention and improving the overall performance.