tempesta-tech / tempesta

All-in-one solution for high performance web content delivery and advanced protection against DDoS and web attacks
https://tempesta-tech.com/
GNU General Public License v2.0

Move timers to kernel thread #736

Open krizhanovsky opened 7 years ago

krizhanovsky commented 7 years ago

Currently, we do some calculations on a timer basis, and those calculations can be relatively expensive. Meanwhile, the timer is serviced in the same softirq that processes network traffic, so it can lead to packet loss. A separate kernel thread should be introduced to manage such non-crucial logic (e.g. connection eviction and statistics). Another option and/or feature of the timer implementation is described in https://github.com/tempesta-tech/tempesta/issues/736#issuecomment-369945612 as a two-mode timer implementation balancing between softirq shots and timer interrupts.

It's worth mentioning that the Linux TCP code also uses timers, so additional research into the behavior of concurrently armed timers on thousands of connections (e.g. slow DDoS) is required.

krizhanovsky commented 6 years ago

tfw_apm_add_srv() creates a timer for each server, so in the scenario of #76 with thousands of servers it creates thousands of timers. As a result, Tempesta FW without any load eats CPU with 10K servers:

    26.64%  [tempesta_fw]  [k] tfw_apm_prcntl_tmfn
    16.04%  [kernel]       [k] __lock_acquire
     6.30%  [kernel]       [k] lock_release
     5.67%  [kernel]       [k] run_timer_softirq
     5.48%  [kernel]       [k] check_chain_key

A kernel thread should traverse the counters and sleep until the timeout for the earliest updated APM entry elapses.
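
As a minimal sketch (not existing Tempesta code), such a thread could look roughly like the following; the names apm_entry, apm_list and apm_recalc() are assumptions for illustration, and list locking is omitted:

    #include <linux/jiffies.h>
    #include <linux/kthread.h>
    #include <linux/list.h>
    #include <linux/sched.h>

    struct apm_entry {
        struct list_head    list;
        unsigned long       deadline;   /* jiffies of the next recalculation */
    };

    static LIST_HEAD(apm_list);

    static void apm_recalc(struct apm_entry *e) { /* percentile math goes here */ }

    static int apm_kthread(void *unused)
    {
        while (!kthread_should_stop()) {
            unsigned long next = jiffies + HZ; /* default 1s sleep */
            struct apm_entry *e;
            long delay;

            list_for_each_entry(e, &apm_list, list) {
                if (time_after_eq(jiffies, e->deadline)) {
                    apm_recalc(e);
                    e->deadline = jiffies + HZ / 10;
                }
                if (time_before(e->deadline, next))
                    next = e->deadline;
            }
            /* Sleep until the earliest deadline instead of arming per-server timers. */
            delay = (long)(next - jiffies);
            if (delay > 0)
                schedule_timeout_interruptible(delay);
        }
        return 0;
    }

The thread would be started once with kthread_run(apm_kthread, NULL, "tfw_apm"), so there is a single sleeping thread instead of 10K armed timers.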

krizhanovsky commented 6 years ago

The issue simply hangs the server with 30K server groups.

The issue seems relatively complex. There are several thoughts about the requirements for the threads replacing the timers:

It's worth mentioning that the comment for inet_csk_init_xmit_timers() suggests replacing the 3 timers with only one that uses varying delays depending on the expected event.

Client keep-alive timer

Currently tfw_cli_conn_send() can be called from a softirq serving a server response on a CPU X and modifies the timer on that CPU. Meanwhile, the timer was set up by tfw_cli_conn_alloc() on establishing a new client connection on CPU Y (the CPU native to the client connection). So it makes sense to replace the timer with per-CPU lists of client connections and make a CPU gate, like ss_send()/ss_do_send(), which reinserts a connection on transmission. The old connections can be evicted (closed) at the same time as the reinsertion and by a timer or a new eviction thread. The eviction thread should be lighter than a timer because it doesn't require mod_timer() and can exit quickly after just checking the timestamp of the client connection at the head of the queue.
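
A minimal sketch of the per-CPU queue idea, assuming hypothetical names (cli_conn, ka_queue, cli_conn_touch(), cli_conn_evict_old()); each per-CPU list head is assumed to be initialized with INIT_LIST_HEAD() at start:

    #include <linux/jiffies.h>
    #include <linux/list.h>
    #include <linux/percpu.h>

    struct cli_conn {
        struct list_head    ka_list;
        unsigned long       last_active;    /* jiffies of last transmission */
    };

    static DEFINE_PER_CPU(struct list_head, ka_queue);

    /* Called on the connection's home CPU (through an ss_send()-like CPU gate). */
    static void cli_conn_touch(struct cli_conn *conn)
    {
        conn->last_active = jiffies;
        list_move_tail(&conn->ka_list, this_cpu_ptr(&ka_queue));
    }

    /* Cheap eviction pass: only the queue head can be expired. */
    static void cli_conn_evict_old(unsigned long ka_timeout)
    {
        struct list_head *q = this_cpu_ptr(&ka_queue);
        struct cli_conn *conn;

        while ((conn = list_first_entry_or_null(q, struct cli_conn, ka_list))) {
            if (time_before(jiffies, conn->last_active + ka_timeout))
                break;      /* the head is not expired, so nothing behind it is */
            list_del_init(&conn->ka_list);
            /* close the connection here */
        }
    }

Since transmissions append to the tail, the queue stays ordered by last activity and the eviction pass touches at most the expired prefix of the list.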

Another option is to modify the TCP keepalive timer to drop an inactive connection instead of sending a TCP probe. In this case the standard behaviour must be preserved and the new one introduced for client connections only. The pro of this approach is only one timer in the system instead of two; the con is a larger kernel patch.

APM percentiles update

The reason for tfw_apm_prcntl_tmfn() to be at the top is heavy calculations with many loops, not the timers themselves. If possible, tfw_apm_prcntl_tmfn() should be reworked to:

  1. perform necessary computation only on the local CPU;
  2. use fewer loops;
  3. make computations on response time updates rather than by timeout (see the sketch below).
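
A minimal sketch of point 3, assuming a hypothetical per-CPU histogram (apm_hist, apm_account()): the hot path only increments a local bucket, so the expensive percentile pass can run rarely in a kthread or on demand, and the logarithmic buckets give declining accuracy similar to the timer wheel:

    #include <linux/kernel.h>
    #include <linux/log2.h>
    #include <linux/percpu.h>

    #define APM_BUCKETS 32

    struct apm_hist {
        unsigned long bucket[APM_BUCKETS];  /* response time histogram */
    };

    static DEFINE_PER_CPU(struct apm_hist, apm_hist);

    /* Hot path, softirq context: called for each response with its latency in us. */
    static void apm_account(unsigned long rtt_us)
    {
        unsigned int b = rtt_us ? min_t(unsigned int, ilog2(rtt_us), APM_BUCKETS - 1) : 0;

        this_cpu_inc(apm_hist.bucket[b]);
    }

Percentiles are then estimated by a rare pass that sums the buckets across CPUs, so no loops are executed in softirq context at all.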

This particular point is linked with issue #712 (Review APM & Ratio scheduler). If a "good" APM should use more complex algorithms, then with #712 in mind the whole analytics must be moved to user space and probably get cleaned (e.g. sampled) data from the kernel.

References

See the timer wheel design along with the corresponding code comment: there are already timer arrays with declining accuracy to batch the timer events, along with timer slack.

krizhanovsky commented 6 years ago

More timer issues

BH synchronizations

Note that #916 introduces multiple _bh() synchronization calls, e.g. https://github.com/tempesta-tech/tempesta/pull/916#discussion_r170275718 . By moving some work from softirq (timer) context to kernel threads, the synchronization can be relaxed to remove the _bh()'s.
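
A minimal illustration of the relaxation (not a patch; srv_stats and its locks are hypothetical): while a softirq timer callback updates the data, every process-context user has to disable BHs; once the updater is a kernel thread, a plain lock or even a sleeping mutex is enough:

    #include <linux/mutex.h>
    #include <linux/spinlock.h>

    struct srv_stats {
        unsigned long   total;
        spinlock_t      lock;   /* needed while a softirq timer touches the data */
        struct mutex    mtx;    /* sufficient once only a kthread updates it */
    };

    /* Today: a process-context reader must block softirqs. */
    static unsigned long stats_read_bh(struct srv_stats *s)
    {
        unsigned long v;

        spin_lock_bh(&s->lock);
        v = s->total;
        spin_unlock_bh(&s->lock);
        return v;
    }

    /* After moving the updater to a kthread: no _bh() and no BH-off section. */
    static unsigned long stats_read(struct srv_stats *s)
    {
        unsigned long v;

        mutex_lock(&s->mtx);
        v = s->total;
        mutex_unlock(&s->mtx);
        return v;
    }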

Also keep this comment https://github.com/tempesta-tech/tempesta/pull/916#discussion_r170582808 in mind.

Server connections

Currently server connections, even for the initial attempt in tfw_sock_srv_connect_srv(), are established through a timer. It seems this was done for simplicity of implementation, but technically it looks dirty.

Keep alive timer on client connections

#1428 happened due to timer mismanagement: there are 4 possible concurrent events on a client socket:

  1. keep alive timer, probably closing a corresponding client connection (the case for #1428)
  2. transmission on the socket (e.g. response forwarding)
  3. connection_drop hook call, e.g. on connection close from the peer
  4. connection closing from Tempesta side with appropriate TLS close notify alert transmission (the case for #1428)

So we had to introduce a timer_lock in #1429 to prevent the races.

We can introduce a socket state (closed/active) and a timestamp for the KA timer. All the socket connections residing on a local CPU can be arranged in a queue ordered by the KA timestamp. Each softirq shot just needs to check the queue head for elapsed timestamps in a busy-looping mode. If there is not enough traffic, then we should fall back to timer interrupts. I.e. it makes sense to introduce a timer implementation for the Tempesta code which works in two modes: softirq busy looping and timer interrupts.
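
A rough sketch of the two-mode idea, assuming hypothetical names (tmode_timer, ka_poll(), ka_check_queue_head()): while softirq shots keep arriving, they check the queue head and keep pushing a fallback timer out, so the timer only fires when traffic stops:

    #include <linux/jiffies.h>
    #include <linux/timer.h>

    #define KA_FALLBACK_DELAY   (HZ / 10)

    static struct timer_list tmode_timer;

    static void ka_check_queue_head(void)
    {
        /* Close connections at the queue head whose KA timestamp has elapsed. */
    }

    /* Mode 1: called from every softirq shot while packets are flowing. */
    static void ka_poll(void)
    {
        ka_check_queue_head();
        /* Re-arm the fallback timer only when it gets close to firing. */
        if (time_before(tmode_timer.expires, jiffies + KA_FALLBACK_DELAY / 2))
            mod_timer(&tmode_timer, jiffies + KA_FALLBACK_DELAY);
    }

    /* Mode 2: no traffic, so the timer interrupt does the same check. */
    static void ka_timer_fn(struct timer_list *t)
    {
        ka_check_queue_head();
        mod_timer(&tmode_timer, jiffies + KA_FALLBACK_DELAY);
    }

The timer would be set up once with timer_setup(&tmode_timer, ka_timer_fn, 0); under load it is rarely modified and the per-packet cost is just the queue-head check.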

Too many timer modifications

At the moment at least clnt_sock.c and websocket.c call mod_timer() on each ingress packet. Given that we normally may have a 1 hour keep-alive timer, we do way too many timer updates. It could make sense to just let the timer expire and check the connection timestamp.
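
A minimal sketch of that approach (hypothetical names: conn, last_rcv, KA_TIMEOUT): the receive path only stores a timestamp, and the long keep-alive timer is touched at most once per expiry instead of on every ingress packet:

    #include <linux/jiffies.h>
    #include <linux/timer.h>

    #define KA_TIMEOUT  (3600 * HZ) /* ~1 hour keep-alive */

    struct conn {
        struct timer_list   ka_timer;
        unsigned long       last_rcv;   /* jiffies of the last ingress packet */
    };

    /* Hot path: no mod_timer(), just a store. */
    static void conn_on_rcv(struct conn *c)
    {
        WRITE_ONCE(c->last_rcv, jiffies);
    }

    /* Cold path: fires at most once per KA_TIMEOUT. */
    static void conn_ka_timer_fn(struct timer_list *t)
    {
        struct conn *c = from_timer(c, t, ka_timer);
        unsigned long last = READ_ONCE(c->last_rcv);

        if (time_after_eq(jiffies, last + KA_TIMEOUT)) {
            /* Really idle: close the connection here. */
            return;
        }
        /* There was activity since arming: push the timer to the real deadline. */
        mod_timer(&c->ka_timer, last + KA_TIMEOUT);
    }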