Opened by krizhanovsky 8 years ago
From #100: the most urgent thing is to keep security accounting data for a client for some time after the last client connection is closed. This is very important to track client security limits properly for Connection: close connections. See https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/client.c#L89
Since we have to evict client accounting data after 'some time', it makes sense to store it in a TDB table.
The comment has been moved to a separate issue, #1115.
The consequence of the issue also shows up in a simple test with the following configuration:
listen 192.168.100.4:80;
server 127.0.0.1:9090 conns_n=1;
cache 0;
server_queue_size 1;
In this case we get a lot of error responses:
# ./wrk -c 4096 -t 8 -d 30 http://192.168.100.4:80/
Running 30s test @ http://192.168.100.4:80/
8 threads and 4096 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 135.34ms 260.83ms 1.98s 89.95%
Req/Sec 8.12k 2.74k 29.53k 72.39%
1934441 requests in 30.09s, 207.60MB read
Socket errors: connect 0, read 0, write 0, timeout 882
Non-2xx or 3xx responses: 1916248
Requests/sec: 64296.04
Transfer/sec: 6.90MB
The only server queue is busy, but we continue to read new requests and just send error responses to them. This activity wastes resources and degrades user experience. Instead, we must politely slow down clients if we're unable to process their requests, and stop reading their requests.
The problem is the subject of #940 (requests queueing if there is no backend connection).
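A rough user-space illustration of the "stop reading" backpressure idea (this is not Tempesta code; the epoll-based event loop and the helper name are hypothetical): while the upstream queue is full, stop polling the client socket for input, so the kernel stops draining its receive buffer and the client's advertised TCP window eventually closes.

```c
#include <sys/epoll.h>

/*
 * While the upstream queue is full, stop watching the client fd for
 * EPOLLIN.  Unread data then accumulates in the kernel receive buffer,
 * the advertised TCP window shrinks to zero, and the client is slowed
 * down instead of being answered with error responses.
 */
static int
client_backpressure(int epfd, int client_fd, int queue_full)
{
	struct epoll_event ev = {
		.events = queue_full ? EPOLLOUT : (EPOLLIN | EPOLLOUT),
		.data = { .fd = client_fd },
	};

	return epoll_ctl(epfd, EPOLL_CTL_MOD, client_fd, &ev);
}
```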
Comment for tfw_client_obtain(): actually, clients can be relatively reliably identified by TLS sessions (#1054) or HTTP sticky cookies, so probably current HTTP sessions and TLS sessions should be aggregated in some way to provide robust client identification for machine learning classification. See countermeasures against the session fixation attack.
UPD The comment is wrong. Malicious users can ignore TLS session resumption, so each session will have a different ID and be tracked as a different client. Probably we can use TLS session identifiers as a replacement for, or together with, sticky cookies, but a client session and a client are different objects.
Several important notes from the last conversations:
- there should be /proc/tempesta statistics for clients which is used for the QoS. However, since there are many clients, the statistics must be aggregated somehow. The exact method for aggregation and representation is TBD.
- the QoS must not affect whitelisted clients at all, and must affect clients passing the JS/Cookie challenge (trusted clients) with much lower weights.
- a client's receive TCP buffer can be reduced to 0. In this case the QoS simply implements the TARPIT blocking technique (see the sketch right after this list).
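A minimal user-space analogue of the TARPIT idea, assuming a plain connected socket fd (in Tempesta this would be done on the kernel socket structures instead):

```c
#include <sys/socket.h>

/*
 * Shrink the client's receive buffer towards zero so its TCP window
 * collapses and it can barely make progress.  The kernel clamps the
 * value to its allowed minimum, so this throttles rather than freezes.
 */
static int
tarpit_client(int client_fd)
{
	int rcvbuf = 0;

	return setsockopt(client_fd, SOL_SOCKET, SO_RCVBUF,
			  &rcvbuf, sizeof(rcvbuf));
}
```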
Some of the ideas for the issue are inspired by Web2K: Bringing QoS to Web Servers and the AIS Danger Theory. Basic client clustering and classification is required.
Stress calculation
HTTP QoS works with 3 input sources:
There should be 3 stress modules: local system (running Tempesta) stress, upstream servers stress, and dangerous clients. The following local stress parameters must be monitored and their triggering values must be configurable:
- tcp_enter_cwr(), tcp_enter_loss() or some other TCP CWND modification function. Check for the appropriate NET_INC_STATS() calls for the stress accounting.
- memory pressure (tcp_under_memory_pressure() & Ko) is an obvious solution here.
Upstream servers stress basically should be based on APM results. In addition to DDoS mitigation, the technique mitigates application upstream server livelock caused by Tempesta FW processing: if an upstream server works on the same host as Tempesta FW, then softirq monopolizes the CPU, so all ingress traffic is processed by Tempesta FW only, leaving no CPU resources for user-space activities. Mogul addressed the issue. The following arguments must be measured:
- ratio of the number of received requests (ReqNum) by a client to the number of forwarded responses to it (RespNum). This is a per-client measurement (this is the Mēris DDoS case).
A client is obviously dangerous if it ignores our cookies or JS challenge, i.e. it just sends us many requests without setting the cookie and ignores JS timers. It could be just a dummy web client, which is probably fine, so we should analyze other performance measurements for the client (how many requests it sends, request/response rates etc.). See the static limit implementation for #535.
If any of the stress modules is triggered by exceeding a system or upstream limit, e.g. packet drops or upstream response time, then the greediest clients (those sending the largest number of packets or having the highest ReqNum/RespNum value, correspondingly) must get reduced TCP windows for all their connections, or all their connections must be closed.
TBD: Different locations and request methods can load a server differently, and we should not rely on the "average load".
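A sketch of the per-client ReqNum/RespNum accounting described above; the structure, field names and the ratio threshold are illustrative, not the actual Tempesta data model:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-client accounting. */
struct client_acct {
	uint64_t req_num;	/* requests received from the client */
	uint64_t resp_num;	/* responses forwarded back to it */
};

/*
 * On a stress event, pick the greediest clients: those with the highest
 * ReqNum/RespNum ratio.  The +1 in the denominator ranks a client that
 * received no responses at all as the worst offender.
 */
static bool
client_is_greedy(const struct client_acct *c, uint64_t ratio_limit)
{
	return c->req_num / (c->resp_num + 1) >= ratio_limit;
}
```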
QoS
Different resources of the same vhost, different vhosts, as well as different server groups can require different QoS. Local stress module configuration is global, while limits for upstream server response time must be configured with that granularity (e.g. a server group serving static content responds faster than dynamic content servers running database queries). QoS of a particular resource/vhost/server group must be specified by an integer from 0 to 100: a higher value means higher QoS. The default value is 50. If we cannot process a request for a client or resource with a high QoS value, then a stress event is triggered and connection eviction or TCP window size reduction must be initiated.
Typically, arrays of lists indexed by client QoS weights should be used to avoid big locks on reordering. If a client changes its QoS weight, it is moved to the list for the appropriate QoS value.
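A sketch of the "array of lists indexed by QoS weight" structure, assuming hypothetical type and function names; real code would need per-bucket locking or RCU, which is omitted here:

```c
#define TFW_QOS_MAX	100	/* QoS weights are integers in [0, 100] */

/* Hypothetical client descriptor with an intrusive list hook. */
struct qos_client {
	int			weight;
	struct qos_client	*prev, *next;
};

/* One list head per QoS weight: changing a weight touches two buckets only. */
static struct qos_client *qos_buckets[TFW_QOS_MAX + 1];

static void
qos_link(struct qos_client *c, int weight)
{
	c->weight = weight;
	c->prev = NULL;
	c->next = qos_buckets[weight];
	if (c->next)
		c->next->prev = c;
	qos_buckets[weight] = c;
}

static void
qos_unlink(struct qos_client *c)
{
	if (c->prev)
		c->prev->next = c->next;
	else
		qos_buckets[c->weight] = c->next;
	if (c->next)
		c->next->prev = c->prev;
	c->prev = c->next = NULL;
}

/* Move an already-linked client to the bucket for its new weight. */
static void
qos_set_weight(struct qos_client *c, int new_weight)
{
	qos_unlink(c);
	qos_link(c, new_weight);
}
```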
We need to provide traffic shaping for vhosts. Basic L3 traffic shaping can be done by tc: lower Qdisc bandwidth will raise an upstream stress event, so QoS for all the vhost's clients must be reduced. However, a configuration option for HTTP requests per second must be introduced for vhosts.
The QoS API must be generic enough for further ML classifiers working in user space, so complex clustering algorithms can be used to set client QoS more accurately. While HTTP messages are offloaded to user space by #77, we also need to export client statistics to user space.
Since QoS rules can be used for DDoS mitigation, it's expected that there will be plenty of rules and that most of them can be changed dynamically. So the rules should be stored in TDB (probably with some eviction strategy) and be analyzable with tdbq. In this sense the issue relates to #731.
TCP-BPF addresses per-connection TCP parameters (e.g. buffer sizes, cwnd etc.). Also, BBRx and PCC-Vivace employ ML for congestion control algorithms. So it makes sense to change not only the receive window as a dynamic QoS parameter: besides dynamic performance optimization, we must solve the probable malicious resource exhaustion problem.
BTW, in some cases client and backend connections can work in very different network environments, e.g. poor, distant Internet connections for clients and fast LAN connections to backends. So consider (maybe move to a new task) setting different congestion control algorithms for client and server connections, as well as using different and dynamically calculated parameters as in TCP-BPF.
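For the per-connection congestion control idea, a user-space illustration using the standard TCP_CONGESTION socket option (the algorithm names are examples and must be available in the kernel, see /proc/sys/net/ipv4/tcp_allowed_congestion_control):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Pick a congestion control algorithm for one connection. */
static int
set_cong_ctl(int fd, const char *algo)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
			  algo, strlen(algo));
}

/*
 * Example: a loss-tolerant algorithm towards far clients and the default
 * one towards LAN backends, e.g.
 *	set_cong_ctl(client_fd, "bbr");
 *	set_cong_ctl(server_fd, "cubic");
 */
```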
Server/vhost QoS
There are 2 types of QoS: client and server QoS. Server/vhost QoS is static, defined by a system administrator, and defines how important a resource is; client QoS is dynamic and is calculated depending on the current system and backend stress caused by the client. Client QoS is expressed by TCP receive buffer and receive window correspondingly, i.e. it essentially manipulates socket throughput in bytes. Meanwhile, it makes sense to manage server/vhost QoS in terms of RPS.
Currently we have ratio load balancing working in terms of forwarded requests. QoS, just like tc, should care about the minimum provided RPS for a particular vhost/server (if one wants throughput QoS, they can use tc for backend IP addresses). In particular, 2 configuration options must be introduced:
- qos_rps <value in rps> - minimum provided RPS for a vhost
- qos_delay <percentile> <value in ms> - maximum request delay
APM provides percentile delays and shall provide RPS statistics. If APM observes values worse than the QoS statistics configured for a particular vhost/server, it should trigger an upstream server stress event, so some client TCP receive buffers must be reduced to free more resources. The question is how to define the "some clients" to suppress: we should not(?) suppress clients requesting the crucial vhosts (we have a minimum RPS, clients want more RPS, and it's wrong to shape them out). Probably we should leave the question for more advanced ML with client clustering and just limit the most active clients for this task. A better solution is TBD.
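A sketch of the qos_rps/qos_delay check against APM measurements; the structures and field names are hypothetical:

```c
#include <stdbool.h>

/* Hypothetical per-vhost QoS configuration and APM measurements. */
struct vhost_qos {
	unsigned int	qos_rps;	/* minimum provided RPS */
	unsigned int	qos_delay_ms;	/* maximum delay at the percentile */
};

struct vhost_apm {
	unsigned int	rps;		/* currently served RPS */
	unsigned int	delay_ms;	/* measured percentile delay */
};

/*
 * Per the text above, an upstream server stress event fires when the vhost
 * falls below its guaranteed RPS or its percentile delay exceeds the limit.
 */
static bool
vhost_stress(const struct vhost_qos *qos, const struct vhost_apm *apm)
{
	return apm->rps < qos->qos_rps || apm->delay_ms > qos->qos_delay_ms;
}
```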
There could be a wrong configuration, e.g. qos_delay less than the network delay. It is completely wrong to try to shape clients in this case, so we should stop after some number of attempts and print a warning that we cannot achieve the specified QoS.
We need to carefully analyze how our server QoS (also as a stress trigger), working in terms of RPS, can cooperate with native Linux QoS working with PPS and throughput. In particular, can we handle a QoS stress trigger in TCP CA function calls? How can we configure RPS QoS and integrate it with Linux QoS?
TCP flow control
Stress configurations must have soft and hard limits. Reaching the soft limit triggers TCP window size reduction AND stops accepting new connections under system stress, while the hard limit requires immediate resource freeing, so connections must be evicted (closed) immediately. Closing connections can be very harmful for session-oriented Web services, so this is the last thing we should do. Closing a connection can be done by sending an HTTP error message, normal TCP closing (sending FIN), resetting (sending RST), or silent dropping (just silently freeing the connection data structures - the roughest method, but the most efficient under DDoS). The behavior must be configurable. Meanwhile, QoS for resources should be guaranteed by TCP window reduction only.
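A user-space sketch of the configurable closing modes (HTTP error, FIN, RST); the enum and helper are hypothetical, and the "silent drop" mode has no user-space analogue since it just frees kernel structures:

```c
#include <sys/socket.h>
#include <unistd.h>

/* Configurable ways to get rid of a connection under hard stress. */
enum evict_mode {
	EVICT_HTTP_ERROR,	/* send an HTTP error, then close */
	EVICT_FIN,		/* normal close() -> FIN */
	EVICT_RST,		/* abortive close -> RST */
};

static void
evict_connection(int fd, enum evict_mode mode)
{
	static const char err[] =
		"HTTP/1.1 503 Service Unavailable\r\nContent-Length: 0\r\n\r\n";

	switch (mode) {
	case EVICT_HTTP_ERROR:
		(void)send(fd, err, sizeof(err) - 1, MSG_NOSIGNAL);
		break;
	case EVICT_RST: {
		/* SO_LINGER with a zero timeout makes close() send RST. */
		struct linger lg = { .l_onoff = 1, .l_linger = 0 };
		setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
		break;
	}
	case EVICT_FIN:
		break;
	}
	close(fd);
}
```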
Suppose we have 2 sockets: sock_rcv - a socket on which we currently read data, and sock_snd - a socket through which we forward the received data. sock_snd may not be able to send all the received data due to TCP congestion and receive windows, as well as other competing receive sockets (e.g. if we have several client sockets sending requests through the same server socket). Thus our announced receive window for sock_rcv must be influenced by the congestion and receive windows of sock_snd. Moreover, since TCP windows are dynamic, we have to keep some more data in the TCP send queue, in addition to the data on the wire, to be able to immediately send more data. However, there is no sense to obey the tcp_wmem limitation: the server_queue_size, server_forward_timeout, and proxy_buffering limits from #498 apply here.
Besides limiting the client TCP window size, we might need to limit the window size on the upstream HTTP/1 TCP connection to block HTTP/2 amplification attacks.
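A sketch of coupling the receive window announced on sock_rcv with the state of sock_snd; the function, its inputs and the extra-buffering factor are illustrative, not the kernel calculation:

```c
#include <stdint.h>

static uint32_t
clamp_rcv_window(uint32_t own_rcv_wnd,     /* what sock_rcv could announce */
		 uint32_t snd_wnd_free,    /* usable window left on sock_snd */
		 uint32_t snd_queue_bytes, /* data already queued on sock_snd */
		 uint32_t queue_limit)     /* e.g. derived from server_queue_size */
{
	uint32_t queue_room = snd_queue_bytes < queue_limit
			      ? queue_limit - snd_queue_bytes : 0;
	/*
	 * Allow somewhat more than the current send window so we can burst
	 * as soon as the peer opens its window, but never exceed the room
	 * left in our own forwarding queue.
	 */
	uint32_t budget = snd_wnd_free + snd_wnd_free / 2;

	if (budget > queue_room)
		budget = queue_room;
	return own_rcv_wnd < budget ? own_rcv_wnd : budget;
}
```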
TODO. Currently we don't bother with Tempesta socket memory limitations since in proxy mode we just forward packets instead of making real allocations. Probably this is an issue. Probably sockets can be freed from under us. See the __sk_free(sk) call in sock_wfree().
HTTP flow control
HTTP/2 (#309) and HTTP/3 (#724) provide flow control close to TCP's, so full HTTP proxying (#1125) makes the same TCP flow control concepts described above applicable to the HTTP window.
HTTP/2 HPACK, and correspondingly HTTP/3 QPACK, introduce an HTTP/2 amplification threat which must be handled with #498 flow control and the QoS of this issue. Basically, we need to compare and limit the ratio between ingress HTTP/2 and egress HTTP/1.1 traffic.
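A sketch of the ingress-HTTP/2 to egress-HTTP/1.1 ratio limit; the counters and threshold are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-client byte counters. */
struct h2_amp_acct {
	uint64_t h2_in_bytes;	/* compressed HTTP/2 (HPACK) ingress */
	uint64_t h1_out_bytes;	/* decompressed HTTP/1.1 egress to upstream */
};

/*
 * HPACK lets a tiny HTTP/2 frame expand into a large HTTP/1.1 request,
 * so limit how much a client may amplify its ingress traffic.
 */
static bool
h2_amplification_exceeded(const struct h2_amp_acct *a, uint64_t max_ratio)
{
	return a->h1_out_bytes > (a->h2_in_bytes + 1) * max_ratio;
}
```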
This task also relates to equal QoS (root stream prioritization) for different clients. See The ultimate guide to HTTP resource prioritization (task #1196).
In the case of HTTP/2 or QUIC <-> Tempesta <-> HTTP/2 or QUIC (see #1125), we might need to propagate the client flow control settings to the upstream connections to block HTTP/2 amplification attacks.
Clients handling
Currently we identify clients by their IP only. A new configuration option must be introduced to specify which data should be used for client identification (sticky cookie, IP, User-Agent, Referer, etc.). Early client operation must still be done by IP address, for a parent client, for Frang low-level limiting. However, as an HTTP request is read, a new child client must be "forked" and used thereafter for accounting.
The filter module must initiate client connection closing (in the configured fashion) when it adds the client's address to the blocking table. That will free unnecessary resources faster.
We also must implement default and Keep-Alive header-defined timeouts for open connections. Timers from #387 must be integrated with the eviction strategy for TfwCliConnection and the TCP window calculation (#488).
Connection eviction and TCP window size reduction must be done in a separate kernel thread. Old connections are typically proven, so the thread should increase QoS for the connections from time to time to mitigate previous penalties on them. When the QoS value for a client is increased, the client's connections must increase their TCP receive window and socket write memory. A connection should never receive QoS higher than specified in the config.
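A sketch of the periodic "forgiveness" pass run by such a thread; the structure, step and ceiling handling are illustrative:

```c
/* Hypothetical dynamic QoS state of a client. */
struct qos_client_state {
	int	qos;		/* current dynamic QoS, 0..100 */
	int	qos_max;	/* ceiling from the configuration */
};

/*
 * Periodically raise a penalized client's QoS back towards the configured
 * ceiling; the caller would then enlarge the TCP receive window and socket
 * write memory of the client's connections.
 */
static void
qos_age_client(struct qos_client_state *c, int step)
{
	c->qos += step;
	if (c->qos > c->qos_max)
		c->qos = c->qos_max;
}
```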
Consider WebSocket (#755): reduce QoS for clients who try to exhaust system resources using slow WebSocket connections.
JavaScript Challenge
Must set proper timeout for JavaScript challenge. [Is this still relevant after #1102 and #2025?]
Client QoS should also be decreased for 'suspicious' clients by sending them JSCH. From #1102:
Cloudflare also does not send JSCH on each request, only on particular triggers (GeoIP, IP reputation, WAF rules).
References
Need to explore mitigation techniques cited by the paper (also referenced by #496), especially ReDoS vulnerabilities and Algorithmic complexity attacks.
Shenango uses ring-buffer work queue saturation as an indicator for overload and this technique is also usable for this issue.
Cloudflare's probabilistic approach to providing per-flow QoS. While the DDoS attack isn't blocked, innocent streams aren't hurt.
Understanding Host Network Stack Overheads proposes to adjust TCP buffers depending on the current TCP state (e.g. windows).