tempesta-tech / tempesta

All-in-one solution for high performance web content delivery and advanced protection against DDoS and web attacks
https://tempesta-tech.com/
GNU General Public License v2.0

HTTP message buffering and streaming #498

Open krizhanovsky opened 8 years ago

krizhanovsky commented 8 years ago

Probably good to be done together with #1902

General requirements

Tempesta must support two modes of operation: HTTP message buffering, as now, and streaming. In the current mode of operation all HTTP messages are buffered, i.e. we deliver a proxied request or response only when we have fully received it. In streaming mode, by contrast, each received skb must be forwarded to the client or server immediately.

Full buffering and streaming are two edge modes, and an intermediate mode must be supported as well: partial message buffering based on the TCP receive buffer.

HTTP headers must always be buffered - we need the full set of headers to decide what to do with the message and how to forward, cache, or otherwise process it.

Configuration

The behavior must be controlled by new configuration options. Since Linux doesn't expose per-socket sysctls to user space, we have to introduce our own memory limits for server and client sockets separately.

  1. client_mem <soft_limit> <hard_limit> - controls how much memory is used to store unanswered client requests and requests with linked responses which cannot be forwarded to a client yet. If soft_limit is zero, then streaming mode is used, i.e. each received skb is forwarded immediately. Otherwise the message is buffered, but by no more than soft_limit bytes. hard_limit is 2 * soft_limit by default; see the description in the Security operation section.

  2. client_msg_buffering N - controls message buffering. Buffer only the first N bytes of a request when forwarding requests to backend servers. Longer messages are forwarded part by part and are never fully assembled in Tempesta. If the request headers are longer than N bytes, they are still buffered, since the full set of headers is required to correctly serve and forward the request. The limit is applied per message and must not exceed the per-connection client_conn_buffering limit.

  3. server_msg_buffering N - the same as client_msg_buffering, but for server connections.
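For illustration only, the proposed directives might be used together like this (the values are arbitrary and the final syntax is subject to the implementation):

```
client_mem 16777216 33554432;    # 16MB soft limit, 32MB hard limit
client_msg_buffering 1048576;    # buffer at most 1MB of a request
server_msg_buffering 1048576;    # buffer at most 1MB of a response
```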

Previous attempts to implement the issue used client_rmem, very similar to the tcp_rmem sysctl; the current client_mem is quite different, however, because it also accounts for linked server responses.

TCP interactions

All ingress skbs are immediately evicted from the TCP receive queue and ACKed, so at the moment we don't use the TCP receive buffer at all. With the new enhancement we must account all HTTP data kept in Tempesta memory as residing in the TCP receive buffer of the socket on which the data was received, so TCP will advertise smaller receive windows. See ss_rcv_space_adjust() and tfw_cli_rmem_{reserve,release}() in #1183.

client_mem <soft_limit> <hard_limit> determines the TCP receive window, but it also accounts for responses. A simplified example for client_mem 10 20:

0. initial rwnd=10
1. receive 5B request:  client_mem=5, announce rwnd=5
2. receive 10B response:  client_mem=15, announce rwnd=0
3. forward response & free request: client_mem=0, announce rwnd=10
4. receive 3 pipelined reqs of 10B size in total: client_mem=10, rwnd=0
5. receive a resp of 5B for 2nd req: client_mem=15 (we have to keep the resp)
6. receive a resp of 5B for 3rd req: client_mem=20=hard_limit -> drop the connection
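A minimal C sketch of this accounting, with hypothetical names (not the actual Tempesta API), just to illustrate how client_mem could clamp the advertised window:

```c
#include <stddef.h>

/*
 * Clamp the receive window advertised to a client by the remaining
 * client_mem budget, so buffered requests and their linked responses
 * shrink the window (cf. steps 1-2 in the example above).
 */
static size_t
cli_rwnd_clamp(size_t tcp_rwnd, size_t soft_limit, size_t mem_used)
{
	size_t budget = mem_used < soft_limit ? soft_limit - mem_used : 0;

	/* Never announce more than the remaining soft-limit budget. */
	return tcp_rwnd < budget ? tcp_rwnd : budget;
}
```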

In proxy mode we have to slow down fetching data from the server TCP receive queue if we are reading a response for a slow client which can't consume it at the same speed. Otherwise many clients that are just somewhat slower than the servers (they don't necessarily have to be really slow!) can exhaust our RAM. This may lead to a HoL blocking problem: a response pipelined by the server after the problematic response stays in the queue for a very long time, and a good, quick client experiences significant delays. To cope with the issue, server_queue_size and a larger conns_n should be used for the server group (please add this to the Wiki!), as in the sketch below. Dynamically allocated server connections from https://github.com/tempesta-tech/tempesta/issues/710 and server HTTP/2 https://github.com/tempesta-tech/tempesta/issues/1125 are more robust solutions to the problem.
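E.g., a server group configured roughly like this (the numbers are purely illustrative; see the configuration docs for the exact directive placement):

```
srv_group app {
    server 10.0.0.1:8080 conns_n=128;
    server_queue_size 2000;
}
```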

The following performance counters must be implemented for traceability of the feature (e.g. to debug the problem above):

HTTP streams

HTTP/2 (#309) and HTTP/3 (QUIC, #724) introduce flow control which can efficiently throttle clients, so it seems the TCP window adjustments make sense only for HTTP/1.1, and the issue depends heavily on QUIC and HTTP/2. RFC 7540 5.2.2 starts right from the problem of this task: memory constraints and too-fast clients, which must be throttled via WINDOW_UPDATE.

The security aspect of the issue is that clients can request quite large resources while announcing very small windows (see RFC 7540 10.5), leading to memory exhaustion on our side (they can already do the same with TCP and HTTP/1.1).
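A minimal plain-C sketch of the flow-control idea, with hypothetical names: grant window back via WINDOW_UPDATE only for data that has actually been drained to the slow side, and avoid a storm of tiny updates:

```c
#include <stdbool.h>
#include <stddef.h>

struct h2_stream {
	size_t recv_window;	/* window currently advertised to the peer */
	size_t drained;		/* bytes forwarded since the last WINDOW_UPDATE */
};

/* Decide whether to send a WINDOW_UPDATE of *increment bytes now. */
static bool
h2_need_window_update(struct h2_stream *s, size_t max_window, size_t *increment)
{
	/* Wait until at least half of the window has been drained. */
	if (s->drained < max_window / 2)
		return false;

	*increment = s->drained;
	s->recv_window += s->drained;
	s->drained = 0;
	return true;
}
```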

At least the following things must be done in this issue:

X-Accel-Buffering header processing must be implemented to let the upstream application manage the buffering (e.g. Dropbox does this).

If we receive an RST_STREAM frame in streaming mode, then we should reset our stream with the upstream as well and store only the already transferred head of the response in the cache.

Security operation

#995 gives an example of how a client can exhaust memory with the very first blocking request and many pipelined requests with large responses. So client_mem must account for the whole memory spent on a client. If a client reaches the soft limit, a zero receive window is sent. However, server responses for already processed requests may continue to arrive, and if the hard_limit is reached, the client connection must be dropped (we have no chance to send a normal error response in this case).
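A minimal sketch of this policy in C (hypothetical names, not the Tempesta API):

```c
#include <stddef.h>

enum cli_mem_action {
	CLI_MEM_OK,		/* keep receiving */
	CLI_MEM_ZERO_WND,	/* soft limit reached: announce a zero window */
	CLI_MEM_DROP,		/* hard limit reached: drop the client connection */
};

/* Account bytes pinned for a client and pick the reaction described above. */
static enum cli_mem_action
cli_mem_account(size_t *mem_used, size_t bytes, size_t soft, size_t hard)
{
	*mem_used += bytes;

	if (*mem_used >= hard)
		return CLI_MEM_DROP;
	if (*mem_used >= soft)
		return CLI_MEM_ZERO_WND;
	return CLI_MEM_OK;
}
```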

A malicious client may send data byte by byte in streaming mode to overload a backend. This scenario must be addressed by the implementation, e.g. with a configurable minimum buffer size: only if an administrator explicitly allows 1-byte buffering (or similar) should such small stream chunks be passed through. The other option is DDoS QoS reduction in the sense of the automatic classification in #488.

Several implementation notes

These notes must be mirrored in the Wiki.

  1. A streamed message consumes a server connection and we cannot schedule other requests to that connection, so using small buffers isn't desirable.

  2. Streamed requests cannot be resent on server connection failures.

Tricky cases

From #1183:

> Streamed messages can be buffered. This happens when the receiver is not ready to receive a stream. E.g. the client requested two pipelined URIs: the first is very slow to complete, the second is fast, but it can be a full BD image. It's not possible to stream the BD image until the first request is responded to. We can't put the server connection on hold.

In general, our design must be as simple as possible. Say both requests go to different connections. Both of them can be streamed, or the first request may just require heavy operations on the server side while the 2nd request can be streamed immediately. As we receive server_msg_buffering bytes of response data, we link the data with TfwHttpResp (just as previously - received skbs are linked to the structure), the response is marked as incomplete and stays in the client seq_queue. Yes, the server connection is put on hold. If the server processing the first request is stuck, then the failovering process takes place, both requests are freed and both server connections must be reestablished. We also have https://github.com/tempesta-tech/tempesta/issues/710 addressing the problem of held connections.

> Response can appear at any time.

We need to forward the responses immediately to a client, just mimicking the server. Probably we also should close the connection. While the client request is not finished, TfwHttpReq should sit in the server forward queue, and we should forward new skbs of the request to the server connection and response skbs to the client connection. The existing TfwHttpReq and TfwHttpResp descriptors should be used for the buffered skb forwarding.

> Target connections to stream a message can disappear at any time, e.g. the client disconnected before a streamed message was received in full.

Dropping the connection or dropping the skbs can be acceptable here. It's worth checking how HAProxy, Nginx or Tengine behave in the same situations.

Related issues

This implementation must use the TCP receive buffer size to control how much data can be buffered (i.e. be in flight between the receive and send sockets). Meanwhile, #488 adjusts the TCP receive buffer size to change QoS, so this issue is the foundation for #488.

The appropriate testing issue: https://github.com/tempesta-tech/tempesta-test/issues/87

See branch https://github.com/tempesta-tech/tempesta/tree/ik-streaming-failed and discussions in https://github.com/tempesta-tech/tempesta/pull/1183


vankoven commented 6 years ago

I have a bunch of questions on the task:

krizhanovsky commented 6 years ago

> What is the right behaviour for streaming: can we unconditionally buffer message headers? Some headers affect how TempestaFW processes the message, e.g. the cache module and cache control headers.

Basically, yes, we can do some caching unconditionally. To protect against memory exhaustion attacks we provide the Frang limits. Let's discuss the details in the chat.

> I assume that only messages that fit the buffer size can be cached.

Not at all. The process is orthogonal to caching. For the cache we should behave the same way as now for skb chunks of data.

> Long polling is another question that bothers me. Imagine we have 32k connections to backend servers and 32k clients that use long polling. In this case the server connections are depleted and we can't serve new client connections.

Yes, good point. This is the subject of https://github.com/tempesta-tech/tempesta/issues/710. I'll add the note to that issue.

UPD. The important note from the chat discussion is that proxy_buffering should be implemented on top of the TCP socket receive buffer, since we can and should use tight integration with the TCP/IP stack. Also consider the related points in #391 and #488.

krizhanovsky commented 5 years ago

We have to postpone the issue until the Beta. Attempts made so far: https://github.com/tempesta-tech/tempesta/pull/1012, https://github.com/tempesta-tech/tempesta/pull/1067 and https://github.com/tempesta-tech/tempesta/pull/1183 (the reviews of the first two are summarized in this issue).

When the task is done, the commits for the temporary workaround #1184 from https://github.com/tempesta-tech/tempesta/pull/1192 must be reverted.

The current 1GB body limit must be replaced by switching between buffering and proxying modes: when a message body reaches the specified buffer size, Tempesta FW must switch to proxying mode for this message.
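A tiny sketch of such a per-message switch (hypothetical structure and names, only to illustrate the idea):

```c
#include <stdbool.h>
#include <stddef.h>

struct http_msg {
	size_t body_received;	/* body bytes buffered so far */
	bool   streamed;	/* true once we switched to proxying mode */
};

/* Switch to proxying once the buffered body exceeds the configured limit. */
static void
msg_account_body(struct http_msg *m, size_t chunk, size_t buf_limit)
{
	m->body_received += chunk;
	if (!m->streamed && m->body_received > buf_limit)
		m->streamed = true;	/* further skbs are forwarded immediately */
}
```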

b3b commented 1 year ago

Visualization of the difference in packet processing between Tempesta and Nginx. Backend -> Proxy packets are in red, Proxy -> Client are in green.

Tempesta

[packet timing visualization: Tempesta]

Nginx, default settings (buffering is on)

[packet timing visualization: Nginx with buffering]

Nginx, buffering is off

[packet timing visualization: Nginx without buffering]

b3b commented 1 year ago

Nginx log when the backend closes the connection in the middle of a transfer:

2022/10/10 16:15:12 [error] 387031#387031: *1 upstream prematurely closed connection while reading upstream,
client: 127.0.0.1, server: tempesta-tech.com, request: "GET /1 HTTP/1.1",
upstream: "http://127.0.0.1:8000/1", host: "127.0.0.1"

Client (curl) reaction:

> Host: 127.0.0.1
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx/1.18.0 (Ubuntu)
< Date: Mon, 10 Oct 2022 16:15:10 GMT
< Content-Type: text/html
< Content-Length: 1073741824
< Connection: keep-alive
< Last-Modified: Mon, 10 Oct 2022 16:15:00 GMT
< ETag: "63444504-40000000"
< X-Upstream-Id: 1
< Accept-Ranges: bytes
< 
{ [12016 bytes data]
 41 1024M   41  424M    0     0   446M      0  0:00:02 --:--:--  0:00:02  445M* transfer closed with 138765652 bytes remaining to read
 87 1024M   87  891M    0     0   480M      0  0:00:02  0:00:01  0:00:01  480M
* Closing connection 0
curl: (18) transfer closed with 138765652 bytes remaining to read
kingluo commented 5 months ago

It's better to refer to the Nginx implementation, which is very mature in production. Simply put, the only difference is that we have to implement the buffering at the sk/skb level.

References: https://www.getpagespeed.com/server-setup/nginx/tuning-proxy_buffer_size-in-nginx http://luajit.io/posts/openresty-lua-request-time/

kingluo commented 4 months ago

Let me give more clarification about the nginx proxy module.

  1. The buffers are shared by the upstream and the downstream, i.e. when a buffer has been written to the downstream successfully, it is reused in turn to receive data from the upstream.
  2. As I said in https://github.com/tempesta-tech/tempesta/issues/1902#issuecomment-2092107343, the cache preserves the original format of the response, i.e. Content-Length: 100 or Transfer-Encoding: chunked; it saves the whole response, so if the response is not fully received yet, the blocks are kept in buffers or temporary files (the slice module is an exception, of course). When the response is served from the cache next time, it is just replayed to the downstream as is.
  3. Tempesta saves the whole response in memory, which risks OOM. In comparison, Nginx will only hold bytes up to a maximum of proxy_buffer_size + proxy_buffers (kept in memory) + proxy_max_temp_file_size (kept in a file).
| Directive | Description |
| --- | --- |
| proxy_buffer_size | the size received in a single read. |
| proxy_buffers | the total size that can be used to receive from the upstream or send to the downstream, besides proxy_buffer_size. |
| proxy_busy_buffers_size | the size used to balance sending response blocks to the downstream. |

[diagram: nginx proxy_pass processing steps]

Let me explain some critical steps in the diagram above.

If proxy_buffering is on:

  1. In step 20, nginx tries to read the response into the buffers specified by proxy_buffer_size and proxy_buffers; if the buffers are full, it writes the response to temporary files; if the files are full as well (determined by proxy_max_temp_file_size), it stops reading the upstream.
  2. In step 23, if the downstream is writable, nginx writes response blocks of proxy_busy_buffers_size to the downstream. If this succeeds and the upstream was blocked because of no free buffers, it turns back to reading the upstream.

If proxy_buffering is off, then see steps 25 and 27: these steps forward the response bytes with a maximum size of proxy_buffer_size. Note that in this mode the cache is disabled.
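For reference, the nginx directives mentioned above are set like this (illustrative values only, matching the 127.0.0.1:8000 upstream from the log earlier in the thread):

```nginx
location / {
    proxy_pass http://127.0.0.1:8000;
    proxy_buffering on;
    proxy_buffer_size 4k;
    proxy_buffers 8 4k;
    proxy_busy_buffers_size 8k;
    proxy_max_temp_file_size 1024m;
}
```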

krizhanovsky commented 3 months ago

A relevant discussion: https://github.com/tempesta-tech/linux-5.10.35-tfw/pull/17/files#r1637865281, and the subject of that discussion, #2108, which I think should be fixed in this issue: we should push for transmission no more (or not significantly more) data to a client connection or h2 stream than the connection or stream window allows. Since we proxy data, we should backpressure upstream connections and limit the most aggressive clients with #488.

krizhanovsky commented 2 months ago

sysctl_tcp_auto_rbuf_rtt_thresh_us automatically adjusts the TCP receive buffer per socket depending on the current conditions. This probably mostly affects #488.