tempesta-tech / tempesta

All-in-one solution for high performance web content delivery and advanced protection against DDoS and web attacks
https://tempesta-tech.com/
GNU General Public License v2.0

Optimizer for HTTP messages adjustment #1103

Open krizhanovsky opened 5 years ago

krizhanovsky commented 5 years ago

Tightly linked with #634, which is required for #515.

Current logic

Currently we adjust HTTP messages as follows:

static int
tfw_http_adjust_req(TfwHttpReq *req)
{
        TfwHttpMsg *hm = (TfwHttpMsg *)req;
        int r;

        /* Each call below may fragment the underlying skbs on its own. */
        r = tfw_http_sess_req_process(req);
        if (r)
                return r;
        r = tfw_http_add_x_forwarded_for(hm);
        if (r)
                return r;
        r = tfw_http_add_hdr_via(hm);
        if (r)
                return r;
        r = tfw_http_msg_del_hbh_hdrs(hm);
        if (r)
                return r;
        r = tfw_http_set_loc_hdrs(hm, req);
        if (r)
                return r;

        return tfw_http_set_hdr_connection(hm, TFW_HTTP_F_CONN_KA);
}

This is a bunch of adjustment calls, and each of the calls (1) performs relatively expensive skb fragmentation and (2) generates an independent set of new skb frags. So by the end of the function there are many skb fragments.

For example, consider the following HTTP request:

GET / HTTP/1.1\r\n
Host: example.com\r\n
X-Forwarded-For: 1.1.1.1\r\n
Connection: keep-alive\r\n
Accept: */*\r\n
Keep-Alive: timeout=600, max=65526\r\n
Cookie: foo\r\n
\r\n

And we need to:

  1. add 2.2.2.2 to X-Forwarded-For;
  2. rewrite Connection to close;
  3. add Via: 1.1 foo.com (Tempesta FW);
  4. remove the Keep-Alive header.

Currently we perform the following fragmentation steps:

  1. generate a new fragment for 2.2.2.2 and split the request into 2 frags - we have 3 frags now;
  2. close is shorter than keep-alive, so the current last fragment is split in two around the gap - 4 frags now;
  3. Via is inserted before the trailing CRLF - one more split and one more new frag - we come to 6 frags;
  4. removal of the Keep-Alive header produces one more gap, so we finish with 7 frags for a very small HTTP request.
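
For illustration, the final fragment layout for this example could look like the following (split points are approximate and depend on the actual code paths):

frag 0: "GET / HTTP/1.1\r\nHost: example.com\r\nX-Forwarded-For: 1.1.1.1"
frag 1: ", 2.2.2.2"                             <- new (step 1)
frag 2: "\r\nConnection: close"                 <- split + in-place rewrite (step 2)
frag 3: "\r\nAccept: */*\r\n"                   <- split around the Keep-Alive gap (step 4)
frag 4: "Cookie: foo\r\n"                       <- split (steps 3-4)
frag 5: "Via: 1.1 foo.com (Tempesta FW)\r\n"    <- new (step 3)
frag 6: "\r\n"                                  <- trailing CRLF kept in place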

We do not move the trailing CRLF, and we do all this fragmentation work to keep the data pointers in the HTTP message representation, TfwHttpMsg, valid.

Preliminary logic

Actually, we already have some rudiments of this logic: in particular, all user-specified header modifications are collected in TfwHdrMods and processed at once in functions like tfw_h2_resp_adjust_fwd() and tfw_h2_resp_next_hdr().

Proposed design

The goal is to make only 1 memory (skb fragment) allocation in the worst case and to use the current skb overhead in most cases (see the Testing section).

The adjustment logic must be split into 3 phases:

  1. identify and collect metadata about all the adjustments - how much data we have to remove, add or modify, and at which places;

  2. run an optimizer - a separate and self-sufficient function which, from the metadata above, generates a vector of data pointers and lengths to add or remove, together with pointers to a chain of the adjustments (see the sketch below). Some of this logic is already done in tfw_http_msg_hdr_xfrm_str(), so the work should affect that function;

  3. use the vector for the final data placement in the underlying HTTP message.
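
A minimal sketch of the metadata the optimizer could operate on (all names here are hypothetical, not existing Tempesta FW types):

typedef struct {
        char            *ptr;   /* place in the message to edit */
        long            len;    /* > 0: bytes to insert, < 0: bytes to remove */
        const char      *data;  /* new data for insertions, NULL for removals */
} TfwMsgEdit;

typedef struct {
        TfwMsgEdit      edits[16];      /* collected during phase 1 */
        unsigned int    n;
        size_t          add_sz;         /* total bytes to insert */
        size_t          gap_sz;         /* total bytes to remove */
} TfwMsgEditVec;

Phase 2 would coalesce adjacent edits in this vector and phase 3 would apply them to the skb data in a single pass.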

This is a generalized and more powerful version of what we currently do in tfw_http_msg_del_hbh_hdrs(), which leaves the Connection header in place for a later rewrite.

The optimizer must coalesce all the data gaps into gap_sz, ignoring small runs (less than ZCP_MIN_DSZ) of data between them. ZCP_MIN_DSZ is a define; I'd set it to something like 4 or 8 cache lines (please run a benchmark and find the best number). The optimizer must also consider the available room in the skb linear data, skb_room, and the size of data to insert, add_sz. If gap_sz + add_sz <= skb_room && gap_sz + add_sz < ZCP_MIN_DSZ, then just completely rewrite that part of the message. Use memmove() and place the new data at the end of the message instead of copying it into an auxiliary buffer. In this example it makes sense to just rewrite the whole data after ...1.1.1.1: use memmove() to replace the current headers if we have enough room, or allocate 1 fragment and copy Cookie with the CRLFs to the new fragment.
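
In code, the rewrite-vs-fragment decision could boil down to a predicate like this (a sketch; the actual ZCP_MIN_DSZ value must come from the benchmarks mentioned above):

#include <linux/cache.h>
#include <linux/types.h>

#define ZCP_MIN_DSZ     (4 * L1_CACHE_BYTES)    /* tune with benchmarks */

/*
 * True if all the gaps plus all the insertions fit into the available
 * linear room and are small enough that a plain memmove() beats
 * fragmentation.
 */
static inline bool
tfw_msg_rewrite_in_place(size_t gap_sz, size_t add_sz, size_t skb_room)
{
        return gap_sz + add_sz <= skb_room
               && gap_sz + add_sz < ZCP_MIN_DSZ;
}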

Obviously, if we have a long Cookie in the example above, we can still insert all the new headers before the Cookie and split the message, so we'll have 2 frags here.

The good news is that we call tfw_http_adjust_{req,resp}() just before the message forwarding, so we don't have to fix up TfwHttpMsg's headers table.

HTTP/2

The issue becomes even more crucial with HTTP/2: since HTTP/2 headers are HPACK-encoded, an HTTP message is smaller and ends up with more, and smaller, fragments, introducing a larger relative overhead than we have for HTTP/1.1.

Moreover, HTTP/1.1 <-> HTTP/2 transformations essentially require rewriting each HTTP field, so for the message transformation it basically makes sense to just fully rebuild all the headers. In this case we need the size of the required allocation from the optimizer function.

The only exception where we may still want to use skb fragmentation is overly large header values, e.g. a big Cookie value or URI. In this case we also make only one memory allocation and place the data before and after the big header(s) in separate fragments.

Probably, the skb fragmentation code (which is quite overcomplicated) isn't needed anymore and can simply be removed. This needs review.

Also see https://github.com/tempesta-tech/tempesta/pull/1368#discussion_r349317997 for tfw_h2_resp_next_hdr() and TfwHdrMods:

...we can extend TfwHdrMods by dynamically decided headers along with always-modified
headers (e.g. Via) and execute the optimization logic on top of the data structure.

This leads us to a typically not-so-small h_mods, but it's probably fine to check the whole
list for each processed header - that's probably cheaper than adding an additional map to
remember which headers we have already processed, but please consider the option.
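
A whole-list scan per processed header could look like this sketch (simplified, illustrative stand-ins for the TfwHdrMods machinery, not the real definitions):

#include <stddef.h>
#include <strings.h>

typedef struct {
        const char      *name;          /* header name to match */
        const char      *new_val;       /* replacement value, NULL to delete */
} HdrModDesc;

typedef struct {
        HdrModDesc      *descs;
        unsigned int    n;
} HdrMods;

/* O(n) lookup executed for every processed header. */
static const HdrModDesc *
hdr_mods_lookup(const HdrMods *hm, const char *hdr_name)
{
        unsigned int i;

        for (i = 0; i < hm->n; ++i)
                if (!strcasecmp(hm->descs[i].name, hdr_name))
                        return &hm->descs[i];
        return NULL;
}

The alternative mentioned above is a per-message map (or bitmap) of already applied modifications; whether it pays off depends on the typical h_mods size.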

Actually, it's fine to implement the task for HTTP/2 only, as that is the main, performance-focused functionality.

Testing

The logic is in the core and is involved in almost any proxy test case, so there is no need for specific tests. However, performance evaluation of the current and optimized code is required, as well as performance measurements for the defined parameters.

Choosing the right value for ZCP_MIN_DSZ is crucial for the task. It could happen that it makes sense to copy up to 1KB of data or more. So please prove the Copy vs Zero-copy concept with good benchmarks on a bare-metal installation and in a VM.
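
A trivial userspace starting point for the Copy side of that comparison (the real measurement must run in the kernel path, on the target hardware and in a VM; compile with -O0 or add a barrier so the copies aren't optimized out):

#include <stdio.h>
#include <string.h>
#include <time.h>

int
main(void)
{
        static char src[4096], dst[4096];
        struct timespec a, b;
        size_t sz;
        int i;

        for (sz = 64; sz <= 4096; sz *= 2) {
                clock_gettime(CLOCK_MONOTONIC, &a);
                for (i = 0; i < 1000000; ++i)
                        memcpy(dst, src, sz);
                clock_gettime(CLOCK_MONOTONIC, &b);
                printf("%4zu bytes: %.1f ns/copy\n", sz,
                       ((b.tv_sec - a.tv_sec) * 1e9
                        + (b.tv_nsec - a.tv_nsec)) / 1e6);
        }
        return 0;
}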

The next parameter to evaluate is the current skb allocation overhead. It may happen that with an enlarged default overhead we can avoid additional skb fragment allocations.

Also, please evaluate the opportunity to reserve more room in the skb head, so that we can place more data, e.g. before the Cookie header, without fragmentation, by just moving the network headers and the request line.
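
For example, at skb allocation time (a sketch; TFW_EXTRA_HDR_ROOM and the function name are hypothetical, while alloc_skb()/skb_reserve() are the standard kernel helpers):

#include <linux/skbuff.h>
#include <net/tcp.h>

#define TFW_EXTRA_HDR_ROOM      256     /* hypothetical, to be benchmarked */

static struct sk_buff *
tfw_alloc_skb_with_hdr_room(unsigned int len)
{
        struct sk_buff *skb;

        skb = alloc_skb(MAX_TCP_HEADER + TFW_EXTRA_HDR_ROOM + len,
                        GFP_ATOMIC);
        if (!skb)
                return NULL;
        /*
         * Later header insertions can memmove() the network headers and
         * the request line into this reserved space instead of
         * fragmenting the skb.
         */
        skb_reserve(skb, MAX_TCP_HEADER + TFW_EXTRA_HDR_ROOM);
        return skb;
}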

krizhanovsky commented 4 years ago

At the moment skb fragmentation looks quite expensive with all the free space lookups, new skb allocations and fragment readjusting, so we need to:

  1. benchmark HTTP/2 <-> HTTP/1.1 proxying with HTTP header rewriting and intensive HTTP/2 fragmentation (use a small SETTINGS_MAX_FRAME_SIZE on a client);
  2. analyze pfl_misses and add more similar performance counters to track the low-level network performance;
  3. figure out optimization approaches, e.g. use a scatterlist instead of skbs, and/or extend skbs to handle more than 16 fragments, at least before tcp_write_xmit().

krizhanovsky commented 1 year ago

In most cases headers are small, so doing the complex logic, especially with the proposed optimizer, is overkill and will only slow things down further. Instead we should just:

  1. add a headers size field to HttpMsg and update it in the parser;
  2. allocate a buffer (probably using the pg_skb_alloc() allocator) sized as the next power of 2 of the size from (1), maybe with the addition of the new headers if their size is known; we should also allocate enough space for HTTP/2 framing headers. It seems the buffer should be pinned to HttpMsg for #498: the buffer can be reused between the transfers;
  3. just copy the headers, probably HPACK-encoded in the case of HTTP/2, into the buffer (see the sizing sketch below).
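
A sizing sketch for step 2 (hdrs_sz is the field from step 1; new_hdrs_sz, n_frames and the function name are illustrative; roundup_pow_of_two() is the kernel helper from <linux/log2.h>):

#include <linux/log2.h>

#define H2_FRAME_HDR_SZ 9       /* fixed HTTP/2 frame header size */

/* Next power of 2 covering the parsed headers, the headers to add and
 * the HTTP/2 framing overhead. */
static size_t
tfw_hdr_buf_sz(size_t hdrs_sz, size_t new_hdrs_sz, unsigned int n_frames)
{
        return roundup_pow_of_two(hdrs_sz + new_hdrs_sz
                                  + n_frames * H2_FRAME_HDR_SZ);
}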

As an optimization, we can still use the zero-copy technique if the headers don't fit into a page and there is a header larger than a page.

krizhanovsky commented 1 year ago

It seems we cannot fully avoid the "optimizer" and just use "the next power of 2 for the size from (1)" as in the previous comment: we can delete headers, add headers per vhost and even use variables (e.g. for the URI), and we plan to extend the header operation logic in HTTPtables - all in all, there are a lot of variables. On the other hand, if we, say, have a message with 510 bytes of headers, then the next power of 2 apparently fails us: 510 rounds up to just 512, leaving almost no room for added headers.

Another solution, at least for HTTP/2, could be to:

  1. rewrite the headers in the hope that we have enough space; fall back to the current fragmentation scheme if there is a header larger than a page (the maximum header size must be recorded by the parser);
  2. allocate a page and write the rest of the headers to the page; repeat this step if one page isn't enough;
  3. if we go with fragments in step 1, then the page should be used for all the fragments;
  4. account for, say, 80% of the cases of required extra size and do not try to reuse the rest of the page if it's smaller than that number;
  5. on the next message just reuse the page and take an additional reference on it (see the sketch below).
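
A sketch of steps 2 and 5 (the bookkeeping structure and function are hypothetical; get_page()/alloc_page()/page_address() are standard kernel APIs; need is assumed to be <= PAGE_SIZE):

#include <linux/gfp.h>
#include <linux/mm.h>

/* Hypothetical per-connection bookkeeping for the reusable header page. */
struct hdr_page_state {
        struct page     *hdr_page;
        unsigned int    hdr_off;
};

static void *
tfw_hdr_buf_get(struct hdr_page_state *st, size_t need)
{
        /* Step 5: reuse the current page if the remainder is big enough. */
        if (st->hdr_page && PAGE_SIZE - st->hdr_off >= need) {
                get_page(st->hdr_page); /* extra ref, dropped on TX */
                st->hdr_off += need;
                return page_address(st->hdr_page) + st->hdr_off - need;
        }
        /* Step 2: otherwise start a fresh page. */
        st->hdr_page = alloc_page(GFP_ATOMIC);
        if (!st->hdr_page)
                return NULL;
        st->hdr_off = need;
        return page_address(st->hdr_page);
}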

BTW, can we just use the TfwPool associated with the modified HTTP message and get() its pages, so that the pool can be freed before an skb with the pages is actually sent to the NIC?

krizhanovsky commented 1 year ago

Just collected PFL statistics on our website installation (HTTPS):

SS pfl hits                             : 8745
SS pfl misses                           : 169902

It turns out that we should remove the optimization since it doesn't work in most cases.

const-t commented 1 year ago

I suggest the following strategy for adding headers to HTTP/2 messages:

  1. Response without a body: we allocate a new fragment from TfwPool and add it to the end. Since we use TfwPool, we do not need to know the size of the additional headers in advance - we realloc while encoding the headers (see the sketch below).
  2. Response with a body: we encode the headers in place; when we reach the body, we split the current fragment and add after it a new fragment allocated from TfwPool. The first DATA frame will be added to the end of this fragment.
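
For case 1 the encoding loop would grow the buffer on demand, roughly like this (a sketch; the tfw_pool_realloc() signature is assumed from Tempesta's pool API, and enc_off/hdr_sz/buf_sz are illustrative):

/* Grow the TfwPool-backed header buffer while HPACK-encoding. */
if (enc_off + hdr_sz > buf_sz) {
        buf = tfw_pool_realloc(resp->pool, buf, buf_sz, buf_sz * 2);
        if (!buf)
                return -ENOMEM;
        buf_sz *= 2;
}
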
krizhanovsky commented 1 year ago

Minutes for the #1820 benchmark:

const-t commented 1 year ago

#1820 introduced a new approach for HTTP/1.1 <-> HTTP/2 transformations:

  1. Detach all skbs that contain headers from skb_head.
  2. HPACK-encode all headers into newly allocated skbs; paged fragments for these skbs are allocated from TfwPool.

This approach solves the skb fragmentation problem: headers are now placed into contiguous memory blocks, so we avoid small fragments. It also solves the problem of a response too small to be encoded in place, because we no longer encode headers in place.

HTTP/1.1 message adjustment remains in its previous state and requires some tests to determine whether optimizations are needed.

const-t commented 1 year ago

One more subject of this issue: while adding headers to an HTTP/1 request or response, the skb may become heavily fragmented, which leads to moving fragments from the current skb to the next one. However, if the data contained in these fragments is referenced by a TfwStr, the skb field of that TfwStr becomes invalid.