
TDBv0.2: Cache background revalidation and eviction #515

Open krizhanovsky opened 8 years ago

krizhanovsky commented 8 years ago

Depends on https://github.com/tempesta-tech/tempesta/issues/1869

Scope

The tfw_cache_mgr thread must traverse the web cache and evict stale records on memory pressure, or revalidate them otherwise. The thread must be accurately scheduled and throttled so as not to impact system performance, while still freeing the required memory efficiently. #500 must be kept in mind as well.
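A minimal sketch, in plain user-space C with hypothetical helper names (the real implementation would be a kernel thread working on the TDB index), of what such a budgeted, throttled scan could look like:

```c
/* Hypothetical throttled cache scan, for illustration only. */
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

#define SCAN_BATCH 64	/* records inspected per wakeup */

/* Stubs standing in for the real cache index and VM pressure check. */
static bool cache_next_record(size_t *cursor) { return ++*cursor < 1024; }
static bool memory_pressure(void) { return false; }
static void evict_or_revalidate(size_t rec, bool pressure) { (void)rec; (void)pressure; }

static void cache_mgr_loop(void)
{
	size_t cursor = 0;

	for (;;) {
		/* Bounded batch: the scan can never monopolize a CPU. */
		for (int i = 0; i < SCAN_BATCH; i++) {
			if (!cache_next_record(&cursor))
				return;
			evict_or_revalidate(cursor, memory_pressure());
		}
		/* Yield between batches; sleep less under memory pressure
		 * so eviction frees memory quickly when it is needed. */
		usleep(memory_pressure() ? 1000 : 100000);
	}
}

int main(void) { cache_mgr_loop(); return 0; }
```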

Validation logic is defined by RFC 7234 §4.3 and requires the implementation of conditional requests.
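For reference, a revalidation round trip with a conditional request looks like this on the wire (standard RFC 7232 mechanics; the ETag value here is made up):

```
GET /research/web_acceleration_mechanics.pdf HTTP/1.1
Host: example.com
If-None-Match: "5e8f2c-1a2b3c"

HTTP/1.1 304 Not Modified
ETag: "5e8f2c-1a2b3c"
Cache-Control: max-age=3600
```

A 304 lets the cache extend the entry's freshness without re-transferring the body; a full 200 response simply replaces the entry.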

Keep in mind the DoS attack from #520. The following items, linked with #516 (TDB v0.3), must be implemented:

This task is required to fix #803.

UPD. Since filtering (#731) and QoS (#488) also require eviction, this work should be done in the tdb_mgr thread instead.

UPD. TDB was designed to provide access to stored data in a zero-copy fashion, such that a cached response body can be sent directly to a socket. This property imposed several design limitations and introduced many difficulties. However, with TLS we always have to copy data, so the TDB design can be significantly simplified by copying. Hence, this depends on #634.

Cache eviction

While CART is a well-known, good adaptive replacement algorithm, there are a number of caching algorithms based on machine learning which provide a much better cache hit ratio. See, for example, the survey and Cacheus. Some of these algorithms require access to columnar storage for statistics (a common practice in CDNs).

At least some interface for a user-space algorithm is required. Probably just CART with some weights, where the weights are loaded from user space into the kernel, would be enough.
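A minimal sketch (user-space C; every structure, field, and weight below is a made-up assumption, not a TDB interface) of how a user-loaded weight vector could turn per-entry statistics into an eviction score:

```c
/* Hypothetical weighted eviction scoring, for illustration only. */
#include <stdint.h>
#include <stdio.h>

struct cache_entry_stats {
	uint64_t hits;		/* access frequency */
	uint64_t last_access;	/* recency, e.g. in clock ticks */
	uint64_t size;		/* bytes occupied in the cache */
};

/* Weights loaded from user space (e.g. via sysctl or netlink). */
struct eviction_weights {
	double w_freq;
	double w_recency;
	double w_size;
};

/* Lower score => evict first. */
static double eviction_score(const struct cache_entry_stats *s,
			     const struct eviction_weights *w,
			     uint64_t now)
{
	double age = (double)(now - s->last_access);

	return w->w_freq * (double)s->hits
	       - w->w_recency * age
	       - w->w_size * (double)s->size;
}

int main(void)
{
	struct eviction_weights w = { .w_freq = 1.0, .w_recency = 0.01,
				      .w_size = 1e-6 };
	struct cache_entry_stats hot = { .hits = 500, .last_access = 990,
					 .size = 4096 };
	struct cache_entry_stats cold = { .hits = 3, .last_access = 100,
					  .size = 1 << 20 };
	uint64_t now = 1000;

	/* The cold, large entry gets the lower score and goes first. */
	printf("hot: %.2f cold: %.2f\n",
	       eviction_score(&hot, &w, now),
	       eviction_score(&cold, &w, now));
	return 0;
}
```

Here the weight vector is the only thing user space needs to push into the kernel; the statistics stay kernel-side, which keeps the interface small.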

The cache must implement per-vhost eviction strategies and space quotas to provide caching QoS for CDN use cases. Probably 2-layer quotas are required to contain the impact of poor configuration, e.g. a bad Vary specification on the application side, which may consume too much space (linked with #733). Different eviction strategies are required to handle, e.g., chunks of live streams (huge data volume, immediately remove outdated chunks) and rarely updated web content like CSS (may serve stale entries).
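Purely to illustrate the 2-layer idea, a hypothetical configuration sketch (none of these directives exist in Tempesta today; all names are made up):

```
# Hypothetical syntax, for illustration only.
vhost static.example.com {
	cache_quota      2GB;          # layer 1: total space cap for the vhost
	cache_key_quota  64MB;         # layer 2: cap on all Vary variants of one URI
	cache_evict      serve_stale;  # CSS-like content: stale entries are OK
}
vhost live.example.com {
	cache_quota      8GB;
	cache_evict      drop_stale;   # live-stream chunks: purge the moment they expire
}
```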

It must be possible to 'lock' some records in evictable data sets (see #858 and #471).

Purging

Once this feature is implemented, we should be able to update site content normally, without a Tempesta restart or memory leaks. It's hard to track which new pages appeared and which were deleted during a site content update, so in this task we need:

  1. full web content purging;
  2. regular expression purging, e.g. /foo/*.php or /foo/bar/* (see the wildcard-matching sketch after this list);
  3. ~~immediate (purge in the original #501) strategy for the purging (we still need the mode that leaves stale responses in the cache for #522);~~ Done in #2074.
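A toy sketch (plain user-space C, not Tempesta code) of how such wildcard purge patterns could be matched against cached URI keys, where `*` matches any run of characters:

```c
/* Toy '*' wildcard matcher for purge patterns, for illustration only. */
#include <stdbool.h>
#include <stdio.h>

static bool glob_match(const char *pat, const char *uri)
{
	/* Iterative matcher with backtracking to the last '*'. */
	const char *star = NULL, *resume = NULL;

	while (*uri) {
		if (*pat == *uri) {
			pat++, uri++;
		} else if (*pat == '*') {
			star = pat++;	/* remember the star... */
			resume = uri;	/* ...and where to retry from */
		} else if (star) {
			pat = star + 1;	/* backtrack: let '*' eat one more char */
			uri = ++resume;
		} else {
			return false;
		}
	}
	while (*pat == '*')
		pat++;
	return !*pat;
}

int main(void)
{
	printf("%d\n", glob_match("/foo/*.php", "/foo/index.php")); /* 1 */
	printf("%d\n", glob_match("/foo/bar/*", "/foo/bar/a/b"));   /* 1 */
	printf("%d\n", glob_match("/foo/*.php", "/foo/a.html"));    /* 0 */
	return 0;
}
```

In practice the pattern would be applied to keys during the background scan sketched above; the matching itself is cheap.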

Documentation

Need to update the https://github.com/tempesta-tech/tempesta/wiki/Caching-Responses#manual-cache-purging wiki page.

Testing

krizhanovsky commented 3 years ago

It seems there is some race in the lock-free index, or we actually hit the https://github.com/tempesta-tech/tempesta/issues/500 problem in the scenario from #1435: multiple parallel requests to a large file

./wrk -d 3600 -c 16000 -t 8 -H 'connection: close' https://debian:443/research/web_acceleration_mechanics.pdf

combined with Tempesta restarts in the VM

# while :; do ./scripts/tempesta.sh --restart; sleep 30; done

sometimes produces warnings like

[ 1103.775556] [tdb] ERROR: out of free space
[ 1103.810415] [tdb] ERROR: out of free space
[ 1103.845177] [tdb] ERROR: Cannot allocate cache entry for key=0x37bed983985f3ea7
[ 1103.929897] [tdb] ERROR: out of free space
[ 1103.949002] [tdb] ERROR: Cannot allocate cache entry for key=0x37bed983985f3ea7
[ 1103.984315] [tdb] ERROR: out of free space
[ 1104.010543] [tdb] ERROR: Cannot allocate cache entry for key=0x37bed983985f3ea7
[ 1104.070816] [tdb] ERROR: out of free space
[ 1104.080997] [tdb] ERROR: Cannot allocate cache entry for key=0x37bed983985f3ea7
[ 1104.151540] [tdb] ERROR: out of free space
[ 1104.158845] [tdb] ERROR: Cannot allocate cache entry for key=0x37bed983985f3ea7
[ 1104.199489] [tdb] ERROR: out of free space
[ 1104.231891] [tdb] ERROR: Cannot allocate cache entry for key=0x37bed983985f3ea7
....
krizhanovsky commented 2 years ago

The task must be split. After #788, the most crucial part is removing cache entries for #522, plus some basic eviction to get the cache usable, i.e. to get rid of the memory leak.

const-t commented 1 year ago

I've run a few rough HTTP/2 benchmarks with caching enabled.

h2load -c700 -m100 --duration=30 -t2 https://debian

Tempesta

1kb response

finished in 30.14s, 337279.80 req/s, 393.06MB/s
requests: 10118394 total, 10188394 started, 10118394 done, 10118394 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 10118394 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 11.52GB (12364696856) total, 1.70GB (1821310920) headers (space savings 23.08%), 9.65GB (10361235456) data
                     min         max         mean         sd        +/- sd
time for request:      391us    404.11ms     69.33ms     52.31ms    64.69%
time for connect:    70.24ms    229.04ms    169.16ms     56.50ms    61.71%
time to 1st byte:   195.61ms    323.51ms    252.20ms     27.06ms    79.96%
req/s           :       0.00     4462.36      803.41      771.99    59.29%

5kb response

finished in 30.23s, 229514.40 req/s, 1.14GB/s
requests: 6885532 total, 6955433 started, 6885532 done, 6885432 succeeded, 100 failed, 100 errored, 0 timeout
status codes: 6885469 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 34.16GB (36684160200) total, 1.16GB (1244661614) headers (space savings 23.00%), 32.83GB (35253572326) data
                     min         max         mean         sd        +/- sd
time for request:    17.12ms    698.47ms    103.21ms     39.88ms    90.88%
time for connect:    73.25ms    237.29ms    165.14ms     56.21ms    69.57%
time to 1st byte:   210.69ms    299.74ms    253.76ms     25.23ms    58.53%
req/s           :       0.00      603.40      366.27      247.73    69.86%

128kb response

finished in 30.36s, 17200.80 req/s, 2.11GB/s
requests: 516024 total, 586024 started, 516024 done, 516024 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 516273 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 63.24GB (67904607399) total, 90.50MB (94901121) headers (space savings 22.71%), 63.01GB (67651755146) data
                     min         max         mean         sd        +/- sd
time for request:    47.50ms      18.31s    998.10ms       1.12s    95.44%
time for connect:    70.58ms    254.74ms    159.74ms     56.57ms    68.43%
time to 1st byte:   203.41ms    474.57ms    360.97ms     78.33ms    58.21%
req/s           :       0.00      181.65       31.60       47.24    77.14%

128kb response with HTTP/1

finished in 30.37s, 21665.00 req/s, 2.65GB/s
requests: 649950 total, 719750 started, 649950 done, 649950 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 650181 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 79.52GB (85388074799) total, 142.43MB (149350032) headers (space savings 0.00%), 79.95GB (85844417328) data
                     min         max         mean         sd        +/- sd
time for request:    27.77ms       2.64s    510.89ms    293.07ms    85.45%
time for connect:    76.97ms    210.16ms    152.70ms     47.93ms    69.34%
time to 1st byte:   187.62ms    302.22ms    253.48ms     39.83ms    54.62%
req/s           :       0.00      336.64       48.35       78.75    82.86%

Nginx (nginx/1.23.3)

1kb response

finished in 30.15s, 135510.73 req/s, 150.56MB/s
requests: 4065322 total, 4135322 started, 4065322 done, 4065322 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 4065322 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 4.41GB (4736134430) total, 476.87MB (500034606) headers (space savings 33.15%), 3.88GB (4162889728) data
                     min         max         mean         sd        +/- sd
time for request:     1.45ms       1.54s    530.87ms    307.86ms    70.73%
time for connect:    15.54ms    374.44ms    123.50ms     85.68ms    77.57%
time to 1st byte:   179.61ms    909.80ms    359.37ms    165.22ms    86.00%
req/s           :     109.97      366.27      193.44       80.16    71.71%

5kb response

finished in 30.16s, 168594.90 req/s, 846.10MB/s
requests: 5057847 total, 5127847 started, 5057847 done, 5057847 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 5065270 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 24.79GB (26616104602) total, 599.00MB (628093480) headers (space savings 32.97%), 24.12GB (25896832020) data
                     min         max         mean         sd        +/- sd
time for request:      359us       5.39s    432.35ms    460.44ms    87.07%
time for connect:    22.18ms    265.32ms    123.70ms     63.49ms    57.29%
time to 1st byte:   219.39ms       2.17s    803.55ms    511.62ms    59.57%
req/s           :      55.85      558.71      240.58      163.94    72.29%

128kb response

finished in 30.27s, 16222.27 req/s, 2.05GB/s
requests: 486668 total, 556668 started, 486668 done, 486668 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 548023 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 61.56GB (66099265904) total, 65.85MB (69050898) headers (space savings 32.98%), 61.42GB (65952787645) data
                     min         max         mean         sd        +/- sd
time for request:    21.49ms      29.62s       3.73s       3.07s    71.63%
time for connect:    23.21ms    310.06ms    147.42ms     71.60ms    57.86%
time to 1st byte:   247.08ms       1.68s    754.43ms    418.40ms    52.57%
req/s           :       3.10      175.05       23.13       21.80    88.00%

FYI: sometimes h2load freezes at the end of benchmarking Tempesta. It looks like Tempesta holds the connection open.

krizhanovsky commented 2 months ago

With the latest discussion https://github.com/tempesta-tech/tempesta-test/pull/602/files#r1622305438 and our website purging issue https://github.com/tempesta-tech/tempesta-tech.com/issues/64, it could make sense to make the eviction thread also send conditional requests for particular resources (typically defined as dynamic, e.g. wiki or blog posts in our case).

This causes extra overhead for both the upstream and Tempesta servers and introduces delays. It's much worse than cache purge plugins, but it would solve our problem, and maybe similar problems for others. TBD: this solves the problem in a less-than-ideal way and requires development effort...