junkurihara / rust-rpxy

A simple and ultrafast http reverse proxy serving multiple domain names and terminating TLS for http/1.1, 2 and 3, written in Rust
https://rpxy.io/
MIT License
317 stars · 36 forks

TODO: persistent disk cache #204

Open · nisbet-hubbard opened this issue 1 month ago

nisbet-hubbard commented 1 month ago

A persistent disk cache appears to be one of the remaining gaps to fill before rpxy reaches baseline feature parity with traditional reverse proxies such as Nginx, Apache and Caddy. Since it’s already on your TODO list, I figure it might not be out of place to have a thread tracking this.

Would the cacache crate fit the bill here?

https://github.com/zkat/cacache-rs

Gamerboy59 commented 1 month ago

If implemented, this should be placed behind a big warning about performance compared to the in-memory cache.
I'm not against implementing it, but it shouldn't be a prominent feature: rpxy is aimed at being fast, and disk I/O operations act like a brake.

nisbet-hubbard commented 1 month ago

With sendfile(), a disk cache can be faster than an in-memory cache.

openssl has SSL_sendfile. For rustls, see https://github.com/rustls/ktls

By all means document any trade-offs (e.g. slower warmup on reboot in exchange for faster cache serving), but benchmark it first.

And then make sure nginx is configured with kTLS in the benchmark too.

junkurihara commented 1 month ago

Hey, I believe the "persistence" problem should be handled separately from the use of kTLS.

Yes, rustls using kernel TLS would be faster than vanilla rustls, and SSL_sendfile could also be a silver bullet for quick I/O operations.

But but but! How should we manage the mapping between requested URLs and the actual files? If we consider the persistence of files, we may need to manage it with a database or other data structures. Also, if persistent caches are to be updated dynamically (insertion of files), the problem becomes more complex.

Currently, I have taken a hybrid caching approach with in-memory and disk caches, both of which are ephemeral (deleted once rpxy restarts). So the mapping management is really simple and fast: just a kind of hash table managed directly by rpxy itself. (For this disk cache, we might be able to use rustls with kTLS? Not sure at this point.)
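To make the contrast concrete, the ephemeral hybrid index might be pictured like this. This is a hypothetical sketch: the names, the size threshold and the temp-file scheme are all made up for illustration, not rpxy's actual code:

```rust
use std::collections::HashMap;
use std::path::PathBuf;

// Hypothetical sketch of an ephemeral hybrid cache index: one
// in-process map from cache key to either an in-memory body or a
// path to a temporary file. Because the map lives only in the
// process, a restart simply drops it, so no external database is
// needed to keep URL-to-file mappings in sync.
enum CachedObject {
    Memory(Vec<u8>),
    Disk(PathBuf),
}

struct HybridCache {
    index: HashMap<String, CachedObject>,
    memory_limit: usize, // bodies above this size go to disk
}

impl HybridCache {
    fn insert(&mut self, key: String, body: Vec<u8>) {
        let entry = if body.len() <= self.memory_limit {
            CachedObject::Memory(body)
        } else {
            // Illustrative only: a real implementation would write
            // the body to this file before recording the path.
            CachedObject::Disk(PathBuf::from(format!("/tmp/rpxy-cache/{key}")))
        };
        self.index.insert(key, entry);
    }

    // Reports whether a cached entry lives in memory (vs. on disk).
    fn is_in_memory(&self, key: &str) -> Option<bool> {
        self.index
            .get(key)
            .map(|e| matches!(e, CachedObject::Memory(_)))
    }
}
```

A persistent cache loses exactly this simplicity: the index itself has to survive restarts and stay consistent with the files on disk.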

On the other hand, persistent caches with dynamic updates require extra mapping management, external to rpxy. So, from this point of view, I agree with @Gamerboy59, since it must involve the overhead of lookup operations in the external table. (But this might be negligible with an appropriate database.)

Gamerboy59 commented 1 month ago

Caching can be crucial to some applications, but if it's implemented in rpxy it will affect all applications, since the cache function runs on every call. For most basic applications the current hybrid cache should be sufficient. For applications that allow caching at a larger scale (e.g. WordPress), I'd consider a per-application proxy like Varnish. You would then configure rpxy to forward some websites to the Varnish cache server and all other websites directly to their endpoints.

e.g. for a WordPress website:
rpxy (Varnish as upstream, e.g. localhost:6081) -> Varnish (Apache as upstream, e.g. localhost:8080) -> Apache with WordPress
in short: rpxy -> varnish -> apache

e.g. for other websites that don't require caching:
rpxy (Apache as the usual upstream, e.g. localhost:8080) -> Apache
in short: rpxy -> apache
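Assuming rpxy's TOML configuration format (the app names, domains and ports below are placeholders, not a tested config), the split might look roughly like:

```toml
listen_port = 80

# WordPress site: rpxy -> varnish (which in turn proxies apache)
[apps.blog]
server_name = "blog.example.com"
reverse_proxy = [{ upstream = [{ location = "localhost:6081" }] }]

# Uncached site: rpxy -> apache directly
[apps.plain]
server_name = "www.example.com"
reverse_proxy = [{ upstream = [{ location = "localhost:8080" }] }]
```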

This way you could deploy rpxy to make use of its flexibility and speed, but also use a dedicated and highly effective cache where necessary. I haven't done any benchmarking, but I know how effective Varnish is, so I believe you might get a better cache hit rate with this setup than if rpxy implemented deeper caching functions.

nisbet-hubbard commented 1 month ago

Excellent points.

How should we manage the mapping between requested URLs and the actual files? If we consider the persistence of files, we may need to manage it with a database or other data structures.

For what it might be worth, both nginx and Apache use the filesystem for this.

Apache is particularly explicit about its implementation: ‘To store items in the cache, mod_cache_disk creates a 22 character hash of the URL being requested. This hash incorporates the hostname, protocol, port, path and any CGI arguments to the URL, as well as elements defined by the Vary header to ensure that multiple URLs do not collide with one another. Each character may be any one of 64-different characters, which mean that overall there are 64^22 possible hashes. For example, a URL might be hashed to xyTGxSMO2b68mBCykqkp1w. This hash is used as a prefix for the naming of the files specific to that URL within the cache, however first it is split up into directories as per the CacheDirLevels and CacheDirLength directives. CacheDirLevels specifies how many levels of subdirectory there should be, and CacheDirLength specifies how many characters should be in each directory. With the example settings given above, the hash would be turned into a filename prefix as /var/cache/apache/x/y/TGxSMO2b68mBCykqkp1w. The overall aim of this technique is to reduce the number of subdirectories or files that may be in a particular directory, as most file-systems slow down as this number increases. With setting of "1" for CacheDirLength there can at most be 64 subdirectories at any particular level. With a setting of 2 there can be 64 * 64 subdirectories, and so on. Unless you have a good reason not to, using a setting of "1" for CacheDirLength is recommended. Setting CacheDirLevels depends on how many files you anticipate to store in the cache. With the setting of "2" used in the above example, a grand total of 4096 subdirectories can ultimately be created. With 1 million files cached, this works out at roughly 245 cached URLs per directory.’
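The directory-splitting step described there is simple to reproduce. Here's a sketch (my own illustration of the scheme, not Apache's actual code):

```rust
// Sketch of Apache-style cache path construction: peel off
// CacheDirLevels groups of CacheDirLength characters from the URL
// hash as nested directory names, and keep the remainder as the
// file-name prefix inside the deepest directory.
fn cache_path(root: &str, hash: &str, dir_levels: usize, dir_length: usize) -> String {
    let mut path = String::from(root);
    let mut rest = hash;
    for _ in 0..dir_levels {
        // Take the next dir_length characters as one directory level.
        let (dir, tail) = rest.split_at(dir_length);
        path.push('/');
        path.push_str(dir);
        rest = tail;
    }
    path.push('/');
    path.push_str(rest);
    path
}
```

With the hash and settings from Apache's example (CacheDirLevels=2, CacheDirLength=1), `cache_path("/var/cache/apache", "xyTGxSMO2b68mBCykqkp1w", 2, 1)` yields `/var/cache/apache/x/y/TGxSMO2b68mBCykqkp1w`.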

nginx copies this structure almost entirely.

if persistent caches are to be updated dynamically (insertion of files), the problem becomes more complex

Right, a cache lock is usually employed.
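As an illustration of what such a lock buys you (a hypothetical sketch; none of these names come from nginx, Apache or rpxy): while one request fills a missing cache entry, concurrent requests for the same key wait on a per-key lock instead of all stampeding the upstream.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical per-key cache lock: the first request for a missing
// key acquires that key's lock and refreshes the cache; concurrent
// requests for the same key block on the same lock rather than each
// issuing its own upstream fetch ("dogpile" protection).
struct CacheLocks {
    locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl CacheLocks {
    fn new() -> Self {
        Self { locks: Mutex::new(HashMap::new()) }
    }

    // Returns the shared lock for this key, creating it on first use.
    fn lock_for(&self, key: &str) -> Arc<Mutex<()>> {
        let mut map = self.locks.lock().unwrap();
        map.entry(key.to_string())
            .or_insert_with(|| Arc::new(Mutex::new(())))
            .clone()
    }
}
```

A request handler would call `lock_for(key)`, take the returned mutex, and re-check the cache before fetching, so only one fetch per key is in flight.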

rpxy -> varnish -> apache

Is there any reason to prefer this setup over the simpler rpxy -> nginx? nginx is itself a caching proxy, and as a web server it handles both dynamic and static content faster than Apache thanks to its event loop, which Apache's event MPM can't really match in terms of non-blocking behaviour and low memory consumption.

Gamerboy59 commented 2 weeks ago

rpxy -> varnish -> apache

Is there any reason to prefer this setup over the simpler rpxy -> nginx?

Yes, Varnish has a much better caching rate in my experience, and it integrates well into many webapps like WordPress (via a Varnish plugin). Cache management is much simpler with it. Additionally, Apache supports .htaccess, Ruby and other stuff that is more complex to set up on nginx, especially in a shared environment. The memory footprint is slightly higher for Apache but still reasonable imo. Loading an entire website is usually slowed down by the application server behind it (PHP, Ruby or Perl), where the web server is not the bottleneck.
For static content nginx, yes, but for everything else apache imo.

nisbet-hubbard commented 2 weeks ago

Right, how much slower is very much application-dependent. Proxying (non-blocking) Node.js apps over WebSocket is where Apache is significantly slower.

Another thing that distinguishes Varnish from both nginx and Apache's mod_cache is that it does in-memory caching by default, like rpxy.