nisbet-hubbard opened 2 weeks ago
If implemented, this should be placed behind a big warning about performance compared to the in-memory cache. I'm not against implementing it, but it shouldn't be a prominent feature, because rpxy is aimed at being fast, and disk I/O acts as a brake.
With sendfile(), a disk cache can be faster than an in-memory cache. openssl has SSL_sendfile; for rustls, see https://github.com/rustls/ktls
By all means document any trade-offs (e.g. slower warmup on reboot in exchange for faster cache serving), but benchmark it first. And then make sure nginx is configured with kTLS in the benchmark too.
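To make the contrast concrete: the userspace path that sendfile() avoids looks like the loop below, where every cached byte crosses the kernel/user boundary twice. sendfile(2) (and SSL_sendfile/kTLS for encrypted connections) collapses this into a single in-kernel transfer, which is why a disk cache served that way can compete with an in-memory one. A minimal std-only Rust sketch of the userspace copy, purely for illustration:

```rust
use std::fs::File;
use std::io::{self, Read, Write};

// Userspace copy loop: each chunk is read into `buf` in the proxy process
// and then written back out, i.e. two kernel/user crossings per chunk.
// sendfile(2) eliminates exactly this round trip by copying inside the kernel.
fn serve_from_disk(src: &mut File, dst: &mut impl Write) -> io::Result<u64> {
    let mut buf = [0u8; 16 * 1024];
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            return Ok(total);
        }
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
}
```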
Hey, I believe the "persistence" problem should be handled separately from the use of kTLS. Yes, rustls using kernel TLS would be faster than vanilla rustls, and SSL_sendfile could also be a silver bullet for quick I/O operations.
But but but! How should we manage the mapping of requested URLs to exact files? If we consider the persistence of files, we may need to manage it using a database or another data structure. Also, if persistent caches must be updated dynamically (insertion of files), the problem becomes more complex.
Currently, I have taken a hybrid caching approach with in-memory and disk caches, both of which are ephemeral (deleted once rpxy restarts). So the management of the mapping is really simple and fast, just a kind of hash table managed directly by rpxy itself. (For this disk cache, we might be able to use rustls with kTLS? Not sure at this point.)
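The ephemeral hybrid scheme described above can be sketched roughly as follows; the type and field names here are illustrative, not rpxy's actual internals. Small bodies stay in memory, large ones spill to disk files named by a hash of the URL, and the whole table is lost on restart, so no external database is needed:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

// Hypothetical sketch of an ephemeral hybrid cache map. Because the table
// itself is never persisted, the URL-to-file mapping is just a HashMap.
enum CacheEntry {
    Memory(Vec<u8>),
    Disk(PathBuf),
}

struct HybridCache {
    map: HashMap<String, CacheEntry>,
    memory_limit: usize, // bodies larger than this spill to disk
}

impl HybridCache {
    fn new(memory_limit: usize) -> Self {
        Self { map: HashMap::new(), memory_limit }
    }

    fn insert(&mut self, url: String, body: Vec<u8>, spill_dir: &Path) -> std::io::Result<()> {
        let entry = if body.len() <= self.memory_limit {
            CacheEntry::Memory(body)
        } else {
            // Name the spill file by a hash of the URL (DefaultHasher is a
            // stand-in; a real implementation would use a stronger digest).
            let mut h = DefaultHasher::new();
            url.hash(&mut h);
            let path = spill_dir.join(format!("{:016x}", h.finish()));
            std::fs::write(&path, &body)?;
            CacheEntry::Disk(path)
        };
        self.map.insert(url, entry);
        Ok(())
    }

    fn get(&self, url: &str) -> Option<std::io::Result<Vec<u8>>> {
        self.map.get(url).map(|e| match e {
            CacheEntry::Memory(b) => Ok(b.clone()),
            CacheEntry::Disk(p) => std::fs::read(p),
        })
    }
}
```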
On the other hand, persistent caches with dynamic updates require extra mapping management, external to rpxy. So, from this point of view, I agree with @Gamerboy59, since it must involve the overhead of search operations in an external table. (But this might be negligible with an appropriate database.)
Caching can be crucial to some applications, but if it's implemented in rpxy it will affect all applications, since the cache function will run on every call. For most basic applications the current hybrid cache should be sufficient. For applications that allow caching on a larger scale (e.g. wordpress), I'd consider a per-application proxy like varnish. So you'd have a setup where you configure rpxy to forward some websites to the varnish cache server and all other websites directly to their endpoint.
e.g. for a wordpress website:
rpxy (varnish as upstream, e.g. localhost:6081) -> varnish (apache as upstream, e.g. localhost:8080) -> apache with wordpress
in short: rpxy -> varnish -> apache
e.g. for other websites that don't require caching:
rpxy (apache as usual upstream, e.g. localhost:8080) -> apache
in short: rpxy -> apache
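The split routing above might be expressed along these lines in rpxy's TOML configuration. I'm writing the field names from memory, so treat this purely as a sketch of the idea to check against rpxy's config reference, not as a working config:

```toml
# Hypothetical sketch only -- field names may not match rpxy's actual schema.
listen_port = 443

# WordPress site: hand off to varnish, which fronts apache
[apps.wordpress]
server_name = "blog.example.com"
reverse_proxy = [{ upstream = [{ location = "localhost:6081" }] }]

# Everything else: straight to apache
[apps.plain]
server_name = "www.example.com"
reverse_proxy = [{ upstream = [{ location = "localhost:8080" }] }]
```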
This way you can deploy rpxy to make use of its flexibility and speed, but also use a dedicated and highly effective cache where necessary. I haven't done any benchmarking, but I know how effective varnish is, hence I believe you might get a better cache hit rate with this setup than if rpxy implemented deeper caching functions.
Excellent points.
> How should we manage the mapping of requested URLs to exact files? If we consider the persistence of files, we may need to manage it using a database or another data structure.
For what it might be worth, both nginx and Apache use filesystem for this.
Apache is particularly explicit about its implementation: ‘To store items in the cache, mod_cache_disk creates a 22 character hash of the URL being requested. This hash incorporates the hostname, protocol, port, path and any CGI arguments to the URL, as well as elements defined by the Vary header to ensure that multiple URLs do not collide with one another. Each character may be any one of 64-different characters, which mean that overall there are 64^22 possible hashes. For example, a URL might be hashed to xyTGxSMO2b68mBCykqkp1w. This hash is used as a prefix for the naming of the files specific to that URL within the cache, however first it is split up into directories as per the CacheDirLevels and CacheDirLength directives. CacheDirLevels specifies how many levels of subdirectory there should be, and CacheDirLength specifies how many characters should be in each directory. With the example settings given above, the hash would be turned into a filename prefix as /var/cache/apache/x/y/TGxSMO2b68mBCykqkp1w. The overall aim of this technique is to reduce the number of subdirectories or files that may be in a particular directory, as most file-systems slow down as this number increases. With setting of "1" for CacheDirLength there can at most be 64 subdirectories at any particular level. With a setting of 2 there can be 64 * 64 subdirectories, and so on. Unless you have a good reason not to, using a setting of "1" for CacheDirLength is recommended. Setting CacheDirLevels depends on how many files you anticipate to store in the cache. With the setting of "2" used in the above example, a grand total of 4096 subdirectories can ultimately be created. With 1 million files cached, this works out at roughly 245 cached URLs per directory.’
nginx copies this structure almost entirely.
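The Apache scheme quoted above (hash the full cache key, then split the leading characters of the hash into shallow subdirectory levels so no single directory grows too large) is simple enough to sketch in a few lines of Rust. This is only an illustration of the layout, with DefaultHasher standing in for Apache's 22-character hash and hypothetical parameter names mirroring CacheDirLevels/CacheDirLength:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

// Map a cache key (URL plus Vary material) to an on-disk path,
// Apache mod_cache_disk style: `levels` subdirectory levels of
// `dir_len` characters each, taken from the front of the hash.
fn cache_path(root: &str, key: &str, levels: usize, dir_len: usize) -> PathBuf {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    let hex = format!("{:016x}", h.finish()); // 16 hex chars
    let mut path = PathBuf::from(root);
    let mut rest = hex.as_str();
    for _ in 0..levels {
        let (dir, tail) = rest.split_at(dir_len);
        path.push(dir);
        rest = tail;
    }
    path.push(rest); // remainder of the hash names the file itself
    path
}
```

With `levels = 2` and `dir_len = 1`, a key hashes to something like `/var/cache/rpxy/x/y/...`, capping each level at 16 subdirectories here (64 in Apache's alphabet), which keeps per-directory entry counts small on filesystems that degrade as directories grow.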
> if persistent caches should be updated dynamically (insertion of files), this problem would be more complex
Right, a cache lock is usually employed.
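A cache lock in this context usually means that when several requests miss on the same key at once, only the first one fills the cache while the others fall through to the origin (or wait), so concurrent writers never race on the same cache file. A minimal std-only sketch with hypothetical names:

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

// Tracks which cache keys currently have a fill in progress. The first
// caller to claim a key becomes the writer; later callers for the same
// key see the claim and skip the cache update.
#[derive(Clone, Default)]
struct CacheLock {
    in_flight: Arc<Mutex<HashSet<String>>>,
}

impl CacheLock {
    /// Returns true if the caller acquired the fill lock for `key`.
    fn try_claim(&self, key: &str) -> bool {
        self.in_flight.lock().unwrap().insert(key.to_string())
    }

    /// Releases the lock once the cache entry has been written.
    fn release(&self, key: &str) {
        self.in_flight.lock().unwrap().remove(key);
    }
}
```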
> rpxy -> varnish -> apache
Is there any reason to prefer this setup over the simpler rpxy -> nginx? nginx is itself a caching proxy, and as a web server it handles both dynamic and static content faster than Apache because of its event loop, which Apache's event MPM can't really match in terms of non-blocking operation and low memory consumption.
> rpxy -> varnish -> apache
> Is there any reason to prefer this setup over the simpler rpxy -> nginx?
Yes, varnish has a much better caching rate in my experience, and it integrates well with many webapps like wordpress (varnish plugin). Cache management is much simpler with it. Additionally, apache supports .htaccess, ruby and other stuff that is more complex on nginx, especially in a shared environment. The memory footprint is slightly higher for apache but still reasonable imo. Loading an entire website is usually slowed down by the application server behind it (like php, ruby or perl), where the webserver is not the bottleneck.
For static content nginx, yes, but for everything else apache imo.
A persistent disk cache appears to be one of the remaining gaps to fill before rpxy reaches baseline feature parity with traditional reverse proxies such as nginx, Apache and Caddy. Since it's already on your TODO list, I figure it might not be out of place to have a thread tracking this.
Would the cacache crate fit the bill here?
https://github.com/zkat/cacache-rs