openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
http://openfoodfacts.github.io/openfoodfacts-server/
GNU Affero General Public License v3.0

Cache /api/v0/preferences and /api/v0/attribute_groups #8957


CharlesNepote commented 1 year ago

Based on an analysis of 50 million nginx log lines, we have found that these URLs represent respectively 2.83% and 2.79% (5.62% combined) of all requests.

These two files are used to set up preferences. They are generated by Perl, without any database access. See:

It's very easy and efficient to cache them with nginx for a few dozen seconds (1 minute should be OK, said Stéphane).

We currently (2023-09) have around 3,000 to 6,000 requests per minute. Caching 5.6% of the requests would save around 170 to 330 requests per minute. It would also help in case of traffic peaks.
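For the record, the share of these two endpoints can be recomputed from the access logs. A minimal sketch, assuming the default combined log format (request path in field 7) and a log file named access.log (both assumptions):

# Share of the two endpoints among all requests
total=$(wc -l < access.log)
hits=$(awk '$7 ~ /^\/api\/v0\/(preferences|attribute_groups)/' access.log | wc -l)
awk -v h="$hits" -v t="$total" 'BEGIN { printf "%d/%d requests (%.2f%%)\n", h, t, 100*h/t }'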

The nginx conf could be configured this way:

# ***
# * Cache
# *
# * - Introducing article: https://www.nginx.com/blog/nginx-caching-guide/
# * - Long article: https://www.nginx.com/blog/nginx-high-performance-caching/#BasicPrinciplesofContentCaching
# * - Firefox extension to debug http headers: https://addons.mozilla.org/en-US/firefox/addon/http-header-live/
# * 
# * Only two directives are needed to enable basic caching: proxy_cache_path (http{} level) and proxy_cache (server{} or location{} level).
# * The proxy_cache_path directive sets the path and configuration of the cache, and the proxy_cache directive activates it.
# *   levels:    sets up a two-level directory hierarchy under /path/to/cache/. Having a large number of files in a
# *              single directory can slow down file access, so we recommend a two-level directory hierarchy for most 
# *              deployments. If the levels parameter is not included, NGINX puts all files in the same directory.
# *   keys_zone: sets up a shared memory zone for storing the cache keys and metadata such as usage timers. 
# *              Having a copy of the keys in memory enables NGINX to quickly determine if a request is a HIT 
# *              or a MISS without having to go to disk, greatly speeding up the check. A 1MB zone can store 
# *              data for about 8,000 keys, so the 60MB zone configured here can store data for about 480,000 keys.
# *   inactive:  specifies how long an item can remain in the cache without being accessed. Here, a file that 
# *              has not been requested for 1 hour (1h) is automatically deleted from the cache by the cache manager 
# *              process, regardless of whether or not it has expired. The default value is 10 minutes (10m). Inactive 
# *              content differs from expired content. NGINX does not automatically delete content that has expired 
# *              as defined by a cache control header (Cache-Control: max-age=120 for example). Expired (stale) content 
# *              is deleted only when it has not been accessed for the time specified by inactive. When expired content 
# *              is accessed, NGINX refreshes it from the origin server and resets the inactive timer.
# *   max_size:  sets the upper limit of the size of the cache (200 MB in this example). It is optional; not specifying 
# *              a value allows the cache to grow to use all available disk space. When the cache size reaches the limit, 
# *              a process called the cache manager removes the files that were least recently used to bring the cache size back under the limit.
#
# You can check the directory size from time to time: du -sh /var/cache/nginx
proxy_cache_path  /var/cache/nginx  levels=1:2  keys_zone=cachezone:60m  inactive=1h  max_size=200m;

server {
    location ~ ^/api/v./(preferences|attribute_groups) {
        # Activate cache configuration named "cachezone"
        proxy_cache             cachezone;

        # proxy_cache_valid indicates which response status codes are cached and for how long
        proxy_cache_valid       any  1m;

        # proxy_cache_use_stale: delivers cached content when the origin is down
        # "Additionally, the updating parameter permits using a stale cached response if it is 
        #  currently being updated. This allows minimizing the number of accesses to proxied servers 
        #  when updating cached data.
        proxy_cache_use_stale          updating error timeout http_500 http_502 http_503 http_504;

        # Adds an X-Cache-Status HTTP header in responses to clients: helps debugging the
        # cache.
        # https://www.nginx.com/blog/nginx-caching-guide/#Frequently-Asked-Questions-(FAQ)
        # Eg. X-Cache-Status: HIT
        add_header X-Cache-Status $upstream_cache_status;
    }
}
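Once in place, the cache can be checked from outside with two successive requests; a quick sketch (using the production URL, and relying on the add_header directive above):

# The first request typically warms the cache (X-Cache-Status: MISS);
# a second one within the 1-minute validity should be a HIT
curl -sI https://world.openfoodfacts.org/api/v0/preferences | grep -i x-cache-status
curl -sI https://world.openfoodfacts.org/api/v0/preferences | grep -i x-cache-status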

To debug and analyze the cache hits, it's possible to create a temporary dedicated log (source: https://serverfault.com/a/912897):

# This directive needs to be placed in the nginx global configuration (http context)
log_format cache_st '$remote_addr - $upstream_cache_status [$time_local]  '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';

And:

# The path needs to be adapted. These logs should be temporary: please do not keep them after the tests.
access_log   /var/log/nginx/domain.com.cache.log cache_st;
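With this format, field 3 of each line holds the cache status. A hypothetical resulting line (all values made up) would look like:

203.0.113.42 - HIT [25/Sep/2023:10:12:01 +0200] "GET /api/v0/preferences HTTP/1.1" 200 1234 "-" "Mozilla/5.0"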

Then it's easy to get some stats about the cache and to verify it is working and efficient (source: https://serverfault.com/a/912897):

# HIT vs MISS vs BYPASS vs EXPIRED
awk '{print $3}' cache.log | sort | uniq -c | sort -r

# MISS URLs
awk '($3 ~ /MISS/)' cache.log | awk '{print $7}' | sort | uniq -c | sort -r

# BYPASS URLs
awk '($3 ~ /BYPASS/)' cache.log | awk '{print $7}' | sort | uniq -c | sort -r
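The overall hit ratio can be computed the same way; a small sketch on the same cache.log (field 3 holds $upstream_cache_status):

awk '{ t++ } $3 == "HIT" { h++ } END { if (t) printf "%.2f%% HIT (%d/%d)\n", 100 * h / t, h, t }' cache.log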

Part of

teolemon commented 1 year ago

@CharlesNepote the website does those requests, right? In the app we ping once during setup, and then periodically, but I guess we cache it. @monsieurtanuki @g123k

alexgarel commented 1 year ago

@CharlesNepote using https://nginx.org/en/docs/http/ngx_http_memcached_module.html could be far more efficient (we already have the memcached server)

It seems to be available with the nginx-extras package.
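For reference, a minimal sketch of what this could look like; the memcached address, the key scheme, and the @backend_fallback / "backend" upstream names are assumptions, and the Perl side would have to store the rendered responses in memcached under the same keys:

location ~ ^/api/v./(preferences|attribute_groups) {
    # Look the response up in memcached first, keyed by the request URI
    set $memcached_key "$uri";
    memcached_pass 127.0.0.1:11211;
    default_type application/json;
    # On a miss (404) or memcached error, fall back to the regular backend
    error_page 404 502 504 = @backend_fallback;
}

location @backend_fallback {
    # "backend" is an assumed upstream pointing at the Perl server
    proxy_pass http://backend;
}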

CharlesNepote commented 1 year ago

I have done more interesting computations, based on 41 million requests from nginx logs.

  1. Every minute, these files* are requested at least 16 times, and sometimes more than 360 times!
  2. If configured to be active for one minute, the cache would serve 99.21% of the requests for /api/v0/attribute_groups.

*https://world.openfoodfacts.org/api/v0/preferences and https://world.openfoodfacts.org/api/v0/attribute_groups

This means that the cache would be very efficient.
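That figure is consistent with the request rates above: with a 1-minute validity, only the first request in each minute is a MISS, so at 16 requests per minute the hit ratio is already 15/16 ≈ 93.8%, and at 360 requests per minute it reaches 359/360 ≈ 99.7%.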

@alexgarel I don't understand why using memcached would be far more efficient versus reading the cache from the filesystem. It would also be more complicated. I feel that nginx directives without any dependency are more robust.

alexgarel commented 1 year ago

@CharlesNepote You are right about the fact that the files will be in cache. So you can implement it that way.