nginxinc / nginx-s3-gateway

NGINX S3 Caching Gateway
Apache License 2.0

No caching when using Docker image. #64

Open kristoferlundgren opened 2 years ago

kristoferlundgren commented 2 years ago

Describe the bug

Using the latest Docker image, no data is being cached.

To Reproduce

Steps to reproduce the behavior:

  1. Start the container:

     docker run --rm -ti -p 80:80 -e S3_SERVER=storage.googleapis.com -e S3_ACCESS_KEY_ID="<key>" -e S3_SECRET_KEY="<secret>" --env-file s3.env nginxinc/nginx-s3-gateway:latest-20221026

s3.env file:

S3_BUCKET_NAME=<bucket-name>
S3_SERVER_PORT=443
S3_SERVER_PROTO=https
S3_REGION=us-east-1
S3_STYLE=virtual
S3_DEBUG=false
AWS_SIGS_VERSION=4
ALLOW_DIRECTORY_LIST=true
PROVIDE_INDEX_PAGE=false
APPEND_SLASH_FOR_POSSIBLE_DIRECTORY=false
PROXY_CACHE_VALID_OK=1h
PROXY_CACHE_VALID_NOTFOUND=1m
PROXY_CACHE_VALID_FORBIDDEN=30s

  2. Pull multiple files from the S3 gateway at http://localhost

I can successfully browse the S3 bucket directory structure and download objects without any issue. However, when downloading the same object multiple times, I cannot see any performance increase from a cache hit.

  3. Exec into the container and list the cache directory:

     docker exec -ti <container> bash
     ls -la /var/cache/nginx/s3_proxy/

The cache directory is empty. I also looked for any disk usage increase with the command du -sh /* but no cached data is being stored in the container.

Expected behavior

According to the documentation, data should be cached when accessed multiple times and not reloaded from the remote S3 bucket at each access.

Your environment

dekobon commented 2 years ago

Thank you for writing up this issue in such detail.

So far, I've been unable to reproduce this bug using AWS. In my configuration, I've put a text file on my S3 bucket and ran curl against it in a loop.

I saw that the cache files were correctly populated in the /var/cache/nginx/s3_proxy directory. I also monitored the instance for outbound connections via netstat and I only saw outbound connections every minute or so.

On my container, the contents of the cache directory look like:

root@88822b1c11cd:/var/cache/nginx/s3_proxy# find /var/cache/nginx/s3_proxy/
/var/cache/nginx/s3_proxy/
/var/cache/nginx/s3_proxy/1
/var/cache/nginx/s3_proxy/1/93
/var/cache/nginx/s3_proxy/1/93/b620bfa0e09b3cc11521660acb6e2931
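
The check described above (repeated curl requests against one object, then a look at the cache directory) could be sketched roughly as follows. This is only an illustration: the function name time_requests and the object path /test.txt are hypothetical, and it assumes the gateway is listening on http://localhost as in the original report.

```shell
#!/bin/sh
# Sketch: request the same object several times and print the HTTP status
# and total transfer time of each request. With a warm cache, later
# requests would be expected to complete faster than the first one.
# Usage: time_requests URL [COUNT]
time_requests() {
    url=$1
    count=${2:-5}
    i=1
    while [ "$i" -le "$count" ]; do
        # -s: silent, -o /dev/null: discard the body, -w: print status and timing
        curl -s -o /dev/null -w '%{http_code} %{time_total}\n' "$url"
        i=$((i + 1))
    done
}

# Example (hypothetical object path):
#   time_requests http://localhost/test.txt 5
# Then, inside the container, list any cache entries:
#   find /var/cache/nginx/s3_proxy/ -type f
```

If the find command prints no files after several requests, nothing has been written to the cache, which matches the symptom reported in this issue.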

I'll go and try to see if I can reproduce the issue on Google Cloud Storage.

dekobon commented 2 years ago

I just ran the same configuration against Google Cloud Storage and I was able to reproduce the behavior.

dekobon commented 2 years ago

I found the source of the issue. Google Cloud Storage diverges from the AWS S3 behavior by setting Cache-Control: private, max-age=0 by default for all objects. You need to edit the metadata for your object on Google Cloud Storage and change the value of Cache-Control to public in order to enable caching with the gateway. See the Cloud Storage Documentation for more information.

There may be a way to configure NGINX to ignore the header sent by Google Cloud Storage by using the proxy_ignore_headers directive to ignore the Cache-Control header.
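
To illustrate why a value such as private, max-age=0 keeps responses out of the cache, here is a small sketch of the decision. It is a simplified approximation, not NGINX's actual logic: it only assumes that a proxy cache skips responses marked private, no-cache, no-store, or with a zero max-age. The function name cache_decision is hypothetical.

```shell
#!/bin/sh
# Sketch (simplified approximation, not NGINX's actual code): report whether
# a Cache-Control response header value would stop a proxy cache from
# storing the response. Prints "skip" or "store".
cache_decision() {
    # Lowercase and remove spaces so directives can be matched uniformly.
    value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -d ' ')
    old_ifs=$IFS
    IFS=,
    for directive in $value; do
        case $directive in
            private*|no-cache*|no-store|max-age=0|s-maxage=0)
                IFS=$old_ifs
                echo skip
                return 0
                ;;
        esac
    done
    IFS=$old_ifs
    echo store
}

cache_decision "private, max-age=0"   # value reported above for Google Cloud Storage
cache_decision "public"               # value suggested above
```

Under these assumptions the first call prints skip and the second prints store, which is consistent with the behavior seen against the two storage backends.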

kristoferlundgren commented 2 years ago

Many thanks for tracking down the root cause of this issue.

As you (@dekobon) suggested, I added proxy_ignore_headers Cache-Control; to the http {} part of /etc/nginx/nginx.conf and ran nginx -s reload inside the container. And voilà, it works! Files are now cached, as expected.

I now have some choices.

  1. Mount my own /etc/nginx/nginx.conf into the container.
  2. Build and run a modified container image with this tiny modification.
  3. Ask this project to add the proxy_ignore_headers Cache-Control; as part of the config. Preferably configurable with an environment variable.

I would like to ask for no. 3 first. What are your thoughts?

Again, thanks!

dekobon commented 2 years ago

I think asking for number three is reasonable. We may need a generalized way to accomplish this because we also need to solve for #65.

dekobon commented 2 years ago

I've made some updates to the container so that you can now layer in additional NGINX configuration. See the documentation.

Also, I added a feature that allows you to strip out headers from the client response. For Google Cloud Storage you will want to do:

HEADER_PREFIXES_TO_STRIP=x-goog-;x-guploader-uploadid

Please let me know if this solution works for you. If it does, I'll mark this issue as closed.

kristoferlundgren commented 2 years ago

  1. First try: using the new feature by adding the Cache-Control header:

     HEADER_PREFIXES_TO_STRIP="x-goog-;x-guploader-uploadid;Cache-Control"

     This resulted in the error: HEADER_PREFIXES_TO_STRIP must not contain uppercase characters (as documented)

  2. Second try (lowercase cache-control):

     HEADER_PREFIXES_TO_STRIP="x-goog-;x-guploader-uploadid;cache-control"

     Downloaded some files and then checked the cache directory. It was empty, i.e. the cache is still disabled.

  3. Third try (stripping x-goog headers and mounting an nginx http config file):

     docker run --rm -ti -p 80:80 -e S3_SERVER=storage.googleapis.com -e S3_ACCESS_KEY_ID="<key>" -e S3_SECRET_KEY="<secret>" -e HEADER_PREFIXES_TO_STRIP="x-goog-;x-guploader-uploadid" --env-file s3.env -v $(pwd)/cache.conf:/etc/nginx/conf.d/cache.conf nginxinc/nginx-s3-gateway:latest

     where the $(pwd)/cache.conf file contains:

     proxy_ignore_headers Cache-Control;

     Downloaded some files and then checked the cache directory. The cache directory has content, i.e. the cache is working! :)

I would have preferred an environment variable solution, but this config works as well. Many thanks for the assessment and quick remediation of this issue. And also reporting and fixing #65.

Before closing this issue, I believe the need for proxy_ignore_headers Cache-Control; ought to be documented to help users whose S3 backends (e.g. Google Cloud Storage) emit caching preferences.

dekobon commented 2 years ago

I agree it should be documented. Also, we may want to add an environment variable that allows for ignoring cache control, but I wanted to get the extensibility part done ASAP because we've gotten a lot of requests for similar things and the number of environment variables is starting to add up.

I'll leave this issue open until we can add a setting.

akashgreninja commented 9 months ago

I made the stupid mistake of exec-ing into the wrong running container with the same name, so I didn't find any cache. Check whether this might also be the reason in your case.

felipou commented 4 days ago

I've just experienced this issue, and in addition to ignoring the Cache-Control header, I also had to ignore the Expires header for it to work:

proxy_ignore_headers Cache-Control;
proxy_ignore_headers Expires;
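
For anyone mounting an extra config file as described earlier in this thread, the two directives can also be written as one, since proxy_ignore_headers accepts several field names. The sketch below assumes the file is named cache.conf and is mounted at /etc/nginx/conf.d/cache.conf, as in the earlier comment.

```shell
#!/bin/sh
# Sketch: write an extra NGINX config file that tells the proxy cache to
# disregard both response headers mentioned in this thread.
# proxy_ignore_headers accepts several field names in one directive.
cat > cache.conf <<'EOF'
proxy_ignore_headers Cache-Control Expires;
EOF

# The file can then be mounted with the volume option used earlier in this
# thread, added to the docker run command shown there:
#   -v "$(pwd)/cache.conf:/etc/nginx/conf.d/cache.conf"
```

With both headers ignored, how long responses stay cached would presumably be governed by the PROXY_CACHE_VALID_* settings rather than by what the storage backend sends.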

kristoferlundgren commented 4 days ago

@dekobon @4141done You two seem to be the current maintainers. I really appreciate your effort to keep the project alive!

From reading various discussions on the subject of caching in this GitHub project, there seems to be a general request to have more control of the ingress and egress cache configuration. Mounting my own cache.conf, replacing the default, still seems like a hack. Is there a more intuitive way to manage cache configuration, or can one be developed with a reasonable effort?