ruudk closed this issue 4 years ago.
The redirect responses are cached, and at some point these URLs expire on CloudFront/S3.
This leads to an ErrImagePull error for pods that try to fetch Docker images via the docker-registry-proxy:
Failed to pull image "quay.io/calico/cni:v1.11.5": rpc error: code = Unknown desc = error parsing HTTP 403 response body: invalid character '<' looking for beginning of value: "<?xml version=\"1.0\" encoding=\"UTF-8\"?><Error><Code>AccessDenied</Code><Message>Access denied</Message></Error>"
Yeah, aggressive caching was a bit the point of this proxy; see the logic in nginx.conf from L202 on. Ideal scenario: during the first client's request to a new blob, the 302 response should be cached. If the original client then followed the redirect with a GET for the S3 signed resource, that response should have been cached as well.
Obviously it's not working that way so needs some digging.
Are you at or near the cache size limit? Long uptime?
If the 302 is still in cache, but the cloudfront 200 was purged, you could get into that scenario.
Watch the proxy logs during a docker rmi/pull combo for the image in question: are there HITs for the blob 302s, but MISSes for the related /forcecachesecure request?
I was able to reproduce this locally.
Start mirror:
docker run --rm --name docker_registry_proxy -it -p 0.0.0.0:3128:3128 -v /Volumes/CS/www/website/docker_mirror_cache:/docker_mirror_cache -e REGISTRIES="k8s.gcr.io gcr.io quay.io your.own.registry another.public.registry" rpardini/docker-registry-proxy:0.2.4
The logs were cleared after starting up; this is what is shown:
// cache empty
MISS [06/Jan/2020:08:32:00 +0000] "/v2/calico/cni/blobs/sha256:2c72d89b5a7fa12faeb77e0fccde73303b21ba3ca65ed614dfb1a8725c5a339f" 302 1261 "HOST:quay.io" "PROXY-HOST:quay.io" "UPSTREAM:107.21.16.71:443"
MISS [06/Jan/2020:08:32:01 +0000] "/forcecachesecure/d3uo42mtx6z2cr.cloudfront.net/originalwas/sha256/2c/2c72d89b5a7fa12faeb77e0fccde73303b21ba3ca65ed614dfb1a8725c5a339f" 200 385 "HOST:docker.caching.proxy.internal" "PROXY-HOST:d3uo42mtx6z2cr.cloudfront.net" "UPSTREAM:13.224.86.46:443"
// remove the cached redirect response
MISS [06/Jan/2020:08:33:02 +0000] "/v2/calico/cni/blobs/sha256:2c72d89b5a7fa12faeb77e0fccde73303b21ba3ca65ed614dfb1a8725c5a339f" 302 1261 "HOST:quay.io" "PROXY-HOST:quay.io" "UPSTREAM:34.231.142.215:443"
HIT [06/Jan/2020:08:33:02 +0000] "/forcecachesecure/d3uo42mtx6z2cr.cloudfront.net/originalwas/sha256/2c/2c72d89b5a7fa12faeb77e0fccde73303b21ba3ca65ed614dfb1a8725c5a339f" 200 385 "HOST:docker.caching.proxy.internal" "PROXY-HOST:d3uo42mtx6z2cr.cloudfront.net" "UPSTREAM:-"
// remove the cached binary but leave the redirect response
// wait for 10 minutes to make sure it's expired?
HIT [06/Jan/2020:08:44:12 +0000] "/v2/calico/cni/blobs/sha256:2c72d89b5a7fa12faeb77e0fccde73303b21ba3ca65ed614dfb1a8725c5a339f" 302 1261 "HOST:quay.io" "PROXY-HOST:quay.io" "UPSTREAM:-"
MISS [06/Jan/2020:08:44:12 +0000] "/forcecachesecure/d3uo42mtx6z2cr.cloudfront.net/originalwas/sha256/2c/2c72d89b5a7fa12faeb77e0fccde73303b21ba3ca65ed614dfb1a8725c5a339f" 403 110 "HOST:docker.caching.proxy.internal" "PROXY-HOST:d3uo42mtx6z2cr.cloudfront.net" "UPSTREAM:99.86.113.66:443"
I use this command to test the proxy:
env HTTP_PROXY=http://127.0.0.1:3128/ HTTPS_PROXY=http://127.0.0.1:3128/ http_proxy=127.0.0.1:3128 curl https://quay.io/v2/calico/cni/blobs/sha256:2c72d89b5a7fa12faeb77e0fccde73303b21ba3ca65ed614dfb1a8725c5a339f -k -vvv --location
The final call gives me:
// ...
< HTTP/1.1 403 Forbidden
< Server: nginx/1.14.0
< Date: Mon, 06 Jan 2020 08:44:12 GMT
< Content-Type: text/xml
< Content-Length: 110
< Connection: keep-alive
< X-Cache: Error from cloudfront
< Via: 1.1 b2ce71f6c09ab30df63d53e155a1cded.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: LHR61-C1
< X-Amz-Cf-Id: Eudz6NVoqp8lm7Vz-H2uYgqyM18Iv_j72w5BMV_sWdh9N4IOPqvt_w==
<
* Connection #1 to host 127.0.0.1 left intact
<?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code><Message>Access denied</Message></Error>* Closing connection 0
* Closing connection 1
In a normal flow, the redirect response and the binary blob are both cached.
In the scenario where it returns Access denied, the redirect response is still cached, but the binary blob has been removed.
So now I wonder how it was possible that I ended up in this scenario in production. After we restarted the docker-mirror, the problem appeared within hours, so we don't think the cache limit is the problem: the cache was fresh, and we don't pull that many different blobs.
Could it be that the TTLs for the redirect and the binary blob are different? Shouldn't the redirect responses always have a lower TTL than the binary blobs?
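The TTL mismatch can be made concrete: the cached 302's usefulness is bounded by the signature's lifetime, not by nginx's cache TTL. A hypothetical sketch (the URL, timestamps, and parameter values are invented for illustration, assuming a CloudFront-style `Expires` query parameter holding an epoch timestamp):

```shell
# Extract the signature expiry from a signed URL and compute how long the
# signature is valid. If nginx caches the 302 longer than this window, the
# cached redirect outlives its own signature and replays it as a 403.
SIGNED_URL='https://d3uo42mtx6z2cr.cloudfront.net/sha256/2c/2c72d89b?Expires=1578300076&Signature=aaa&Key-Pair-Id=XYZ'
ISSUED=1578299476   # assumed epoch time the redirect was issued
EXPIRES=$(printf '%s\n' "$SIGNED_URL" | sed -n 's/.*[?&]Expires=\([0-9]*\).*/\1/p')
echo "signature valid for $((EXPIRES - ISSUED)) seconds"
```

With these made-up timestamps the window is 600 seconds, which is why the 302 TTL has to stay below the signature lifetime.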
I was able to reproduce the exact error.
docker run -d --rm --name docker_registry_proxy -it -p 0.0.0.0:3128:3128 -v $(pwd)/docker_mirror_cache:/docker_mirror_cache -v $(pwd)/docker_mirror_certs:/ca -e REGISTRIES="k8s.gcr.io gcr.io quay.io your.own.registry another.public.registry" -e AUTH_REGISTRIES="auth.docker.io:dockerhub_username:dockerhub_password your.own.registry:username:password" janarthananv/janaprox
Pulled multiple Docker images to fill the cache volume:
docker pull quay.io/ratpoison/oregano-test-ubuntu:master
docker pull quay.io/ratpoison/oregano-test-ubuntu
docker pull quay.io/bitnami/minideb-extras:jessie-r467-buildpack
docker pull quay.io/bitnami/minideb-extras:jessie-r465-buildpack
docker pull quay.io/ratpoison/oregano-test-fedora
docker pull quay.io/ratpoison/oregano-test-fedora:master
docker pull quay.io/spearow/machine-learning-container-fedora-default:latest
############### Cache volume size reaches the limit
1.2G docker_mirror_cache/
############### Now try to pull the image
root@ip-172-31-31-113:/home/ubuntu# docker pull quay.io/spearow/machine-learning-container-fedora-default:latest
latest: Pulling from spearow/machine-learning-container-fedora-default
d318c91bf2a8: Already exists
19cbf13613ce: Pulling fs layer
a2aea92c871d: Downloading [=================================================> ] 173.6MB/174MB
96e5654a5f16: Downloading
75572a21e283: Waiting
dfcef28d7f07: Waiting
365dd5894506: Waiting
9f217365329e: Waiting
ad2088683fdb: Waiting
error parsing HTTP 403 response body: invalid character '<' looking for beginning of value: "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>AccessDenied</Code><Message>Request has expired</Message><Expires>2020-01-16T15:21:16Z</Expires><ServerTime>2020-01-16T16:19:21Z</ServerTime><RequestId>EE4D644410EDE73E</RequestId><HostId>v2TSdujx4oN9je4Qs4KWJpdBEFtyhTVMguTYwBDaAofPndSrS/+AUGmXjQ3myp8zd1iS1/ouoJw=</HostId></Error>"
So, cache cleaning did not work. After cleaning the cache directory, it worked! The fix for now: use a cron job to clean the disk when it's almost full.
Attached logs from nginx: logs.txt
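The cron-job workaround can be sketched like this (the cache path and size threshold are assumptions; adjust them to your setup). Wiping the whole cache drops the stale signed redirects together with their blobs, so the two can never get out of sync:

```shell
# clean_if_over <dir> <limit_kb>: clear the mirror cache directory once its
# on-disk size exceeds the given limit in kilobytes.
clean_if_over() {
  dir=$1; limit_kb=$2
  used_kb=$(du -sk "$dir" | awk '{ print $1 }')
  if [ "$used_kb" -gt "$limit_kb" ]; then
    rm -rf "$dir"/*
    echo "cleared"
  else
    echo "ok"
  fi
}

# Demo on a throwaway directory. In production, point it at
# /docker_mirror_cache and call it from cron, e.g.:
#   */30 * * * * clean_if_over /docker_mirror_cache 31457280   # 30 GiB
demo=$(mktemp -d)
dd if=/dev/zero of="$demo/blob" bs=1024 count=64 2>/dev/null
clean_if_over "$demo" 16   # ~64 KB used > 16 KB limit, so it clears
ls "$demo"                 # the directory is now empty
```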
Thanks for the investigations.
Maybe it really boils down to mismatched cache eviction: the binary blobs were evicted, but redirects containing expired S3 credentials are still in the cache. At that point it's impossible to repopulate the cache, and too late to get a new signature from upstream.
Could it be that the binary blobs, being bigger, are more prone to eviction due to cache size limits? The nginx docs say otherwise: it removes the least recently used data (see http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_path).
My suggestion is to reduce the TTL on the blob requests that get the signed redirect response:
location ~ ^/v2/(.*)/blobs/ {
# ...
proxy_cache_valid 301 302 307 595s; # s3 signed responses usually expire in 10 minutes, seems to be the case for quay.io
}
If the TTL expires and the signature changes with the renewed request, we will still get a cache hit on the follow-up binary blob request: the proxy_cache_key directive already ignores the request parameters/signature.
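That cache-hit behavior can be illustrated by normalizing the signed URL the way a query-stripping cache key would (the URLs below are invented for illustration): two redirects signed at different times reduce to the same key, so the already-cached blob is reused.

```shell
# strip_query <url>: drop everything from the first '?' on, mimicking a
# cache key built from the path only.
strip_query() { printf '%s\n' "${1%%\?*}"; }

key1=$(strip_query 'https://d3uo42mtx6z2cr.cloudfront.net/sha256/2c/2c72d89b?Expires=1578300076&Signature=aaa')
key2=$(strip_query 'https://d3uo42mtx6z2cr.cloudfront.net/sha256/2c/2c72d89b?Expires=1578300676&Signature=bbb')
[ "$key1" = "$key2" ] && echo "same cache key: $key1"
```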
Also useful would be to allow setting the cache size via an ENV variable, using the entrypoint to generate an include, as is done in other cases.
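A minimal sketch of such an entrypoint fragment (the variable name, file name, and proxy_cache_path parameters are assumptions for illustration, not the project's actual values): render the env var into an nginx include at container start.

```shell
# Render CACHE_MAX_SIZE from the environment into an nginx include so
# max_size can be changed without rebuilding the image. The demo writes to a
# temp dir; a real entrypoint would target something like /etc/nginx/conf.d.
CONF_DIR=$(mktemp -d)
CACHE_MAX_SIZE="${CACHE_MAX_SIZE:-32g}"
cat > "$CONF_DIR/cache_max_size.conf" <<EOF
proxy_cache_path /docker_mirror_cache levels=1:2 keys_zone=cache:10m max_size=${CACHE_MAX_SIZE} inactive=60d;
EOF
cat "$CONF_DIR/cache_max_size.conf"
```

The nginx config would then `include` that generated file instead of hard-coding the size.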
PRs are welcome, otherwise I'll handle this for the next release.
We discovered that Quay.io responses are cached with their signed CloudFront URLs.