goharbor / harbor

An open source trusted cloud native registry project that stores, signs, and scans content.
https://goharbor.io
Apache License 2.0

Harbor proxy cache fails pull instead of returning locally cached copy when upstream returns 429 #19899

Open ricardojdsilva87 opened 8 months ago

ricardojdsilva87 commented 8 months ago

Hello, this is a reopening of the ticket

Recently we have been hit with several 429 Too Many Requests errors coming from Harbor. After digging in, we saw that all of these errors came from the proxy cache project that we have set up against docker.io to serve as a cache for docker.io container images.

All of these errors happen only for images that rely on tags that might be updated very often, such as latest or even alpine3.15. In this case it was for amazoncorretto.

The error shown is caused by a protection on DockerHub that, unlike the pull limit, does not depend on whether users are authenticated. It seems that this block can happen at any time and is tied only to the source IP; the DockerHub infrastructure blocks these calls as it sees fit.

We use Datadog to monitor our infrastructure and we can see that some of these errors happen from time to time: [image]

And the error logs from Harbor: [image]

This is an issue because it seems that Harbor does not serve the cached image/tag/layer when the check against DockerHub fails. As a result, the container cannot start because the image cannot be downloaded. We suspect that this might also happen for tags other than latest, since Harbor needs to check whether the cached layer is still the same on DockerHub.
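To make the requested behaviour concrete, here is a minimal sketch in Go of a "serve the cached copy when upstream returns 429" decision. It is not Harbor's actual code: `cachedManifest` and `resolveManifest` are hypothetical names made up for illustration, and a real proxy would also have to handle layers, authentication and other upstream errors.

```go
package main

import "net/http"

// cachedManifest is a hypothetical stand-in for a manifest held in the local cache.
type cachedManifest struct {
	Digest string
}

// resolveManifest decides what a proxy cache could return for a tag.
// upstreamStatus is the HTTP status of the check against the upstream registry,
// upstreamDigest is the digest the upstream reported (if any), and cached is
// the local copy or nil. It returns the digest to serve and the status to
// send to the client.
func resolveManifest(upstreamStatus int, upstreamDigest string, cached *cachedManifest) (string, int) {
	switch {
	case upstreamStatus == http.StatusOK:
		// Upstream answered: trust its digest.
		return upstreamDigest, http.StatusOK
	case upstreamStatus == http.StatusTooManyRequests && cached != nil:
		// Upstream is rate limiting: fall back to the possibly stale local copy.
		return cached.Digest, http.StatusOK
	default:
		// Nothing cached, or a different failure: pass the upstream status through.
		return "", upstreamStatus
	}
}
```

With this policy a 429 from upstream only reaches the client when nothing is cached locally. The trade-off is that a frequently updated tag such as latest may be served stale while the block lasts.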

My question is if there is some kind of protection on Harbor that we could enable, like for example:

If these options cannot be configured via the Helm chart or the UI, this would be a nice feature to implement in the core itself. We might see the same behaviour with other services like quay.io or even gcr.io in the future if they implement the same per-IP protection.

Thanks for the support

stonezdj commented 8 months ago

In the current implementation, Harbor returns 429 to the client when the upstream registry responds with 429. For the sake of stability, we need some enhancement to the fix in https://github.com/goharbor/harbor/pull/18750 to allow users to set a timeframe during which the manifest check against the upstream registry is skipped.

ricardojdsilva87 commented 8 months ago

@stonezdj thanks for the update. We are still seeing some 429 responses, but it seems that the container images can still be pulled and the containers no longer crash. We'll keep monitoring. Adding a cache timeout parameter, as you mentioned, would also be nice. Thanks!

strowi commented 2 weeks ago

Just ran into a similar issue with proxying the trivy-db from ghcr.io. It would be really helpful if this could be circumvented somehow on the Harbor side.