jenkins-infra / helpdesk

Open your Infrastructure related issues here for the Jenkins project

Dockerhub rate limit broke the www.jenkins.io CI build #4192

Closed: MarkEWaite closed this 1 month ago

MarkEWaite commented 1 month ago

Service(s)

ci.jenkins.io

Summary

The ci.jenkins.io job that builds the www.jenkins.io web site failed its most recent build with the message:

```
You have reached your pull rate limit.
You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
```

I've restarted the build in hopes that it will not hit the rate limit.

Reproduction steps

  1. Open the ci.jenkins.io job and review the log file
basil commented 1 month ago

FYI it broke a PR build of https://github.com/jenkinsci/acceptance-test-harness as well, but I was able to successfully retry about an hour and a half later.

dduportal commented 1 month ago

> FYI it broke an https://github.com/jenkinsci/acceptance-test-harness PR build as well, but I was able to successfully retry about an hour and a half later.

@basil can you confirm that the rate limit issue with the ATH build was with the "additional" test Docker images (and not the jenkins/ath image itself)? The PRs from 2 days ago have already had their build logs purged, but I see some test failures on the master branch (build https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/master/1178/) that might be related.

I'm asking because I'm considering an eventual ACP-like setup for Docker Engine with a "pull-through" cache (as per https://docs.docker.com/docker-hub/mirror/) for ci.jenkins.io.

dduportal commented 1 month ago

> The ci.jenkins.io job that builds the www.jenkins.io web site failed its most recent build with the message:
>
> You have reached your pull rate limit.
> You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

About www.jenkins.io (I'll focus on ATH on a second time):

* The build scripts pull Docker Library (i.e. "Official") images, which ARE subject to the rate limit, unlike the `jenkins/*` and `jenkinsciinfra/*` images

  * https://github.com/jenkins-infra/jenkins.io/blob/387ad60415c543e440cc8be06934559227ab251e/scripts/ruby#L13
  * https://github.com/jenkins-infra/jenkins.io/blob/387ad60415c543e440cc8be06934559227ab251e/scripts/node#L14
  * etc.

* ci.jenkins.io agents only have 2 outbound IPs as per https://github.com/jenkins-infra/azure-net/blob/6637c0b38bf0614335375f92c385a3da452e45e0/gateways.tf#L91

* The DockerHub documentation tells us that anonymous pulls are limited to 100 pulls per 6 hours per IP, so ~200 pulls for our 2 IPs (a pull is counted per manifest request, not per layer)

* The pipeline at https://github.com/jenkins-infra/jenkins.io/blob/master/Jenkinsfile never uses any kind of credential to log in to DockerHub (which would increase the available rate limit)

=> it was prone to happen since we moved to NAT gateways a few months ago. Let us open a PR to run `docker login` and push the limit up.

@MarkEWaite : https://github.com/jenkins-infra/jenkins.io/pull/7421
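Putting the bullets above together, the shared anonymous pull budget is easy to sketch (a back-of-the-envelope restatement, not an official figure):

```shell
# Back-of-the-envelope pull budget for ci.jenkins.io, using the numbers above.
pulls_per_ip_per_window=100   # anonymous DockerHub limit per 6-hour window
outbound_ips=2                # NAT gateway outbound IPs
window_hours=6

total_pulls=$((pulls_per_ip_per_window * outbound_ips))
pulls_per_hour=$((total_pulls / window_hours))
echo "budget: ${total_pulls} pulls per ${window_hours}h (~${pulls_per_hour}/hour) shared by ALL agents"
```

A budget of roughly 33 pulls per hour across every concurrent build on the controller makes the intermittent 429s unsurprising.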

timja commented 1 month ago

> FYI it broke an jenkinsci/acceptance-test-harness PR build as well, but I was able to successfully retry about an hour and a half later.
>
> @basil can you confirm that the rate limit issue, with the ATH build, was with the test "additional" Docker images (and not the jenkins/ath image itself)? The 2 days ago PRs have their build logs already purged, but I see some tests failures on the master branch (build ci.jenkins.io/job/Core/job/acceptance-test-harness/job/master/1178) that might be related.

Likely the same issue from a quick look, the actual build logs of the docker image aren't archived though

basil commented 1 month ago

> Can you confirm that the rate limit issue, with the ATH build, was with the test "additional" Docker images (and not the jenkins/ath image itself)?

Yes, this was a rate limit error while fetching containers for use during tests. I didn't encounter any problems building or fetching the jenkins/ath image itself.

dduportal commented 1 month ago

Thanks @basil @timja !

I've opened https://github.com/jenkinsci/acceptance-test-harness/pull/1634 to set up authenticated Docker Engine during tests
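For illustration, a step of that kind usually boils down to an authenticated `docker login` before tests pull images. This is a hedged sketch, not the actual PR content; `DOCKERHUB_USERNAME` and `DOCKERHUB_TOKEN` are hypothetical names for an injected credential:

```shell
# Sketch (assumed variable names, not the actual PR): authenticate the Docker
# CLI before the tests pull images, so pulls count against the account's quota
# instead of the shared anonymous per-IP quota.
if command -v docker >/dev/null 2>&1 && [ -n "${DOCKERHUB_TOKEN:-}" ]; then
  printf '%s' "$DOCKERHUB_TOKEN" | docker login --username "$DOCKERHUB_USERNAME" --password-stdin
  status="logged in"
else
  status="skipped (no docker CLI or no credential)"
fi
echo "docker login: $status"
```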

dduportal commented 1 month ago

Closing as:

Thanks folks!

dduportal commented 1 month ago

Reopening as we saw a collection of `429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit` errors in the builds of jenkinsci/docker-agent and jenkinsci/docker on ci.jenkins.io in the past hour.

Example: https://ci.jenkins.io/job/Packaging/job/docker-agent/job/PR-843/1/pipeline-console/?start-byte=0&selected-node=100#log-170

https://github.com/jenkinsci/docker-agent/pull/844

dduportal commented 1 month ago

A solution to limit this kind of impact would be for us to run registry pull-through caches (see https://docs.docker.com/docker-hub/mirror/) in the ci.jenkins.io agent networks (all VMs and Linux containers)
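On the engine side, a pull-through cache is a one-line daemon setting. A sketch follows; the mirror URL is a placeholder, and the real file lives at /etc/docker/daemon.json (the engine must be restarted after editing it):

```shell
# Sketch: point the Docker Engine at a pull-through registry mirror.
# The mirror URL below is a placeholder. The real file is
# /etc/docker/daemon.json; a scratch path is used here for illustration.
cat > daemon.json <<'EOF'
{
  "registry-mirrors": ["https://dockerhub-mirror.example.com"]
}
EOF
cat daemon.json
```

Because the setting is engine-side, every client sharing that engine benefits, with no per-pipeline changes.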

dduportal commented 1 month ago

@basil @timja I'm continuing the discussion from https://github.com/jenkinsci/acceptance-test-harness/pull/1640#issuecomment-2260161080 but here:

I'm not sure how to identify the failure error; I would need help navigating the ATH build and test results. With that, I should be more autonomous in finding failures, understanding them, and providing solutions.

timja commented 1 month ago

> Would it be OK if we copied these images under a jenkinsciinfra/ (or jenkins/) DockerHub organization to avoid the rate limit (as these 2 DockerHub organizations do not enforce a rate limit for their images as per the sponsorship)?

I would rather we didn't do that, unless tags are automatically imported as otherwise it'll be a maintenance nightmare for updating these images.

Could we use a mirror instead? e.g. Azure Container Registry can automatically mirror images that are configured.


> I would need help navigating the ATH build and test results.

I think the docker build log is currently not archived.

Likely needs an improvement so that the exact failure is visible on ci.jenkins.io

dduportal commented 1 month ago

> Could we use a mirror instead? e.g. Azure Container Registry can automatically mirror images that are configured.

🤔 doesn't the ACR also need to be kept updated (ref. https://learn.microsoft.com/en-us/azure/container-registry/buffer-gate-public-content?tabs=azure-cli#import-images-to-an-azure-container-registry)?

Or did you mean the Artifact Caching feature (https://learn.microsoft.com/en-us/azure/container-registry/container-registry-artifact-cache?pivots=development-environment-azure-portal)? I understand we might want to set up the Docker Engine to use the ACR as a pull-through cache, as mentioned in our comments above: https://docs.docker.com/docker-hub/mirror/ ?

timja commented 1 month ago

Yes, the artifact cache feature; it was introduced because the Docker rate limit exception for Azure was running out.
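For context, ACR cache rules map an upstream repository to a repository in the registry. A hedged sketch of creating one with the Azure CLI follows, based on my reading of the `az acr cache` command group; the registry name is hypothetical, and the command is only composed and printed here, not executed:

```shell
# Hedged sketch: an ACR cache rule for an upstream Docker Library image.
# "cijenkinsio" is a hypothetical registry name; we only print the command.
registry="cijenkinsio"
cmd="az acr cache create --registry ${registry} --name debian \
  --source-repo docker.io/library/debian --target-repo library/debian"
echo "$cmd"
```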

timja commented 1 month ago

Actually it does look like they are archived:

You can get there from the test report: e.g. https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/PR-1645/1/testReport/plugins/AntPluginTest/latest_linux_jdk21_firefox_split5___testWithAntPipelineBlock/

https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/PR-1645/1/testReport/plugins/AntPluginTest/latest_linux_jdk21_firefox_split5___testWithAntPipelineBlock/attachments/docker-SshAgentContainer.build.log

ERROR: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Head "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied
ERROR: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Head "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied

All the failed tests in that build seem to have the same error

dduportal commented 1 month ago

> Actually it does look like they are archived:
>
> You can get there from the test report: e.g. https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/PR-1645/1/testReport/plugins/AntPluginTest/latest_linux_jdk21_firefox_split5___testWithAntPipelineBlock/
>
> https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/PR-1645/1/testReport/plugins/AntPluginTest/latest_linux_jdk21_firefox_split5___testWithAntPipelineBlock/attachments/docker-SshAgentContainer.build.log
>
> ERROR: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Head "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied
>
> All the failed tests in that build seem to have the same error

Oh, thanks!

This error does not look like a rate limit 😅 are these commands sharing the docker.sock inside containers? (The message looks like a non-root user trying to execute docker commands.)

timja commented 1 month ago

Yes, it is set up as DinD

dduportal commented 1 month ago

> Yes it is setup as DinD

🤔 what is the reason to use nested containers?

(btw DinD is a nightmare to configure regarding docker login, but at least it explains why my PR did not seem to work, as it only sets up the outer Docker engine.)

dduportal commented 1 month ago

> Yes it is setup as DinD

I see https://github.com/jenkinsci/acceptance-test-harness/blob/4904fec29f49dedca64214757f8a7898ffa9a329/ath-container.sh#L37 and it looks like it is not DinD (i.e. a nested container engine) but DonD (Docker on Docker, i.e. sharing the socket). Is my understanding correct? => I'm not sure the authentication is expected to work though (as it might be on the client side).
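The distinction matters here: DinD runs a second engine inside the container, while DonD bind-mounts the host socket so the inner docker CLI talks to the host engine. A sketch of the DonD wiring (image name and flags are illustrative; the command is printed rather than run):

```shell
# DonD sketch: bind-mount the host's Docker socket so the docker CLI inside the
# container talks to the *host* engine (no nested engine). Illustrative only;
# "some/image" is a placeholder and the command is printed, not executed.
sock=/var/run/docker.sock
cmd="docker run --rm -v ${sock}:${sock} some/image docker ps"
echo "$cmd"
# Consequence: `docker login` inside the container only writes the client-side
# ~/.docker/config.json, whereas a registry mirror is configured engine-side in
# the host's daemon.json, which is why a mirror works regardless of the client.
```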

timja commented 1 month ago

Yes your understanding is correct.

I'm not sure either; it would need testing

dduportal commented 1 month ago

> Yes your understanding is correct.
>
> I'm not sure either; it would need testing

If it is DonD, then the ACR will be a good solution, as the pull-through cache setup is on the engine side \o/

dduportal commented 1 month ago

> Actually it does look like they are archived:
>
> You can get there from the test report: e.g. https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/PR-1645/1/testReport/plugins/AntPluginTest/latest_linux_jdk21_firefox_split5___testWithAntPipelineBlock/
>
> https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/PR-1645/1/testReport/plugins/AntPluginTest/latest_linux_jdk21_firefox_split5___testWithAntPipelineBlock/attachments/docker-SshAgentContainer.build.log
>
> ERROR: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Head "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied
>
> All the failed tests in that build seem to have the same error

But I fail to see the relation between these errors and the rate limit :|

timja commented 1 month ago

Might be this fix that was just pushed: https://github.com/jenkinsci/acceptance-test-harness/pull/1645/commits/04f64ef0f7a13ca4ae7c5fe28be31e51b434e07a

dduportal commented 1 month ago

> Might be this fix that was just pushed: jenkinsci/acceptance-test-harness@04f64ef

Ow yeah, this change might fix it!

basil commented 1 month ago

https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/PR-1660/2/testReport/junit/plugins/LdapPluginTest/lts_linux_jdk17_firefox_split1___enable_cache/

```
#0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 3.19kB done
#1 DONE 0.0s

#2 [internal] load metadata for docker.io/library/debian:bullseye
#2 ERROR: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/library/debian/manifests/sha256:7aef2e7d061743fdb57973dac3ddbceb0b0912746ca7e0ee7535016c38286561: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
------
 > [internal] load metadata for docker.io/library/debian:bullseye:
------
Dockerfile:2
--------------------
   1 |     # Sets up
   2 | >>> FROM debian:bullseye
   3 |     
   4 |     # Viewvc is not part of bullseye repos anymore but oldstable https://github.com/viewvc/viewvc/issues/310
--------------------
ERROR: failed to solve: debian:bullseye: failed to resolve source metadata for docker.io/library/debian:bullseye: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/library/debian/manifests/sha256:7aef2e7d061743fdb57973dac3ddbceb0b0912746ca7e0ee7535016c38286561: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
#0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 3.19kB done
#1 DONE 0.0s

#2 [internal] load metadata for docker.io/library/debian:bullseye
#2 ERROR: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/library/debian/manifests/sha256:907e428c7d1dd4e3a2458d22da8193e69878d3a23761d12ef9cd1a1238214798: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
------
 > [internal] load metadata for docker.io/library/debian:bullseye:
------
Dockerfile:2
--------------------
   1 |     # Sets up
   2 | >>> FROM debian:bullseye
   3 |     
   4 |     # Viewvc is not part of bullseye repos anymore but oldstable https://github.com/viewvc/viewvc/issues/310
--------------------
ERROR: failed to solve: debian:bullseye: failed to resolve source metadata for docker.io/library/debian:bullseye: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/library/debian/manifests/sha256:907e428c7d1dd4e3a2458d22da8193e69878d3a23761d12ef9cd1a1238214798: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
```
dduportal commented 1 month ago

Thanks @basil for the details. In order to tackle these HTTP 429 errors, I propose this course of action:

=> once these setups are in place, we'll look at the results

timja commented 1 month ago

@dduportal and I got the ACR option working and have tested on ci.jenkins.io.

@dduportal is going to finish off the terraform automation and update the jcasc config.

It looks like our users aren't rate limited and were probably hitting some anti-abuse protection; this should help with that and is expected to get rid of any rate-limiting issues.

It will also mean that anything on ci.jenkins.io Azure doesn't need to log in anymore, as the Docker daemons are going to have a registry mirror configured to point them at the ACR cache

dduportal commented 1 month ago

Update:

We can now roll back the docker login steps, along with all the "pullonly" logic

dduportal commented 1 month ago

All changes have been reverted, but I'll keep this issue open until the 13th in case we see other issues

dduportal commented 1 month ago

Update:

(Screenshot: 2024-08-07 at 16:41:46)

My only concern is that some images or tags are still absent unless we explicitly docker pull them via a ci.jenkins.io pipeline replay. I don't see any errors in the Docker Engine logs, so I guess there might be a slight delay between an initial request for a new image reference (which fails as it's not cached yet, so Docker CE falls back to DockerHub) and a second try once the "ACR cache rule" routine has collected the image tags into the ACR.

Does it make sense @timja ? Have you already seen this behavior in your own infrastructure?

timja commented 1 month ago

Hmm, not sure; we use it slightly differently and explicitly use the cached version.

It won’t show up in the cache unless one pull has been completed: https://learn.microsoft.com/en-us/azure/container-registry/container-registry-artifact-cache?pivots=development-environment-azure-portal#limitations

But if it's increasing in size, it's definitely caching some

dduportal commented 1 month ago

Closing as we did not have any more errors. Feel free to reopen if you see some