Closed MarkEWaite closed 1 month ago
The ci.jenkins.io job that builds the www.jenkins.io web site failed its most recent build with the message:
You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
FYI it broke a https://github.com/jenkinsci/acceptance-test-harness PR build as well, but I was able to successfully retry about an hour and a half later.
@basil can you confirm that the rate limit issue, with the ATH build, was with the "additional" test Docker images (and not the `jenkins/ath` image itself)? The PRs from 2 days ago already have their build logs purged, but I see some test failures on the master branch (build https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/master/1178/) that might be related.
I'm asking so that we can think about an eventual ACP-like setup for Docker Engine on ci.jenkins.io, with a "pull-through" cache as per https://docs.docker.com/docker-hub/mirror/
About www.jenkins.io (I'll focus on ATH afterwards):
* The build scripts are pulling Docker Library (i.e. "Official") images which ARE subject to the rate limit, unlike the `jenkins/*` and `jenkinsciinfra/*` images:
  * https://github.com/jenkins-infra/jenkins.io/blob/387ad60415c543e440cc8be06934559227ab251e/scripts/ruby#L13
  * https://github.com/jenkins-infra/jenkins.io/blob/387ad60415c543e440cc8be06934559227ab251e/scripts/node#L14
  * etc.
* ci.jenkins.io agents only have 2 outbound IPs as per https://github.com/jenkins-infra/azure-net/blob/6637c0b38bf0614335375f92c385a3da452e45e0/gateways.tf#L91
* The DockerHub documentation tells us that anonymous pulls are limited to 100 pulls per 6 hours per IP, so ~200 pulls across both IPs (1 pull == 1 layer OR 1 manifest)
* The pipeline at https://github.com/jenkins-infra/jenkins.io/blob/master/Jenkinsfile never uses any kind of credential to log in to DockerHub (which would increase the available rate limit)
=> it was bound to happen since we moved to NAT gateways a few months ago. Let us open a PR to run `docker login` and push the limit forward.
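As a rough illustration of the arithmetic above, here is a minimal Python sketch. It assumes the `ratelimit-limit` header format that Docker Hub documents for its rate limit preview endpoint (for example `100;w=21600`); the helper names `parse_ratelimit_header` and `shared_budget` are made up for this example and are not part of any Jenkins or Docker tooling.

```python
import re
from typing import NamedTuple


class RateLimit(NamedTuple):
    pulls: int           # number of pulls allowed in the window
    window_seconds: int  # length of the window, in seconds


def parse_ratelimit_header(value: str) -> RateLimit:
    """Parse a header value such as '100;w=21600' (assumed format)."""
    match = re.fullmatch(r"\s*(\d+)\s*;\s*w=(\d+)\s*", value)
    if match is None:
        raise ValueError(f"unexpected rate limit header: {value!r}")
    return RateLimit(int(match.group(1)), int(match.group(2)))


def shared_budget(per_ip: RateLimit, outbound_ips: int) -> int:
    """Upper bound of anonymous pulls per window across all outbound IPs."""
    return per_ip.pulls * outbound_ips


if __name__ == "__main__":
    limit = parse_ratelimit_header("100;w=21600")
    hours = limit.window_seconds // 3600
    print(f"{shared_budget(limit, 2)} anonymous pulls per {hours} hours for 2 outbound IPs")
```

Every agent behind the NAT gateways draws from that same budget, which is why a handful of busy builds can exhaust it.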
@MarkEWaite : https://github.com/jenkins-infra/jenkins.io/pull/7421
Likely the same issue from a quick look, the actual build logs of the docker image aren't archived though
Can you confirm that the rate limit issue, with the ATH build, was with the test "additional" Docker images (and not the jenkins/ath image itself)?
Yes, this was a rate limit error while fetching containers for use during tests. I didn't encounter any problems building or fetching the `jenkins/ath` image itself.
Thanks @basil @timja !
I've opened https://github.com/jenkinsci/acceptance-test-harness/pull/1634 to set up an authenticated Docker Engine during tests
Closing as:
Thanks folks!
Reopening as we saw a collection of `429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit` errors in the builds of jenkinsci/docker-agent and jenkinsci/docker on ci.jenkins.io in the past hour.
A solution to limit this kind of impact would be for us to run registry pull-through caches (see https://docs.docker.com/docker-hub/mirror/) in the ci.jenkins.io agent networks (all VMs and Linux containers)
@basil @timja I'm continuing the discussion from https://github.com/jenkinsci/acceptance-test-harness/pull/1640#issuecomment-2260161080 but here:
I'm not sure how to identify the failure error, I would need help navigating the ATH build and test results. With this I should be more autonomous to find failures, understand them and provide solutions.
`toomanyrequests` (or HTTP/429) errors as all ci.jenkins.io agents share the same outbound IP
Would it be OK if we copied these images under a `jenkinsciinfra/*` (or `jenkins/*`) DockerHub organization so that they avoid the rate limit (as these 2 DockerHub organizations do not enforce rate limits for their images, as per the sponsorship)?
I would rather we didn't do that, unless tags are automatically imported as otherwise it'll be a maintenance nightmare for updating these images.
Could we use a mirror instead? e.g. Azure Container Registry can automatically mirror images that are configured.
I would need help navigating the ATH build and test results.
I think the docker build log is currently not:
Likely needs an improvement so that the exact failure is visible on ci.jenkins.io
Could we use a mirror instead? e.g. Azure Container Registry can automatically mirror images that are configured.
🤔 doesn't the ACR also need to be kept updated (ref. https://learn.microsoft.com/en-us/azure/container-registry/buffer-gate-public-content?tabs=azure-cli#import-images-to-an-azure-container-registry)?
Or did you mean the Artifact Caching feature (https://learn.microsoft.com/en-us/azure/container-registry/container-registry-artifact-cache?pivots=development-environment-azure-portal)? I understand we might want to set up the Docker Engine to use the ACR as a pull-through cache, as mentioned in our comments above: https://docs.docker.com/docker-hub/mirror/
Yes, the artifact cache feature. It was introduced because the Docker rate limit exception for Azure was running out.
Actually it does look like they are archived:
You can get there from the test report: e.g. https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/PR-1645/1/testReport/plugins/AntPluginTest/latest_linux_jdk21_firefox_split5___testWithAntPipelineBlock/
ERROR: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Head "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied
ERROR: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Head "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied
All the failed tests in that build seem to have the same error
Oh, thanks!
This error does not look like a rate limit 😅 Are these commands sharing the docker.sock inside containers? (The message looks like a non-root user trying to execute docker commands.)
Yes it is setup as DinD
🤔 what is the reason to use nested containers?
(btw DinD is a nightmare to configure regarding `docker login`, but at least it explains why my PR did not seem to work, as it only sets up the outer Docker Engine.)
I see https://github.com/jenkinsci/acceptance-test-harness/blob/4904fec29f49dedca64214757f8a7898ffa9a329/ath-container.sh#L37 and it looks like it is not DinD (i.e. a nested container engine) but DonD (Docker on Docker, i.e. sharing the socket). Is my understanding correct? => I'm not sure if the authentication is expected to work though (as it might be on the client side).
Yes, your understanding is correct.
I'm not sure either; it would need testing.
If it is DonD, then the ACR will be a good solution, as the pull-through cache setup is on the engine side \o/
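To illustrate the client-side concern raised above, here is a small Python sketch. It assumes the Docker CLI stores registry credentials in the client's own `config.json` under an `auths` key when no credential helper is configured. The user name and token are placeholders, and `client_auth_config` and `has_credentials` are hypothetical helpers, not part of the ATH scripts.

```python
import base64
import json

DOCKER_HUB = "https://index.docker.io/v1/"


def client_auth_config(username: str, token: str, registry: str = DOCKER_HUB) -> dict:
    """Build the 'auths' section a Docker client keeps in its config.json
    (assuming no credential helper is in use)."""
    encoded = base64.b64encode(f"{username}:{token}".encode()).decode()
    return {"auths": {registry: {"auth": encoded}}}


def has_credentials(config: dict, registry: str = DOCKER_HUB) -> bool:
    """Return True if this client config carries credentials for the registry."""
    return bool(config.get("auths", {}).get(registry, {}).get("auth"))


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    outer = client_auth_config("example-user", "example-token")
    # A container that only mounts /var/run/docker.sock starts without this file.
    inner = {}
    print(json.dumps(outer, indent=2))
    print("outer client authenticated:", has_credentials(outer))
    print("inner client authenticated:", has_credentials(inner))
```

If that assumption holds, a `docker login` run by the outer client would not help a client inside the container, whereas a mirror configured on the engine applies to every client talking to that socket.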
But I fail to see the relation between these errors and the rate limit :|
Might be this fix that was just pushed: https://github.com/jenkinsci/acceptance-test-harness/pull/1645/commits/04f64ef0f7a13ca4ae7c5fe28be31e51b434e07a
Ow yeah, this change might fix it!
#0 building with "default" instance using docker driver
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 3.19kB done
#1 DONE 0.0s
#2 [internal] load metadata for docker.io/library/debian:bullseye
#2 ERROR: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/library/debian/manifests/sha256:7aef2e7d061743fdb57973dac3ddbceb0b0912746ca7e0ee7535016c38286561: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
------
> [internal] load metadata for docker.io/library/debian:bullseye:
------
Dockerfile:2
--------------------
1 | # Sets up
2 | >>> FROM debian:bullseye
3 |
4 | # Viewvc is not part of bullseye repos anymore but oldstable https://github.com/viewvc/viewvc/issues/310
--------------------
ERROR: failed to solve: debian:bullseye: failed to resolve source metadata for docker.io/library/debian:bullseye: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/library/debian/manifests/sha256:7aef2e7d061743fdb57973dac3ddbceb0b0912746ca7e0ee7535016c38286561: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
#0 building with "default" instance using docker driver
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 3.19kB done
#1 DONE 0.0s
#2 [internal] load metadata for docker.io/library/debian:bullseye
#2 ERROR: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/library/debian/manifests/sha256:907e428c7d1dd4e3a2458d22da8193e69878d3a23761d12ef9cd1a1238214798: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
------
> [internal] load metadata for docker.io/library/debian:bullseye:
------
Dockerfile:2
--------------------
1 | # Sets up
2 | >>> FROM debian:bullseye
3 |
4 | # Viewvc is not part of bullseye repos anymore but oldstable https://github.com/viewvc/viewvc/issues/310
--------------------
ERROR: failed to solve: debian:bullseye: failed to resolve source metadata for docker.io/library/debian:bullseye: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/library/debian/manifests/sha256:907e428c7d1dd4e3a2458d22da8193e69878d3a23761d12ef9cd1a1238214798: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
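For anyone who needs to tell these failures apart from unrelated ones (such as the Docker socket permission error earlier in this thread), a minimal Python sketch of a log filter could look like the following. The two markers come from the logs above; the function names are invented for this example and the sample lines are abbreviated.

```python
import re
from typing import List

# Markers seen in the build logs above; other registries may word this differently.
RATE_LIMIT_PATTERN = re.compile(r"429 Too Many Requests|toomanyrequests", re.IGNORECASE)


def rate_limited_lines(log_text: str) -> List[str]:
    """Return the log lines that look like registry rate limit errors."""
    return [line for line in log_text.splitlines() if RATE_LIMIT_PATTERN.search(line)]


def is_rate_limited(log_text: str) -> bool:
    """True if at least one line of the log matches the rate limit markers."""
    return bool(rate_limited_lines(log_text))


if __name__ == "__main__":
    # Abbreviated sample lines, not a full build log.
    sample = "\n".join([
        "#2 [internal] load metadata for docker.io/library/debian:bullseye",
        "#2 ERROR: unexpected status code: 429 Too Many Requests - Server message: toomanyrequests",
        "ERROR: permission denied while trying to connect to the Docker daemon socket",
    ])
    for line in rate_limited_lines(sample):
        print(line)
```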
Thanks @basil for the details. In order to tackle these HTTP/429 errors, I propose this course of action:
`docker login` pipeline steps on ATH
=> once these setups are in place, we'll look at the result
@dduportal and I got the ACR option working and have tested on ci.jenkins.io.
@dduportal is going to finish off the Terraform automation and update the JCasC config.
It looks like our users aren't rate limited and were probably hitting some anti-abuse protection. This should help with that and is expected to get rid of any rate limiting issues.
It also means that nothing on ci.jenkins.io Azure needs to log in anymore, as the Docker daemons are going to have a registry mirror set to point at the ACR cache
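A minimal sketch of what pointing a daemon at a mirror can involve, assuming the standard `registry-mirrors` key of the Docker daemon configuration file (`daemon.json`). The mirror URL below is a placeholder and not the actual ci.jenkins.io ACR endpoint; `with_registry_mirror` is a hypothetical helper, and the real setup described above is handled through Terraform and JCasC rather than a script like this.

```python
import json


def with_registry_mirror(daemon_config: dict, mirror_url: str) -> dict:
    """Return a copy of a Docker daemon config with mirror_url listed under
    'registry-mirrors', leaving the other settings untouched."""
    updated = dict(daemon_config)
    mirrors = list(updated.get("registry-mirrors", []))
    if mirror_url not in mirrors:
        mirrors.append(mirror_url)
    updated["registry-mirrors"] = mirrors
    return updated


if __name__ == "__main__":
    existing = {"log-driver": "json-file"}
    # Placeholder URL, for illustration only.
    print(json.dumps(with_registry_mirror(existing, "https://example.azurecr.io"), indent=2))
```

Because the setting lives in the engine, clients that share the socket benefit from it without any `docker login` of their own.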
Update:
We can now roll back the `docker login` steps, along with all the "pullonly" logic
All changes have been reverted, but I'll keep this issue open until the 13th in case we see other issues
Update:
My only concern is that some images or tags are still absent unless we explicitly `docker pull` them with a ci.jenkins.io pipeline replay. I don't see any errors in the Docker Engine logs, so I guess there might be a slight delay between an initial request for a new image reference (which fails as it's not cached yet, so Docker CE falls back to DockerHub) and the second try, once the "ACR cache rule" routine has collected the image tags in the ACR.
Does it make sense @timja? Have you already seen this behavior in your own infrastructure?
Hmm, not sure. We use it slightly differently and explicitly use the cached version.
It won’t show up in the cache unless one pull has been completed: https://learn.microsoft.com/en-us/azure/container-registry/container-registry-artifact-cache?pivots=development-environment-azure-portal#limitations
But if it’s increasing in size it’s definitely caching some
Closing as we did not have any more errors. Feel free to reopen if you see some
Service(s)
ci.jenkins.io
Summary
The ci.jenkins.io job that builds the www.jenkins.io web site failed its most recent build with the message:
I've restarted the build in hopes that it will not hit the rate limit.
Reproduction steps