Closed by spiffxp 3 years ago
/assign @BenTheElder
/sig testing
/priority important-soon
We should survey soon and evaluate priority based on that.
/area images
/area jobs (for changing jobs)
/area prow (for the idea of a pull-through cache)
If all the images used are here, there are only 8 images; it seems easy to move them to gcr.io:
configs[BusyBox] = Config{dockerLibraryRegistry, "busybox", "1.29"}
configs[GlusterDynamicProvisioner] = Config{dockerGluster, "glusterdynamic-provisioner", "v1.0"}
configs[Httpd] = Config{dockerLibraryRegistry, "httpd", "2.4.38-alpine"}
configs[HttpdNew] = Config{dockerLibraryRegistry, "httpd", "2.4.39-alpine"}
configs[Nginx] = Config{dockerLibraryRegistry, "nginx", "1.14-alpine"}
configs[NginxNew] = Config{dockerLibraryRegistry, "nginx", "1.15-alpine"}
configs[Perl] = Config{dockerLibraryRegistry, "perl", "5.26"}
configs[Redis] = Config{dockerLibraryRegistry, "redis", "5.0.5-alpine"}
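Each Config above resolves to registry/name:version. As a tiny sketch of the join (the dockerLibraryRegistry value is an assumption inferred from the docker.io/library/... pulls counted later in the thread, not read from the source):

```shell
# Hypothetical expansion of a Config entry to a full image reference.
# dockerLibraryRegistry's value is assumed from the pull logs.
dockerLibraryRegistry="docker.io/library"
name="busybox"
version="1.29"
echo "${dockerLibraryRegistry}/${name}:${version}"
# prints docker.io/library/busybox:1.29
```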
@aojea yeah we should probably start there, and mirror these images used as part of e2e tests into k8s.gcr.io (FYI @dims @thockin). I would probably put them in the same repo we're putting other e2e test images
I took a swag at the images pulled by kubelets during a run of https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default. It's shell, and I haven't verified whether the specific log line(s) this pulls out represent dockerhub's definition of a pull, or whether it's just the first "real" pull that counts.
$ gsutil cp -r gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce/1313838937554292736/artifacts e2e-gce-default
$ ag pulled | sed -e 's/.*message=.*image \\"\([^ ]*\)\\".*/\1/' | grep -v 'rejected\|recycler' | sort | uniq -c | sort -rg | grep -v gcr.io
488 docker.io/library/busybox:1.29
143 docker.io/library/httpd:2.4.38-alpine
69 docker.io/library/nginx:1.14-alpine
16 docker.io/library/httpd:2.4.39-alpine
1 docker.io/gluster/glusterdynamic-provisioner:v1.0
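For reference, the sed extraction above can be exercised on a synthetic log line; the escaped-quote message format here is an assumption based on the pipeline, not a verified kubelet log line:

```shell
# Synthetic kubelet-style log line containing the \"image\" reference
# the sed expression above targets (format is an assumption).
line='I1007 kubelet ... message="... pulled image \"docker.io/library/busybox:1.29\" ..."'
# Same extraction as the pipeline above, applied to the sample line.
printf '%s\n' "$line" | sed -e 's/.*message=.*image \\"\([^ ]*\)\\".*/\1/'
# prints docker.io/library/busybox:1.29
```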
Here's what kubelet pulls for each node in the k8s-infra-prow-build cluster.
$ gcloud logging read \
    'resource.type="k8s_pod" jsonPayload.message=~"Successfully pulled image"' \
    --project=k8s-infra-prow-build \
    --freshness=7d \
    --format="value(jsonPayload.message)" \
    | tee k8s-infra-prow-build-node-pulls.txt
$ <k8s-infra-prow-build-node-pulls.txt grep -v gcr.io | cut -d' ' -f4 | sort | uniq -c | sort -rn
2094 "alpine:3.6" # averages to 74.7 per 6h
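The counting idiom used here (sort so duplicates are adjacent, count with uniq -c, then order by count) can be sketched on sample input standing in for the pull log:

```shell
# Count duplicate lines and order by frequency: sort makes duplicates
# adjacent for uniq -c, then sort -rn puts the highest count first.
printf '%s\n' alpine:3.6 alpine:3.6 golang:1.15 alpine:3.6 \
  | sort | uniq -c | sort -rn
# highest-count image (alpine:3.6, count 3) prints first
```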
This doesn't catch pulls that are done as part of jobs that run on nodes, still looking...
Looks like we could stand to mirror golang, node, python and alpine too.
$ cd ~/w/kubernetes/test-infra/config/jobs
$ ag image: | sed -e 's/.*image: //g' | sort | uniq -c | sort -rn | grep -v gcr.io
39 golang:1.13
7 node:12
7 ${dind_image}
6 golang:1.15
5 golang:1.14
4 python:3.7
4 python:3.6
4 golang:1.12
4 alpine
3 golang:1.15.2
3 golang:1.15.0
3 golang:1.11.5
2 {{e2e_image}}
2 golang:1.13.10
2 golang:1.12.5
1 quay.io/kubespray/kubespray:v2.13.0
1 quay.io/kubermatic/yamllint:0.1
1 quay.io/k8s/release-tools:latest
1 quay.io/k8s/release-tools-centos:latest
1 python:3.8
1 python:3.5
1 python:3
1 python:2.7
1 praqma/linkchecker:v9.3.1-154-g22449abb-10
1 pouchcontainer/pouchlinter:v0.1.2
1 kubernetes/kops/build-grid.py:270: if kops_image:
1 golangci/golangci-lint:v1.29.0
1 golangci/golangci-lint:v1.26
1 golangci/golangci-lint:v1.21.0
1 golangci/golangci-lint:latest
1 golang:1.14.4
1 golang:1.14.1
1 golang:1.11
1 docker.io/haskell:8.6.5
1 cypress/base:12.1.0
My concern is we probably don't want to end up paying to serve these much more commonly used images to non-kubernetes projects.
These might be available on mirror.gcr.io https://cloud.google.com/container-registry/docs/pulling-cached-images
Note that the most popular images are in mirror.gcr.io which our CI nodes and our clusters should be generally using*
* Not kind, and possibly not docker-in-docker; the latter is easy to move to this if it's not already.
Something like glusterdynamic-provisioner is most likely not in it.
@spiffxp @claudiubelu Turns out busybox and golang are in mirror.gcr.io:
gcloud container images list --repository=mirror.gcr.io/library
NAME
mirror.gcr.io/library/alpine
mirror.gcr.io/library/bash
mirror.gcr.io/library/buildpack-deps
mirror.gcr.io/library/busybox
mirror.gcr.io/library/centos
mirror.gcr.io/library/chronograf
mirror.gcr.io/library/consul
mirror.gcr.io/library/couchdb
mirror.gcr.io/library/debian
mirror.gcr.io/library/docker
mirror.gcr.io/library/elasticsearch
mirror.gcr.io/library/flink
mirror.gcr.io/library/ghost
mirror.gcr.io/library/golang
mirror.gcr.io/library/haproxy
mirror.gcr.io/library/hello-world
mirror.gcr.io/library/httpd
mirror.gcr.io/library/kong
mirror.gcr.io/library/mariadb
mirror.gcr.io/library/matomo
mirror.gcr.io/library/maven
mirror.gcr.io/library/memcached
mirror.gcr.io/library/mongo
mirror.gcr.io/library/mongo-express
mirror.gcr.io/library/mysql
mirror.gcr.io/library/nginx
mirror.gcr.io/library/node
mirror.gcr.io/library/openjdk
mirror.gcr.io/library/percona
mirror.gcr.io/library/perl
mirror.gcr.io/library/php
mirror.gcr.io/library/postgres
mirror.gcr.io/library/python
mirror.gcr.io/library/rabbitmq
mirror.gcr.io/library/redis
mirror.gcr.io/library/ruby
mirror.gcr.io/library/solr
mirror.gcr.io/library/sonarqube
mirror.gcr.io/library/telegraf
mirror.gcr.io/library/traefik
mirror.gcr.io/library/ubuntu
mirror.gcr.io/library/vault
mirror.gcr.io/library/wordpress
mirror.gcr.io/library/zookeeper
gcloud container images list-tags mirror.gcr.io/library/busybox
DIGEST TAGS TIMESTAMP
c9249fdf5613 latest 2020-10-14T12:07:34
2ca5e69e244d 2020-09-09T03:38:02
fd4a8673d034 1.31,1.31.1 2020-06-02T23:19:57
dd97a3fe6d72 1.31.0 2019-09-04T21:20:16
e004c2cc521c 1.29 2018-12-26T09:20:43
@spiffxp @claudiubelu Turns out busybox and golang are in mirror.gcr.io:
Interesting. But we actually have a new type of issue I didn't think about at the meeting: support for multiple architecture types. And it seems like I'm right:
docker run --rm mplatform/mquery mirror.gcr.io/library/busybox:1.29
Unable to find image 'mplatform/mquery:latest' locally
latest: Pulling from mplatform/mquery
db6020507de3: Pull complete
f11a2bcbeb86: Pull complete
Digest: sha256:e15189e3d6fbcee8a6ad2ef04c1ec80420ab0fdcf0d70408c0e914af80dfb107
Status: Downloaded newer image for mplatform/mquery:latest
Image: mirror.gcr.io/library/busybox:1.29
* Manifest List: No
* Supports: amd64/linux
mirror.gcr.io/library/busybox:1.29 is not a manifest list, unlike the dockerhub counterpart. It's just a linux/amd64 image. This means that we'll still have problems for people trying to test other architecture types. I know that someone is interested in running s390x. @rajaskakodkar
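One way to make the same check without mquery is to look at the manifest's mediaType: a multi-arch image advertises a manifest.list type. This sketch parses a captured sample rather than hitting the registry (the trimmed JSON is illustrative, not a real response):

```shell
# Trimmed-down mediaType from a registry manifest response (sample data).
# A manifest list means multi-arch; a plain manifest means single-platform.
manifest='{"mediaType":"application/vnd.docker.distribution.manifest.list.v2+json"}'
case "$manifest" in
  *manifest.list*) echo multi-arch ;;
  *)               echo single-arch ;;
esac
# prints multi-arch
```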
support for multiple architecture types
naive question: do you think those CIs on other architectures can hit dockerhub limits?
Good question, actually.
Doing a grep in test-infra/config/jobs, I can see:
Then, there are these boards:
https://testgrid.k8s.io/conformance-arm https://testgrid.k8s.io/conformance-ppc64le https://testgrid.k8s.io/conformance-s390x https://testgrid.k8s.io/sig-node-arm64 https://testgrid.k8s.io/sig-node-ppc64le
The conformance- boards have periodic jobs. Not sure if sig-node- jobs are periodic, but they're all running conformance tests.
But from the docker images you've listed, not all of them are commonly used:
[sig-storage] Dynamic Provisioning [k8s.io] GlusterDynamicProvisioner should create and delete persistent volumes [fast]
, which only runs on gke. That leaves us with 5 images that are being used in most conformance runs: busybox, nginx, nginx-new, httpd, httpd-new. It also depends on how many nodes are in the cluster: if there are 2 nodes in the cluster, the images will be pulled twice (we use 2 nodes for Windows test runs, for example).
After that, we'd have to take a look at all the image building jobs. There are quite a few of them, unfortunately: https://cs.k8s.io/?q=BASEIMAGE&i=nope&files=&repos= . Typically, those are postsubmit jobs and they don't run that often, but I suspect there's a higher chance for multiple jobs to run at the end of a cycle.
We could switch to gcr.io base images?
After that, we'd have to take a look at all the image building jobs
These all end up invoking GCB builds which is where the image pulls would happen. I am less concerned about these, mostly because I doubt they run at a volume that would cause any rate limiting even if they all theoretically ran on the same instance.
I am more concerned about the long tail of jobs on prow's build cluster causing a specific node to get rate-limited.
That makes sense, that should be less concerning for us then. Although, GCB could have the same issues since we're not the only ones building images through GCB.
I've read the dockerhub FAQ, and I saw this:
Will Docker offer dedicated plans for open source projects?
Yes, as part of Docker’s commitment to the open source community, we will be announcing
the availability of new open source plans. To apply for an open source plan, complete our
application at: https://www.docker.com/community/open-source/application.
This should apply to us too, right? If so, have we applied yet?
In any case, we can configure the tests to use the mirror.gcr.io images by default (which are only for linux/amd64), and configure the other architecture jobs to use dockerhub for now. The tests can be configured to use different registries via a KUBE_TEST_REPO_LIST environment variable [1], which points to a yaml file specifying which registries to use. [2]
[1] https://github.com/kubernetes/test-infra/blob/882eb3f17e8e4f1344f7198ee161fb51ba471f2f/kubetest/aksengine.go#L1257 [2] https://github.com/kubernetes/test-infra/blob/b14c4896f1d3e14f504607efccf25e9916451e54/config/jobs/kubernetes-sigs/sig-windows/sig-windows-config.yaml#L15
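As a sketch, such a repo-list file might look like the following; the key names are assumed to mirror the registry variables in the e2e image config (dockerLibraryRegistry etc.), so treat them as illustrative rather than the exact schema:

```shell
# Hypothetical KUBE_TEST_REPO_LIST file; key names are assumptions
# based on the registry variables in the Config entries above.
cat > repo-list.yaml <<'EOF'
dockerLibraryRegistry: mirror.gcr.io/library
dockerGluster: docker.io/gluster
EOF
export KUBE_TEST_REPO_LIST="$PWD/repo-list.yaml"
cat "$KUBE_TEST_REPO_LIST"
```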
The FAQ also mentions a pull-through cache mirror registry. However, it doesn't mention how it behaves with manifest lists. My first guess is that it doesn't handle manifest lists, and just pulls / caches the images for the platform it's on. This would mean that if we tried this option, we'd need different mirrors, one for each architecture type we're currently testing.
We should not be referencing mirror.gcr.io directly but configuring docker and containerd to do so https://cloud.google.com/container-registry/docs/pulling-cached-images#configure
k8s-infra-prow-build nodes have containerd set up to use it:
spiffxp@gke-prow-build-pool4-2020082817590115-6b0f3325-c0r9:~$ cat /etc/containerd/config.toml
# ...snip...
[plugins.cri.registry.mirrors."docker.io"]
endpoint = ["https://mirror.gcr.io","https://registry-1.docker.io"]
as does k8s-prow-builds (aka the 'default' build cluster)
bentheelder@gke-prow-default-pool-cf4891d4-0178:~$ cat /etc/containerd/config.toml
# ... snip ...
[plugins.cri.registry.mirrors."docker.io"]
endpoint = ["https://mirror.gcr.io","https://registry-1.docker.io"]
it is unclear to me whether this is picked up by any pods that try to explicitly run docker commands
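For pods that run their own docker daemon (e.g. docker-in-docker), the dockerd-side equivalent of the containerd mirror config above is the registry-mirrors key in daemon.json, which would need to be configured inside the pod's daemon, not just on the node. A sketch (written to a local stand-in path to stay self-contained; the real file is /etc/docker/daemon.json and dockerd must be restarted to pick it up):

```shell
# registry-mirrors is dockerd's mirror setting, analogous to the
# containerd [plugins.cri.registry.mirrors] config shown above.
mkdir -p ./etc-docker   # stand-in for /etc/docker in this sketch
cat > ./etc-docker/daemon.json <<'EOF'
{
  "registry-mirrors": ["https://mirror.gcr.io"]
}
EOF
cat ./etc-docker/daemon.json
```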
Clusters stood up using kube-up.sh also have this enabled by default: https://github.com/kubernetes/kubernetes/blob/ededd08ba131b727e60f663bd7217fffaaccd448/cluster/gce/config-default.sh#L163-L164
and while under test: https://github.com/kubernetes/kubernetes/blob/ededd08ba131b727e60f663bd7217fffaaccd448/cluster/gce/config-test.sh#L175-L176
which causes mirror.gcr.io to be set as a registry mirror url here: https://github.com/kubernetes/kubernetes/blob/ededd08ba131b727e60f663bd7217fffaaccd448/cluster/gce/util.sh#L313-L315
which is then set up as the registry mirror here: https://github.com/kubernetes/kubernetes/blob/ededd08ba131b727e60f663bd7217fffaaccd448/cluster/gce/gci/configure-helper.sh#L1470-L1475
we are pausing enforcement of the changes to image retention until mid 2021
https://www.docker.com/blog/docker-hub-image-retention-policy-delayed-and-subscription-updates/
Image retention isn't the concern here, but good to know. They're still planning to move forward with pull rate-limits
@claudiubelu mentioned this issue in the weekly SIG-Windows meeting last week. I work closely with the Docker Hub team and wanted to call out a few things here:
The options are to [i] authenticate pulls (e.g. via docker login) with a DockerHub user ID and [ii] have the DockerHub user IDs exempted from the pull limits of authenticated users (200 pulls / 6 hr rather than 100 pulls / 6 hr) through one of the following:
a) Subscribe each of the DockerHub IDs to an individual Pro Plan on Docker Hub ($60/year/DockerHub user ID).
b) Create a DockerHub Team and make the DockerHub IDs members of the Team (total: $300/year for first 5 DockerHub users and $7/mo for each additional DockerHub user).
c) Work with Docker Inc. to get an exception for the DockerHub Team in (b) so that members of the team are exempted from the pull rate-limit.
Note that the GCE e2e clusters (created during testing) are generally using ephemeral IPs, so we have no guarantee that a previous user wasn't performing many pulls.
I looked and I think the build clusters at least aren't NATed and the VMs are relatively long-lived.
There's also places other than our CI to consider, e.g. third party CI. We may not want to continue using images from dockerhub while mitigating for ourselves if others have to work around it too, long term. (vs. images on e.g. k8s.gcr.io where users are not limited)
Changes are progressively rolling out, and we are in wait-and-see mode at this point. To answer @ddebroy's questions / recap:
Per https://www.docker.com/increase-rate-limits
The rate limits will be progressively lowered to a final state of 100 container image requests per six hours for anonymous usage, and 200 container image requests per six hours for free Docker accounts. Image requests exceeding these limits will be denied until the six hour window elapses.
Temporary full enforcement window (100 per six hours for unauthenticated requests, 200 per six hours for free accounts): November 2, 9am-10am Pacific Time.
If we saw an increase in errors during this window, we should take note: hitting the 100-pull limit within that single hour would mean we're running at roughly 6x the rate limit.
Per https://www.docker.com/blog/checking-your-current-docker-pull-rate-limits-and-status/
Requests to Docker Hub now include rate limit information in the response headers for requests that count towards the limit. These are named as follows:
- RateLimit-Limit
- RateLimit-Remaining
Annoyingly, this means that we don't know about the status of our quota unless we consume it.
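Per the linked blog post, the headers can be inspected by requesting Docker's ratelimitpreview/test image with an anonymous token. This sketch parses a captured response so it runs offline; the header names come from the blog post, but the exact values here are sample data:

```shell
# Parse the remaining quota out of captured response headers.
# The real check would be something like:
#   TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
#   curl -s --head -H "Authorization: Bearer $TOKEN" \
#     https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest
headers='ratelimit-limit: 100;w=21600
ratelimit-remaining: 87;w=21600'
printf '%s\n' "$headers" | sed -n 's/^ratelimit-remaining: *\([0-9]*\).*/\1/p'
# prints 87
```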
Per https://www.docker.com/blog/expanded-support-for-open-source-software-projects/
We got great feedback from our extensive user base, and adjusted our policies to delay the policies on image retention until mid-2021.
For the approved, non-commercial, open source projects, we are thrilled to announce that we will suspend data pull rate restrictions, where no egress restrictions will apply to any Docker users pulling images from the approved OSS namespaces.
Does this mean this issue might not be a problem anymore? (for things like kind, for instance?)
The OSS projects agreement has some strings attached:
Joint Marketing Programs
While the publisher retains the Open Source project status, the Publisher agrees to -
Become a Docker public reference for press releases, blogs, webinars, etc
Create joint blogs, webinars and other marketing content
Create explicit links to their Docker Hub repos, with no ‘wrapping’ or hiding sources of their images
Document that Docker Engine or Docker Desktop are required to run their whitelisted images
Give Docker full attribution
Source: I applied (for a non-kubernetes project) and got an email.
Huh... thanks @howardjohn
Become a Docker public reference for press releases, blogs, webinars, etc
Sounds reasonable?
Create joint blogs, webinars and other marketing content
I think in our case that might require steering's input ...
Create explicit links to their Docker Hub repos, with no ‘wrapping’ or hiding sources of their images
Seems reasonable.
Document that Docker Engine or Docker Desktop are required to run their whitelisted images
... versus podman, cri-o, containerd etc.? 😕
Give Docker full attribution
Seems unclear. 🤔
/me wearing my own cap (not steering)
LOL. nice try docker!
@howardjohn Yeah I am waiting on an answer of whether the benefit we get for this cost is unlimited pulls of other images by our CI. We don't publish to dockerhub, and my read of the benefits was unlimited pulls of our images for other users.
@aledbf if every image we happen to pull from dockerhub belongs to a project that has applied and been exempted from rate limiting, that may lessen the impact
Docker has removed the bottom 2 bullets on @howardjohn's post from all public communications, and replaced it with -
I think that this is a better description of the kind of support that we are looking for...
@spiffxp , I have your response on my to do list, and will try to get back to you before the end of today.
We've gotten reports from downstream users who don't use mirroring in their CI that kubernetes e2e tests are occasionally hitting rate limits. Opened https://github.com/kubernetes/kubernetes/issues/97027 to cover moving those images/tests to a community-owned registry
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
https://github.com/kubernetes/kubernetes/issues/97027 has probably addressed the bulk of this for kubernetes/kubernetes
We can enforce that jobs should be using k8s-staging or k8s.gcr.io images
Beyond that I think any piecemeal followup for specific subprojects should be considered out of scope for this, unless anyone has suggestions
I think we've only seen very minor issues upstream in Kubernetes, unclear that we need to prioritize anything further here.
For downstream concerns, we will finish migrating any images used by e2e.test in kubernetes.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/close
Agree that anything further can be opened up as followup issues
@spiffxp: Closing this issue.
/milestone v1.21
Creating this as a placeholder based on discussion in Slack and yesterday's SIG Testing meeting. I'll sketch in what I remember but will defer to @BenTheElder for a plan.
Dockerhub is going to rate-limit pulls starting Nov 1st. See https://www.docker.com/pricing/resource-consumption-updates
Pull limit is:
Ideas:
We will likely need to fan out audit/change jobs to all SIGs / subprojects.
I think to start with we should ensure merge-blocking kubernetes/kubernetes jobs are safe, since they represent significant CI volume