BenTheElder closed this issue 1 year ago.
Opened #638 to log timestamps before each step of the promotion process.
/cc @kubernetes-sigs/release-engineering
OK, we have timestamps: https://prow.k8s.io/log?job=post-promo-tools-image-promo-canary&id=1582627370186051584
```
curl 'https://prow.k8s.io/log?job=post-promo-tools-image-promo-canary&id=1582627370186051584' | grep 'kpromo\['
level=info msg="kpromo[1666162785]: PromoteImages start"
level=info msg="kpromo[1666162785]: Parsing manifests"
level=info msg="kpromo[1666162785]: Creating sync context manifests"
level=info msg="kpromo[1666162785]: Getting promotion edges"
level=info msg="kpromo[1666162786]: Creating producer function"
level=info msg="kpromo[1666162786]: Validating staging signatures"
level=info msg="kpromo[1666162806]: Promoting images"
level=info msg="kpromo[1666162818]: Replicating signatures"
level=info msg="kpromo[1666162820]: Signing images"
level=info msg="kpromo[1666162842]: Finish"
```
A note about these times: they are from a run of our canary job, a test promotion using a canary image of the promoter built at head. The canary job promotes an image to two mirrors that have almost nothing in them, but we can still see the latency of the promotion plus the signing operations:
A few notes here:
Looking at the issue "Validating signatures from staging: 20 secs" mentioned in the comment above: the method we measure is `ValidateStagingSignatures`.

It's suspicious that we create a signer in `ValidateStagingSignatures` as well as in `FindSingedEdges`; in particular, the multiple uses of the throttlers in combination with the signing API are kinda hard to debug.

I propose to start with some more verbose logging, tracing the timestamps between log messages, in https://github.com/kubernetes-sigs/promo-tools/pull/640
`signer.IsImageSigned` is indeed slow (~1s per image). I was able to trace the path as follows:

- `IsImageSigned` (k/release-sdk)
- `SignedEntity` (k/release-sdk)
- `remoteGet` (cosign)
- `Get` (cosign)
- `get` (go-containerregistry)
- `makeFetcher` (go-containerregistry)
- `NewWithContext` (go-containerregistry)

That's where it gets interesting: the `ping` definitely takes some time, as does the `bt.refresh()` call. I added some traces around that logic in `NewWithContext` in the same way as introduced in #640, with the output:
```
INFO[11:53:37.348] Check if image is signed diff=0s
INFO[11:53:37.349] NewWithContext start diff=0s
INFO[11:53:37.349] ping start diff=0s
INFO[11:53:37.810] ping end diff=462ms
INFO[11:53:37.810] bt.refresh start diff=0s
INFO[11:53:38.283] bt.refresh end diff=473ms
INFO[11:53:38.283] NewWithContext end diff=0s
INFO[11:53:38.450] NewWithContext start diff=167ms
INFO[11:53:38.450] ping start diff=0s
INFO[11:53:38.564] ping end diff=113ms
INFO[11:53:38.564] bt.refresh start diff=0s
INFO[11:53:38.704] bt.refresh end diff=140ms
INFO[11:53:38.704] NewWithContext end diff=0s
INFO[11:53:38.839] Is signed: false diff=135ms
```
The first call to `NewWithContext` seems to be the critical path. Created a tracking issue in go-containerregistry about that topic: https://github.com/google/go-containerregistry/issues/1466
It makes sense that the first call to set up the transport would need to negotiate auth with the specific registry.
On mobile: are we able to supply and reuse transports on our end? We should probably set up one per worker goroutine and reuse it.
Looks like up through:
We could plumb a WithTransport() (https://github.com/google/go-containerregistry/issues/1466#issuecomment-1284315561) but not:
Working on a faster version of image signature verification in https://github.com/kubernetes-sigs/release-sdk/pull/123
We also need a better release-sdk API for the usage of `VerifyImage`: the method again checks whether the image is signed (not required in our case) and does not reuse the transport.
Edit: Some ideas are now in https://github.com/kubernetes-sigs/release-sdk/issues/124
I think one of the recent changes increased the number of failed runs:
https://prow.k8s.io/?job=*image-promo
We have failures like this:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1583138923981312000
```
time="17:17:43.175" level=fatal msg="run `cip run`: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-east1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fe2e-test-images%2Fvolume%2Frbd%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per region' and limit 'Requests per project per region per minute per region' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'." diff=97ms
```
IIRC, per Jon the quota is pretty reasonable, so 🤔 (I forget the hard number; I've asked.)
Current quota consumption for Artifact Registry
https://cloud.google.com/artifact-registry/quotas
60000 requests per minute in each region or multi-region.
1000 qps per region ... what are we doing that is hitting > 1000 qps in a region?
https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo
besides the sample above:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1582921237355565056
```
time="03:02:13.916" level=fatal msg="run `cip run`: filtering edges: filtering promotion edges: reading registries: getting tag list: Get \"https://us-central1-docker.pkg.dev/v2/\": read tcp 10.4.2.155:33274->142.250.152.82:443: read: connection reset by peer" diff=2ms
```
We just need to tolerate connection blips probably, in that case.
yesterday:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1582781607704530944
```
level=fatal msg="run `cip run`: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-east1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fexperimental%2Fconformance-ppc64le%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per region' and limit 'Requests per project per region per minute per region' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'."
```
before that, last tuesday:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1582405078290010112
```
level=fatal msg="run `cip run`: filtering edges: filtering promotion edges: reading registries: getting tag list: GET https://us-east1-docker.pkg.dev/v2/token?scope=repository%3Ak8s-artifacts-prod%2Fimages%2Fautoscaling%2Fvpa-recommender-s390x%3Apull&service=: TOOMANYREQUESTS: Quota exceeded for quota metric 'Requests per project per region' and limit 'Requests per project per region per minute per region' of service 'artifactregistry.googleapis.com' for consumer 'project_number:388270116193'."
```
us-east1 again
The promoter will look for every image, in every mirror, to see how the target registry compares to the source. I think that's it: we have a ton of images, and the new mirrors multiplied those lookups about 9x, which is also why the logs are now humongous 😓
This quota is per region though, so we have to be exceeding 1000 qps in one region
registry.k8s.io users are consuming a non-zero amount, and that will increase with time, but in us-east1 it's currently only ~10 qps even reaching registry.k8s.io and not all of that will reach AR.
Yeah, I assume we contact the registry multiple times per image, which should clearly be improved. But, for example, checking whether a signature exists already needs 2 requests: one for the digest of the main image and one for the signature digest.
Some enhancements to the signature verification got merged in release-sdk and promo-tools. I still have some ideas on how to improve the signing part.
Amazing, I will kick off a canary run later today and we can see it in action. Thank you Sascha, see you soon! :tada: :smile:
I think we can close this now, especially since we have https://github.com/kubernetes-sigs/promo-tools/issues/662. There could be more room for improvement like reducing the amount of registries.
We should still check for effective resource utilization and either pursue further tuning or reduce the resources configured for the job, as we request a whole node but were barely using it.
Good, can you provide us another set of data for, let's say, the patch promotions next week?
Can get a current sample as soon as I'm back at my desk, and thank you for working on this
Sorry .... today did not turn out as expected, I should have known. So anyhow:
https://prow.k8s.io/?job=*image-promo
=> current run => https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1588044493649612800
=> prowjob yaml link =>
status.pod_name in the prowjob gives us 9fb02c3f-5b3a-11ed-a031-bac5f1dea348
=> let's go check cloud console and find that pod in the k8s infra cluster => https://console.cloud.google.com/kubernetes/list/overview?mods=logs_tg_staging&project=k8s-infra-prow-build-trusted
=> workloads tab, filter by 9fb02c3f-5b3a-11ed-a031-bac5f1dea348 (could instead auth to the cluster and get the pod with kubectl) =>
https://console.cloud.google.com/kubernetes/pod/us-central1/prow-build-trusted/test-pods/9fb02c3f-5b3a-11ed-a031-bac5f1dea348/details?mods=logs_tg_staging&project=k8s-infra-prow-build-trusted&pageState=(%22savedViews%22:(%22i%22:%22073c6a714dc44e0d94ce168486658f14%22,%22c%22:%5B%5D,%22n%22:%5B%5D))
=> Node gke-prow-build-trust-trusted-pool1-20-4044b2cc-vqho; let's go find the VM by that name: paste 20-4044b2cc-vqho in the search box up top and click on the VM result =>
https://console.cloud.google.com/compute/instancesDetail/zones/us-central1-b/instances/gke-prow-build-trust-trusted-pool1-20-4044b2cc-vqho?q=search&referrer=search&project=k8s-infra-prow-build-trusted&mods=logs_tg_staging
Now if we look at the observability tab:
We can take this shortcut because we know there aren't other workloads on this machine besides the system agents, given the job / pod resource requests.
It looks like we still don't use a lot, so we should probably consider turning back down the changes I made to request lots of resources for this pod.
In the meantime we're getting quick push-after-promo-PR runs and quick promo presubmit testing thanks to @saschagrunert's changes to leverage the CI commit details, so I don't think it's pressing to get a super-fast 8-core full resync yet.
(note that that spike is around when the job started, and then it settles down to around 4% CPU and 0.66MiB/s outbound traffic) (also note that the CPU utilization % metric is relative to the total vCPUs on the VM, not to one core)
@BenTheElder agree, so would it be enough to change the resource requests/limits to match something lower? What about the memory utilization?
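For illustration only: the real values live in the job config in kubernetes/test-infra, and the numbers below are guesses based on the ~4% CPU observation, not recommendations. Lowering the requests might look something like:

```yaml
spec:
  containers:
  - resources:
      requests:
        cpu: "2"       # hypothetical: down from the current ~7 cores
        memory: 8Gi    # hypothetical: down from 40Gi
      limits:
        cpu: "2"
        memory: 8Gi
```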
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Mark this issue as rotten with `/lifecycle rotten`
- Close this issue with `/close`

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`

Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
When running in Kubernetes CI, image promotion takes more than one hour, often multiple hours.
Worse, it takes an hour just to run presubmit tests in the k8s.io repo. As a result, a quickly approved image promotion PR may take 3 hours or more to take effect.
I took a cursory pass at this with https://github.com/kubernetes/test-infra/pull/27743 + https://github.com/kubernetes/test-infra/pull/27765, which set large (7 core, 40 Gi RAM) allocations, configure GOMAXPROCS to match, and increase `--threads` (worker goroutines).

Since those changes, we can follow from a running job, to its pod, to the node the pod is on, to the backing GCE VM. On that VM there should only be kubelet / system daemons; other workloads will not be scheduled there because essentially no schedulable resources remain (the VM has 8 cores, we request 7, and some are reserved for system agents).
We can see that CPU usage spikes when the job starts (git cloning?) and then the entire machine settles to just 3-7% utilization, outgoing connections only ~2-3/s, disk writes < 0.6 MiB/s, and network egress < 0.7 MiB/s, far below the initial spike:
This implies we are bottlenecking heavily on something like waiting for the sigstore API ...?
Copying images could otherwise conceivably bottleneck on CPU or Network but both of those are hardly in use.
I half-joked about synchronized logging and the volume of logs, but I don't seriously think it's contention over the logger.
Besides investigating what exactly is causing this slow-down, I think we should consider running N postsubmit jobs, each responsible for a subset of the images and only triggered when the respective manifest directories change, based on `run_if_changed` plus some option to the image promoter to specify which manifests to read.
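A sketch of what such a sharded job could look like. The job name, path filter, image reference, and the `--manifests` flag are all hypothetical: the flag in particular is a placeholder for the "some option to specify which manifests to read" that would still need to be added to the promoter:

```yaml
postsubmits:
  kubernetes/k8s.io:
  - name: post-k8sio-image-promo-shard-a        # hypothetical job name
    # Only fire when this shard's slice of the manifest tree changes
    # (path pattern is illustrative).
    run_if_changed: '^registry.k8s.io/images/shard-a/'
    spec:
      containers:
      - image: gcr.io/k8s-staging-releng/kpromo  # illustrative image ref
        args:
        - run
        - --manifests=...   # placeholder for the proposed manifest-subset option
```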