kubernetes / perf-tests

Performance tests and benchmarks
Apache License 2.0

ci-kubernetes-e2e-gci-gce-scalability-watch-list-off has unexpectedly high LIST latency #2287

Open mborsz opened 1 year ago

mborsz commented 1 year ago

I was reviewing the performance of kube-apiserver in the watchlist-off tests and found that the LIST latency in https://k8s-testgrid.appspot.com/sig-scalability-experiments#watchlist-off is around 40s:

I0609 08:33:55.589834      11 trace.go:236] Trace[2051566814]: "SerializeObject" audit-id:8a686097-a6c2-4869-9ff1-27f5b7a9dce5,method:GET,url:/api/v1/namespaces/watch-list-1/secrets,protocol:HTTP/2.0,mediaType:application/vnd.kubernetes.protobuf,encoder:{"encodeGV":"v1","encoder":"protobuf","name":"versioning"} (09-Jun-2023 08:33:11.330) (total time: 44259ms):
Trace[2051566814]: ---"Write call succeeded" writer:*gzip.Writer,size:304090734,firstWrite:true 43754ms (08:33:55.589)
Trace[2051566814]: [44.259564981s] [44.259564981s] END
I0609 08:33:55.589912      11 trace.go:236] Trace[1886779945]: "List" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:8a686097-a6c2-4869-9ff1-27f5b7a9dce5,client:35.226.210.156,protocol:HTTP/2.0,resource:secrets,scope:namespace,url:/api/v1/namespaces/watch-list-1/secrets,user-agent:watch-list/v0.0.0 (linux/amd64) kubernetes/$Format,verb:LIST (09-Jun-2023 08:33:11.328) (total time: 44260ms):
Trace[1886779945]: ---"Writing http response done" count:400 44259ms (08:33:55.589)
Trace[1886779945]: [44.260928281s] [44.260928281s] END
I0609 08:33:55.590150      11 httplog.go:132] "HTTP" verb="LIST" URI="/api/v1/namespaces/watch-list-1/secrets?limit=500&resourceVersion=0" latency="44.263133267s" userAgent="watch-list/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="8a686097-a6c2-4869-9ff1-27f5b7a9dce5" srcIP="35.226.210.156:45202" apf_pl="workload-low" apf_fs="service-accounts" apf_iseats=1 apf_fseats=0 apf_additionalLatency="0s" apf_execution_time="44.262028085s" resp=200

This is quite high for ~290 MiB of data. From experience, we usually observe 20-30 MiB/s of write throughput due to compression, which should translate to more like ~10-15s of latency.
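Spelling out that arithmetic, with the response size taken from the SerializeObject trace above:

```go
package main

import "fmt"

func main() {
	const respBytes = 304090734.0 // "size:304090734" from the SerializeObject trace
	const mib = 1 << 20

	// Expected write latency at the throughput range we usually observe.
	for _, mibps := range []float64{20, 30} {
		fmt.Printf("at %.0f MiB/s: %.1fs expected\n", mibps, respBytes/(mibps*mib))
	}
	// Prints 14.5s at 20 MiB/s and 9.7s at 30 MiB/s; the observed 44.26s is
	// roughly 3x slower than even the pessimistic end of that range.
}
```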

I suspect we are lacking CPU or network egress on either the master or the node side.

I'm not sure how important it is, but I'm afraid it may affect some of the benchmarks we are running.

/assign @p0lyn0mial

/cc @serathius

p0lyn0mial commented 1 year ago

Hey, thanks for the info! The first suspect is the test itself, which runs on a worker node. The easiest approach would be to increase the number of CPUs for the test machine, especially since nothing else is running on it. The second approach would be to monitor CPU and RAM usage from within the test itself; that would be more time-consuming and would likely require correlating the usage with the machine itself.

I think I will start with the first approach.

p0lyn0mial commented 1 year ago

What is interesting is that at some point we increased the number of test replicas to 2 (https://github.com/kubernetes/perf-tests/pull/2281).

This change was reflected in CPU and RAM usage, but not so much in the latency (screenshots).

[Three screenshots from 2023-07-07 (13:52-13:53); images not preserved]

wojtek-t commented 1 year ago

/cc

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

p0lyn0mial commented 9 months ago

/remove-lifecycle stale

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

p0lyn0mial commented 6 months ago

/remove-lifecycle stale

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

wojtek-t commented 3 months ago

/remove-lifecycle stale

k8s-triage-robot commented 4 hours ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

wojtek-t commented 4 hours ago

/remove-lifecycle stale