kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Speed up getting pod statuses in PLEG when there are many changes #26394

Open · yujuhong opened this issue 8 years ago

yujuhong commented 8 years ago

Forked from https://github.com/kubernetes/kubernetes/issues/23591#issuecomment-203042820. Creating a new issue so I won't forget.

PLEG serially gets the status of every pod in which at least one container has undergone a state transition. This becomes the bottleneck when many containers change within one relist period. We were aware of this problem and had come up with a few options before:

  1. Add a new GetPodStatus variant that bypasses the docker ps -a call, since PLEG already has this information. The downside is that this is a very docker-specific optimization and doesn't make sense for other runtimes.
  2. Parallelize the process.
  3. Do nothing. Assume docker is busy and accept the slower processing.

I think we can do (2) with a small number of goroutines (e.g., 2) to speed things up.
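
A minimal sketch of option (2), assuming a fixed pool of worker goroutines draining a channel of changed pods. The names `inspectPod`, `inspectAll`, and `podStatus` are illustrative stand-ins for the runtime's GetPodStatus path, not the actual PLEG code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// podStatus is a stand-in for the runtime's pod status result.
type podStatus struct {
	id  string
	err error
}

// inspectPod stands in for a single GetPodStatus call against the runtime.
func inspectPod(id string) podStatus {
	time.Sleep(10 * time.Millisecond) // simulate an inspection round-trip
	return podStatus{id: id}
}

// inspectAll fans the changed pods out to a small, bounded set of workers
// instead of inspecting them one at a time.
func inspectAll(changedPods []string, workers int) []podStatus {
	jobs := make(chan string)
	results := make(chan podStatus, len(changedPods))

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				results <- inspectPod(id)
			}
		}()
	}

	for _, id := range changedPods {
		jobs <- id
	}
	close(jobs)
	wg.Wait()
	close(results)

	statuses := make([]podStatus, 0, len(changedPods))
	for s := range results {
		statuses = append(statuses, s)
	}
	return statuses
}

func main() {
	pods := []string{"pod-a", "pod-b", "pod-c", "pod-d"}
	for _, s := range inspectAll(pods, 2) { // two workers, as suggested above
		fmt.Println(s.id, s.err)
	}
}
```

Keeping the pool small (e.g., 2) overlaps inspections without adding much concurrent load on the runtime.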

timothysc commented 8 years ago

@yujuhong do you have a list of items that you could federate? In general, the responsiveness of the kubelet is an issue.

/cc @kubernetes/rh-cluster-infra @rrati

goltermann commented 8 years ago

@yujuhong is this still planned for 1.4?

dims commented 7 years ago

This needs to be triaged as a release-blocker or not for 1.5 @yujuhong

yujuhong commented 7 years ago

I had a PR to parallelize the inspections, but I couldn't observe a significant difference in performance in my measurements. I think we need to benchmark/profile the kubelet to identify the bottleneck first.

I am removing the milestone and marking this as backlog. I will raise the priority if this becomes a serious issue.
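
To make "benchmark first" concrete: a hypothetical Go benchmark along these lines could compare the serial and parallel paths before committing to the change. It assumes the `inspectPod`/`inspectAll` stand-ins from the sketch above and is not code from the kubelet tree:

```go
package main

import "testing"

var benchPods = []string{"pod-a", "pod-b", "pod-c", "pod-d"}

// BenchmarkSerialInspect measures one-at-a-time inspection.
func BenchmarkSerialInspect(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, id := range benchPods {
			inspectPod(id)
		}
	}
}

// BenchmarkParallelInspect measures the bounded worker-pool version.
func BenchmarkParallelInspect(b *testing.B) {
	for i := 0; i < b.N; i++ {
		inspectAll(benchPods, 2)
	}
}
```

If the two show no meaningful gap, the bottleneck is likely elsewhere (e.g., in the runtime itself), which would match the observation above.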

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.

/lifecycle stale

swatisehgal commented 3 years ago

/triage accepted

k8s-triage-robot commented 1 year ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  - Confirm that this issue is still relevant with /triage accepted (org members only)
  - Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.