Closed: SergeyKanzhelev closed this issue 2 years ago.
/sig node
The runc bump to rc94 is being reverted due to behavior changes around cgroups that need adjusting to. That seems likely to complicate taking rc94 back to the release-1.21 branch.
How can the performance regression in runc between rc92 (in Kubernetes 1.20) and rc93 (in Kubernetes 1.21) be resolved in a way that doesn't require bumping to rc94?
cc: @kolyshkin @AkihiroSuda
/triage accepted
/cc @dims
> How can the performance regression in runc between rc92 (in Kubernetes 1.20) and rc93 (in Kubernetes 1.21) be resolved in a way that doesn't require bumping to rc94?
This would require a fork of runc with the backport of https://github.com/opencontainers/runc/pull/2921.
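For illustration, a minimal sketch of what consuming such a fork could look like in k8s/k8s, assuming a hypothetical fork path and tag (neither exists today; this is not an endorsement of that route):

```sh
# Hypothetical: point the runc module at a fork carrying the
# opencontainers/runc#2921 backport, then re-vendor.
# github.com/example/runc and the tag are placeholders, not a real published fork.
go mod edit -replace github.com/opencontainers/runc=github.com/example/runc@v1.0.0-rc93-patched
hack/update-vendor.sh
```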
I think it makes more sense to fix rc94 test failures (looking at it).
> I think it makes more sense to fix rc94 test failures (looking at it).
Was it just test failures? I can't tell from https://github.com/kubernetes/kubernetes/pull/101888#issuecomment-843377061 whether there's a functional change needed or not. If it is only test changes, that seems more plausible to backport.
> I think it makes more sense to fix rc94 test failures (looking at it).
Did you take a look at my comment in https://github.com/kubernetes/kubernetes/pull/101888#pullrequestreview-661762880 @kolyshkin?
I also wrote it in a bit more detail here: https://github.com/kubernetes/kubernetes/pull/101888#issuecomment-843496893
(It is getting late in Europe right now, so I will probably not answer before tomorrow)
> This would require a fork of runc with the backport of opencontainers/runc#2921.

Isn't it possible to create a new branch from rc93 and cherry-pick the changes, and then cut a new release?
@odinuge for the runc project it's hard to get a release out. They don't maintain branches and rc94 is already out. There will be more confusion around what to call this new release, too (rc93.1???). It's a mess. So no, we have to move forward, help them fix what needs to be fixed, and push to get rc95 out quickly, if not a 1.0.0 final of runc.
@dims:
> @odinuge for the runc project it's hard to get a release out. They don't maintain branches and rc94 is already out. There will be more confusion around what to call this new release, too (rc93.1???). It's a mess.
Well, when it comes to such things as naming, I am quite pragmatic and would like to solve the problem as safely and efficiently as possible. However, if the runc people don't want to do that, that is totally fine, and I have no objections to it. :smile: We also lack some sig-node CI coverage on older releases, making backporting huge dependency changes a bit risky.
> So no, we have to move forward, help them fix what needs to be fixed, and push to get rc95 out quickly, if not a 1.0.0 final of runc.
Not sure if we are talking about the same thing. There aren't (afaik) any issues with runc rc94, only the fact that the API was changed without fixing the relevant things in k8s/k8s. So I don't think there is any reason we need rc95, unless there are other issues or regressions I am not aware of.
Has anyone verified and gathered data on whether #102147 / #101888 actually fix this performance regression, and how big the change is? It is probably not critical for merging into master, but it would make sense to back a backport with some data about the performance increase/decrease seen with the fix. @kolyshkin or @SergeyKanzhelev?
@odinuge My performance analysis (of runc's libcontainer/cgroups/fs.GetStats) is at https://github.com/opencontainers/runc/pull/2921. I haven't done any testing with Kubernetes and/or cAdvisor, only relying on info gathered in https://bugzilla.redhat.com/show_bug.cgi?id=1953102
@SergeyKanzhelev can you verify the regression is fixed with the next patch release of 1.21?
We've verified this on OpenShift 4.8 (1.21-based) after backport: https://bugzilla.redhat.com/show_bug.cgi?id=1953102#c28
/close
@ehashman: Closing this issue.
/reopen
Reran some tests and am seeing better perf, but there is still some degradation: 10-15% depending on pod count.
@SergeyKanzhelev: Reopened this issue.
Reopened https://github.com/kubernetes/perf-tests/issues/1789#issuecomment-881090458 to see if we can get a perf results comparison with 1.20 in the scalability tests.
You can dump flame graphs from the kubelet and view them in the interactive browser like so; I'd suggest a 30m CPU sample, as the default 1m is usually too short to detect the issues.
```sh
# Proxy the API server locally, then load a 30-minute kubelet CPU profile
# into the interactive pprof UI (binary path, node name, and port are from my setup).
kubectl proxy &
go tool pprof -http localhost:8888 hyperkube-4.8.bin http://127.0.0.1:8002/api/v1/nodes/ip-10-0-221-191.ec2.internal/proxy/debug/pprof/profile?seconds=1800
```

`hyperkube-4.8.bin` is the 1.21 kubelet binary which I happen to have downloaded locally (OpenShift 4.8 in my case), and this is tested against a proxy to an AWS OpenShift cluster.
These comments/screenshots may also help: https://bugzilla.redhat.com/show_bug.cgi?id=1953102#c7
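If a side-by-side comparison with 1.20 is wanted, one way to get it (a sketch only; node names, file names, and the proxy port are placeholders to adapt) is to save a profile from a kubelet on each version and diff them with pprof:

```sh
# Save a 30m CPU profile from a kubelet on each version via the API server
# proxy (kubectl proxy listens on 8001 by default; node names are placeholders).
kubectl proxy &
curl -s -o kubelet-1.20.pb.gz \
  "http://127.0.0.1:8001/api/v1/nodes/<node-on-1.20>/proxy/debug/pprof/profile?seconds=1800"
curl -s -o kubelet-1.21.pb.gz \
  "http://127.0.0.1:8001/api/v1/nodes/<node-on-1.21>/proxy/debug/pprof/profile?seconds=1800"

# Report what changed between the two profiles; positive entries are
# functions that consume more CPU in the 1.21 profile.
go tool pprof -diff_base=kubelet-1.20.pb.gz -top kubelet-1.21.pb.gz
```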
/kind regression
/priority important-soon
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@SergeyKanzhelev is this still unresolved? If so, is it still being investigated?
I believe this is not being actively investigated. @SergeyKanzhelev has suggested we close it.
/close
@ehashman: Closing this issue.
/cc @ehashman /cc @bobbypage
What happened:
1.21 kubelet shows perf degradation in our tests. Also reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1953102.
The root cause seems to be runc; it should be fixed by https://github.com/kubernetes/kubernetes/pull/101888.
This issue is to discuss backporting of the fix to 1.21 release.
What you expected to happen:
Kubelet doesn't show significant change in CPU between 1.20 and 1.21.
How to reproduce it (as minimally and precisely as possible):
No special setup required. Just perf observation.
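As a rough illustration of what such a perf observation can look like, here is a minimal sketch (the node name is a placeholder) that samples the kubelet's cumulative CPU time from its metrics endpoint; running it on a 1.20 node and a 1.21 node with a comparable pod count and comparing the deltas gives a rough before/after number:

```sh
# Sample kubelet CPU over a 10-minute window via the API server proxy.
# <node> is a placeholder; repeat on a 1.20 node and a 1.21 node with a
# similar pod count and compare the two deltas.
cpu() {
  kubectl get --raw "/api/v1/nodes/<node>/proxy/metrics" |
    awk '/^process_cpu_seconds_total/ {print $2}'
}
start=$(cpu); sleep 600; end=$(cpu)
echo "kubelet CPU seconds over 10m: $(awk -v a="$start" -v b="$end" 'BEGIN{print b-a}')"
```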
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): 1.21