Pods appear to be scheduled, but there are long delays between starting the infra container and the remaining containers (in the cases we've seen).
There was a similar BZ reported in kube 1.3 when running density tests from 0-100 pods on AWS.
@timothysc's team was to investigate that issue to root cause. This looks the same. At the time, we suspected there were global locking issues in openshift-sdn.
/cc @eparis
We did actually manage to eliminate the SDN, however, before we go down that route again...
After some investigation today (when the SDN was not in use) it may have been correlated to throttling on secret retrievals. In a local environment (bone-stock Origin 1.4.0-alpha.0) I was able to easily reproduce long delays when multiple pods are being scheduled. Calls to docker appeared to be fast, but the kubelet itself was reporting kubelet_docker_operation_latencies in the tens of seconds even for the 50th percentile (and the 90th and 99th were only a few seconds higher).
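For reference, a quick way to pull those latency series straight from the kubelet metrics endpoint (a minimal stdlib-only sketch; it assumes the kubelet's read-only port 10255 is enabled on localhost, and the metric may be exported with a unit suffix such as `_microseconds`, hence the prefix match):

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Scrape the kubelet's Prometheus metrics. Port 10255 is an assumption
	// (the read-only port); adjust for your node configuration.
	resp, err := http.Get("http://127.0.0.1:10255/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Print only the docker operation latency series, skipping other metrics.
		if strings.HasPrefix(line, "kubelet_docker_operation_latencies") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```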
Here is the BZ in question: https://bugzilla.redhat.com/show_bug.cgi?id=1343196
It's not the scheduler.
Added the test-flake label because other flake issues have been closed/duped in favor of this one.
I see these errors in a separate failed extended image test:
E0920 17:47:28.496490 16730 kubelet.go:1816] Unable to mount volumes for pod "mongodb-1-deploy_extended-test-mongodb-replica-a41h3-f5b9t(49f979b4-7f7b-11e6-a225-0e03779dc447)": timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]; skipping pod
E0920 17:47:28.496507 16730 pod_workers.go:184] Error syncing pod 49f979b4-7f7b-11e6-a225-0e03779dc447, skipping: timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]
I0920 17:47:28.496820 16730 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"extended-test-mongodb-replica-a41h3-f5b9t", Name:"mongodb-1-deploy", UID:"49f979b4-7f7b-11e6-a225-0e03779dc447", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]
I0920 17:47:28.496861 16730 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"extended-test-mongodb-replica-a41h3-f5b9t", Name:"mongodb-1-deploy", UID:"49f979b4-7f7b-11e6-a225-0e03779dc447", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "mongodb-1-deploy_extended-test-mongodb-replica-a41h3-f5b9t(49f979b4-7f7b-11e6-a225-0e03779dc447)": timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]
which seem to confirm the theory that secrets are taking too long
Reassigning to tstclair, as he was planning to root cause during 1.5.
We have a fair amount of info that this is not secrets (in most cases), and that Docker is taking extended periods of time to respond to create calls.
@kargakis got: /data/src/github.com/openshift/origin/test/extended/deployments/deployments.go:715 Expected an error to have occurred. Got:
Is there no timeout we can bump to unblock the merge queue? Or disable some of these tests? We're totally blocked by this; almost nothing is merging.
Please keep this issue up to date with fixes.
I switched my docker graph driver from devicemapper to overlay (which also required that I disable docker's selinux support), and the timings were significantly better. I'd say it went from taking 20-40 seconds in between starting the infra and actual containers (on average) to no more than 5 seconds, with an average of probably 1-2.
Note this is not statistically significant, as I only ran the overlay test once. But it's definitely something to investigate (possible contention somewhere in the devicemapper graph driver).
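If anyone reproducing this wants to double-check which graph driver their daemon is actually using, here's a minimal stdlib-only sketch that reads the storage driver out of the Docker Engine API /info endpoint (the unix socket path is the usual default and an assumption here):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net"
	"net/http"
)

func main() {
	// Talk to the local Docker daemon over its default unix socket.
	tr := &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			return net.Dial("unix", "/var/run/docker.sock")
		},
	}
	client := &http.Client{Transport: tr}

	// The host in the URL is ignored; the transport always dials the socket.
	resp, err := client.Get("http://docker/info")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var info struct {
		Driver       string      `json:"Driver"`
		DriverStatus [][2]string `json:"DriverStatus"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		panic(err)
	}

	fmt.Println("storage driver:", info.Driver) // e.g. devicemapper or overlay
	for _, kv := range info.DriverStatus {
		fmt.Printf("  %s: %s\n", kv[0], kv[1])
	}
}
```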
Bumping to p0, this is still blocking merges across the cluster. We need a simple fix in the short term that allows us to stop flaking.
If we have to increase timeouts on certain tests let's do it, but I want the flakes gone.
This test (https://github.com/openshift/origin/blob/8ce6de44e16f506a921de75d1e63b0d2ea49195d/test/extended/deployments/deployments.go#L382) fails on my machine consistently, the problem being that one of the pods of the deployment (https://github.com/openshift/origin/blob/8ce6de44e16f506a921de75d1e63b0d2ea49195d/test/extended/testdata/deployment-simple.yaml) fails the readiness checks. I've tried changing the timeouts on readiness and in the tests, and sooner or later one of the pods starts failing the readiness checks. Not sure what else to check, but I'll keep debugging... Posting here just to report progress...
Try to use a different fixture that is not strict about readiness (the current test requires both pods to become ready) since this test doesn't test readiness.
The only way I can get this test to pass consistently is to change the readiness probe from an HTTP probe to a TCP probe; not sure if that's desirable.
I don't understand why that would make a difference?
I don't understand why that would make a difference?
The reason the httpGet probes are failing is usually timeouts on the GET, but even significantly increasing those timeouts (up to 10s) didn't help. Frankly, I have no idea why the two differ; maybe there's some bug in the probes... Will check it out...
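The behavioral difference between the two probe types, roughly (a stdlib-only sketch, not the kubelet's actual prober code; the address and timeout are placeholders): a tcpSocket probe succeeds as soon as the port accepts a connection, while an httpGet probe also needs the application to answer with a 2xx/3xx status within the timeout, so a slow or wedged app can pass the former and still fail the latter.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	const addr = "127.0.0.1:8080" // hypothetical pod IP and container port
	timeout := 10 * time.Second

	// tcpSocket-style check: only needs the port to accept a connection.
	if conn, err := net.DialTimeout("tcp", addr, timeout); err == nil {
		conn.Close()
		fmt.Println("tcp probe: ready")
	} else {
		fmt.Println("tcp probe failed:", err)
	}

	// httpGet-style check: the app must actually serve a response in time.
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get("http://" + addr + "/")
	if err != nil {
		fmt.Println("http probe failed:", err)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 200 && resp.StatusCode < 400 {
		fmt.Println("http probe: ready")
	} else {
		fmt.Println("http probe failed with status", resp.StatusCode)
	}
}
```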
Is it because the container is starting and failing and Get actually exposes the app failure?
@mfojtik @soltysh bump. Status here?
I think the original issue description (post-1.4-rebase, pod startup times have increased) is not happening because of the rebase. It's because of the upgrade from Docker 1.9 to 1.10. Right?
Yup, that's what we've identified. I haven't seen anything else in there.
@ncdc and we should now run all CI on docker 1.12, right? So this should no longer happen.
I vote for closing this issue, and reopening if the problem bites us again.
@mfojtik that will only be true if docker 1.12 is faster than 1.10
@danmcp @smarterclayton maybe we need 1 jenkins job that tests using overlay instead of devicemapper, for comparison
@ncdc @danmcp @smarterclayton or an option on the job where you can choose the storage driver
Choosing is not good; it has to be permanent, i.e. one job running on devicemapper and another on overlay.
Most deployment flakes are related to this. @mfojtik @smarterclayton not sure if we should extend the deployment timeout any more. To be honest, I would prefer using the default deployment timeout (10m) and be done with this at the expense of some tests becoming slower.
The aggressive timeout doesn't seem to help us. Is there any evidence of that?
@smarterclayton in https://github.com/openshift/origin/issues/11685 the deployer pod succeeds around 5 minutes after it started. The deployment's pod was scaled up from the start but didn't transition to Ready until ~5 minutes later. We end up failing the test.
Why did it take 5 min? That's completely unexpected.
@rhvgoyal Please take a look at this. Thanks.
This is so high level that I have no idea. So far none of the data suggests that it is a storage issue. If this is a problem, please narrow it down.
@rhvgoyal we have a doc (I need to get a link to it) showing that the devicemapper timings are slower when you go from docker 1.9 to 1.10. I apologize for not having this data handy right now. I believe @soltysh has it somewhere. We'll get it to you as soon as we can.
I'll try to run 1.12 tests and add them to that document. @rhvgoyal where can I get 1.12 binaries/package for F24?
I'm running a combination of http://koji.fedoraproject.org/koji/taskinfo?taskID=16262294 and http://koji.fedoraproject.org/koji/buildinfo?buildID=812817 (but you probably could update the latter to 1.12.3-2).
I have these packages installed:
Given the rate of flakes, we need to do something here.
Probably increasing a few timeouts.
Agreed, I'm seeing this in half of the failures in #11916. The only question is what can we do? I doubt switching to OverlayFS is an option?
It's not.
Increasing a few timeouts seems reasonable. I'll dig into the failures and see which are the most frequent and where it makes sense.
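To be clear about what "increasing a few timeouts" means for the tests: the extended tests mostly poll for a condition with a bounded wait, so the change amounts to passing a larger timeout to that polling helper. A minimal stand-in sketch (the helper name and durations are illustrative, not the actual test utilities):

```go
package main

import (
	"fmt"
	"time"
)

// pollUntil retries cond every interval until it reports done or the timeout
// elapses. Bumping a test timeout amounts to passing a larger value here.
func pollUntil(interval, timeout time.Duration, cond func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := cond()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out after %s", timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	start := time.Now()
	// Example condition: pretend the deployment becomes ready after ~3s.
	err := pollUntil(500*time.Millisecond, 10*time.Minute, func() (bool, error) {
		return time.Since(start) > 3*time.Second, nil
	})
	fmt.Println("result:", err)
}
```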
This is not a 1.4 blocker, because the longer times are not introduced by Origin code but rather by the newer Docker version. The document contains a detailed log of the tests performed with different Docker versions against Origin 1.4 and 1.3, and with different storage drivers (overlayfs and devicemapper).
@jwhonce @rhvgoyal bump - please let us know if there's anything we can do to help you debug this
I've updated the doc with docker 1.12 tests against origin 1.4 and 1.5 (just devicemapper). It looks like Origin 1.4 with Docker 1.12 is an additional second slower than with 1.10, and Origin 1.5 is another second slower still (the time in parentheses is the median):
| | Origin 1.4 + Docker 1.10 | Origin 1.4 + Docker 1.12 | Origin 1.5 + Docker 1.10 | Origin 1.5 + Docker 1.12 |
|---|---|---|---|---|
| CreateContainer | 10.6s (12.9s) | 11.3s (14.1s) | 11.7s (14.8s) | 12.1s (15.3s) |
| runContainer | 12.8s (15.2s) | 11.7s (14.5s) | 12.4s (15.4s) | 12.5s (15.7s) |
EDIT: added origin 1.5 + docker 1.10 times
The doc was also updated.
Our deployment test suite has started flaking more frequently due to deployer pods needing more time to become ready. We don't have any valuable data since all of our flakes are due to our tests being time-bounded.
See: https://github.com/openshift/origin/pull/11001 and related flakes: https://github.com/openshift/origin/issues/10951 https://github.com/openshift/origin/issues/11008 https://github.com/openshift/origin/issues/10989
cc: @derekwaynecarr @smarterclayton @mfojtik