openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

Some containers take multiple minutes to start, resulting in timeouts and test failures #11016

Closed: 0xmichalis closed this issue 6 years ago

0xmichalis commented 8 years ago

Our deployment test suite has started flaking more frequently because deployer pods need more time to become ready. We don't have much useful data, since all of our flakes come from the tests being time-bounded.

See: https://github.com/openshift/origin/pull/11001 and related flakes: https://github.com/openshift/origin/issues/10951 https://github.com/openshift/origin/issues/11008 https://github.com/openshift/origin/issues/10989

cc: @derekwaynecarr @smarterclayton @mfojtik

smarterclayton commented 8 years ago

Pods appear to be scheduled, but there are long delays between starting the infra container and the remaining containers (in the cases we've seen).

derekwaynecarr commented 8 years ago

There was a similar BZ reported in kube 1.3 when running density tests from 0-100 pods on AWS.

@timothysc's team was going to investigate that issue to root cause; this looks the same. At the time, we suspected there were global locking issues in openshift-sdn.

/cc @eparis

eparis commented 8 years ago

We did actually manage to rule out the SDN, however, before we go down that route again...

smarterclayton commented 8 years ago

After some investigation today (when the SDN was not in use), it may be correlated with throttling on secret retrievals. In a local environment (bone-stock origin 1.4.0-alpha.0) I was able to easily reproduce long delays when multiple pods are being scheduled. Calls to docker appeared to be fast, but the kubelet itself was reporting kubelet_docker_operation_latencies in the tens of seconds even at the 50th percentile (and the 90th and 99th were only a few seconds higher).

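For reference, a quick way to eyeball those kubelet latencies is to scrape the kubelet's Prometheus metrics endpoint and filter the docker-operation series. A minimal sketch in Go, assuming the kubelet read-only port (10255) is enabled on the node; the exact metric name varies across releases, so the filter only matches on the prefix:

```go
// metricgrep.go: dump the kubelet's docker-operation latency metrics.
// Assumes the kubelet read-only port (10255) is enabled on this node.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:10255/metrics")
	if err != nil {
		log.Fatalf("fetching kubelet metrics: %v", err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Match on the prefix because the metric name differs slightly
		// between releases (values are reported in microseconds).
		if strings.HasPrefix(line, "kubelet_docker_operation") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatalf("reading metrics: %v", err)
	}
}
```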

timothysc commented 8 years ago

Here is the BZ in question: https://bugzilla.redhat.com/show_bug.cgi?id=1343196

It's not the scheduler.

csrwng commented 8 years ago

Added the test-flake label because other flake issues have been closed/duped in favor of this one.

csrwng commented 8 years ago

I see these errors in a separate failed extended image test:

E0920 17:47:28.496490   16730 kubelet.go:1816] Unable to mount volumes for pod "mongodb-1-deploy_extended-test-mongodb-replica-a41h3-f5b9t(49f979b4-7f7b-11e6-a225-0e03779dc447)": timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]; skipping pod
E0920 17:47:28.496507   16730 pod_workers.go:184] Error syncing pod 49f979b4-7f7b-11e6-a225-0e03779dc447, skipping: timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]
I0920 17:47:28.496820   16730 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"extended-test-mongodb-replica-a41h3-f5b9t", Name:"mongodb-1-deploy", UID:"49f979b4-7f7b-11e6-a225-0e03779dc447", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]
I0920 17:47:28.496861   16730 server.go:608] Event(api.ObjectReference{Kind:"Pod", Namespace:"extended-test-mongodb-replica-a41h3-f5b9t", Name:"mongodb-1-deploy", UID:"49f979b4-7f7b-11e6-a225-0e03779dc447", APIVersion:"v1", ResourceVersion:"3642", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "mongodb-1-deploy_extended-test-mongodb-replica-a41h3-f5b9t(49f979b4-7f7b-11e6-a225-0e03779dc447)": timeout expired waiting for volumes to attach/mount for pod "mongodb-1-deploy"/"extended-test-mongodb-replica-a41h3-f5b9t". list of unattached/unmounted volumes=[deployer-token-nactn]

which seem to confirm the theory that the secrets are taking too long.
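
A rough way to see how widespread these mount waits are is to scan the node's kubelet log for the "timeout expired waiting for volumes" message and count hits per pod. A stdlib-only sketch; the log path is an assumption and the message format is taken from the output above:

```go
// mountwaits.go: count "timeout expired waiting for volumes" messages per pod
// in a kubelet/node log. The log path is an assumption; the message text is
// taken from the output pasted above. Adjust both for your environment.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"regexp"
)

func main() {
	f, err := os.Open("/var/log/origin-node.log") // assumed log location
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Capture the pod name quoted right after the timeout message.
	re := regexp.MustCompile(`timeout expired waiting for volumes to attach/mount for pod "([^"]+)"`)
	counts := map[string]int{}

	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // kubelet lines can be long
	for scanner.Scan() {
		if m := re.FindStringSubmatch(scanner.Text()); m != nil {
			counts[m[1]]++
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	for pod, n := range counts {
		fmt.Printf("%s\t%d timeouts\n", pod, n)
	}
}
```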

eparis commented 8 years ago

Reassigning to tstclair, as he was planning to root-cause this during 1.5.

smarterclayton commented 8 years ago

We have a fair amount of evidence that this is not secrets (in most cases), and that Docker is taking extended periods of time to respond to create calls.

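One crude way to check that on a given node is to time repeated docker run round trips and look at the spread; if create calls are slow, it shows up immediately. A sketch that shells out to the docker CLI (it assumes docker is on PATH and the busybox image has already been pulled, so pulls don't skew the numbers):

```go
// dockertiming.go: crude benchmark of docker create/start/remove round trips.
// Assumes the docker CLI is on PATH and the busybox image is already pulled.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

func main() {
	const runs = 20
	var total time.Duration
	for i := 0; i < runs; i++ {
		start := time.Now()
		// --rm removes the container afterwards; "true" exits immediately,
		// so the measurement is dominated by create/start/teardown, not workload.
		cmd := exec.Command("docker", "run", "--rm", "busybox", "true")
		if out, err := cmd.CombinedOutput(); err != nil {
			log.Fatalf("docker run failed: %v\n%s", err, out)
		}
		elapsed := time.Since(start)
		total += elapsed
		fmt.Printf("run %2d: %v\n", i+1, elapsed)
	}
	fmt.Printf("average: %v\n", total/runs)
}
```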

mfojtik commented 8 years ago

@kargakis got:

/data/src/github.com/openshift/origin/test/extended/deployments/deployments.go:715
Expected an error to have occurred. Got:
: nil
/data/src/github.com/openshift/origin/test/extended/deployments/deployments.go:696

bparees commented 8 years ago

Is there no timeout we can bump to unblock the merge queue? Or disable some of these tests? We're totally blocked by this; almost nothing is merging.

smarterclayton commented 8 years ago

Please keep this issue up to date with fixes.

ncdc commented 8 years ago

I switched my docker graph driver from devicemapper to overlay (which also required that I disable docker's selinux support), and the timings were significantly better. I'd say it went from taking 20-40 seconds in between starting the infra and actual containers (on average) to no more than 5 seconds, with an average of probably 1-2.

Note this is not statistically significant, as I only ran the overlay test once. But it's definitely something to investigate (possible contention somewhere in the devicemapper graph driver).
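
Before comparing numbers across nodes it's worth confirming which graph driver each one is actually running; docker info reports it. A small sketch that shells out and extracts that line (plain docker info output is assumed):

```go
// storagedriver.go: print the storage/graph driver the local docker daemon is using.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("docker", "info").Output()
	if err != nil {
		log.Fatalf("docker info: %v", err)
	}
	scanner := bufio.NewScanner(strings.NewReader(string(out)))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// docker info prints e.g. "Storage Driver: devicemapper" or "... overlay".
		if strings.HasPrefix(line, "Storage Driver:") {
			fmt.Println(line)
			return
		}
	}
	log.Fatal("Storage Driver line not found in docker info output")
}
```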

smarterclayton commented 8 years ago

Bumping to p0, this is still blocking merges across the cluster. We need a simple fix in the short term that allows us to stop flaking.

smarterclayton commented 8 years ago

If we have to increase timeouts on certain tests let's do it, but I want the flakes gone.

soltysh commented 8 years ago

This test (https://github.com/openshift/origin/blob/8ce6de44e16f506a921de75d1e63b0d2ea49195d/test/extended/deployments/deployments.go#L382) fails on my machine consistently; the problem is that one of the pods of the deployment (https://github.com/openshift/origin/blob/8ce6de44e16f506a921de75d1e63b0d2ea49195d/test/extended/testdata/deployment-simple.yaml) fails the readiness checks. I've tried changing the timeouts on readiness and in the tests, and sooner or later one of the pods starts failing the readiness checks. Not sure what else to check, but I'll keep debugging... Posting here just to report on the progress...

0xmichalis commented 8 years ago

Try to use a different fixture that is not strict about readiness (the current test requires both pods to become ready) since this test doesn't test readiness.

soltysh commented 8 years ago

The only way I can get this test to pass consistently is to change the readiness check to a TCP probe instead of an HTTP one; not sure if that's desirable.

smarterclayton commented 8 years ago

I don't understand why that would make a difference?

soltysh commented 8 years ago

> I don't understand why that would make a difference?

The httpGet probes usually fail due to timeouts on the GET, but even significantly increasing those timeouts (up to 10s) didn't help. Frankly, I have no idea why the two differ; maybe there's some bug in the probes... Will check it out...
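
For what it's worth, the semantic difference between the two probe types may explain part of it: a TCP probe passes as soon as the port accepts a connection, while an HTTP probe has to receive a complete 2xx/3xx response within the probe timeout, so a process that is listening but still too slow to answer requests fails the HTTP check. A rough stdlib-only illustration (address, URL, and timeout are placeholders, not values from the fixture):

```go
// probesketch.go: rough illustration of why a TCP readiness check can pass
// while an HTTP one times out. Host, port, and path are placeholders.
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

const (
	addr    = "10.1.2.3:8080"         // placeholder pod IP:port
	url     = "http://10.1.2.3:8080/" // placeholder probe URL
	timeout = 1 * time.Second         // default probe timeout
)

// tcpProbe succeeds as soon as the port accepts a connection,
// even if the app behind it is still too busy to answer requests.
func tcpProbe() error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return err
	}
	return conn.Close()
}

// httpProbe needs a complete, successful HTTP response within the timeout,
// so a listening-but-slow process fails this one.
func httpProbe() error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 400 {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	fmt.Println("tcp :", tcpProbe())
	fmt.Println("http:", httpProbe())
}
```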

smarterclayton commented 8 years ago

Is it because the container is starting and failing and Get actually exposes the app failure?

pweil- commented 8 years ago

@mfojtik @soltysh bump. Status here?

ncdc commented 8 years ago

I think the original issue (pod startup times increased after the 1.4 rebase) is not happening because of the rebase itself. It's because of the upgrade from Docker 1.9 to 1.10. Right?

soltysh commented 8 years ago

Yup, that's what we've identified. I haven't seen anything else in there.

mfojtik commented 8 years ago

@ncdc and we should now be running all CI on Docker 1.12, right? So this should no longer happen.

soltysh commented 8 years ago

I vote for closing this issue, and reopening if the problem bites us again.

ncdc commented 8 years ago

@mfojtik that will only be true if docker 1.12 is faster than 1.10

mfojtik commented 8 years ago

@ncdc I suspect it is not: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_conformance/8031/testReport/junit/(root)/Extended/_k8s_io__Services_should_serve_multiport_endpoints_from_pods__Conformance_/

ncdc commented 8 years ago

@danmcp @smarterclayton maybe we need one Jenkins job that tests using overlay instead of devicemapper, for comparison

mfojtik commented 8 years ago

@ncdc @danmcp @smarterclayton or an option on the job where you can choose the storage driver

soltysh commented 8 years ago

Choosing is not good; it has to be permanent, i.e. one job running on devicemapper and another on overlay.

0xmichalis commented 8 years ago

Most deployment flakes are related to this. @mfojtik @smarterclayton not sure if we should extend the deployment timeout any more. To be honest, I would prefer using the default deployment timeout (10m) and be done with this at the expense of some tests becoming slower.
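
If we do go that way, the change in the tests would essentially be polling with the deployer's own 10-minute bound instead of a tighter test-local one. A hypothetical sketch of that shape using the apimachinery wait helpers (the import path and the deploymentComplete check are placeholders, not the actual test code):

```go
// waitsketch.go: hypothetical sketch of waiting on a deployment with the
// default 10-minute deployment timeout instead of a tighter test-local bound.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait" // pkg/util/wait under the older vendoring
)

func main() {
	deploymentTimeout := 10 * time.Minute // the default deployment timeout mentioned above

	err := wait.PollImmediate(5*time.Second, deploymentTimeout, func() (bool, error) {
		// Placeholder check: in the real test this would ask the API server
		// whether the latest deployment has completed.
		return deploymentComplete(), nil
	})
	if err != nil {
		fmt.Println("deployment did not complete within", deploymentTimeout, ":", err)
		return
	}
	fmt.Println("deployment completed")
}

// deploymentComplete is a stand-in for the real readiness check.
func deploymentComplete() bool { return false }
```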

smarterclayton commented 8 years ago

The aggressive timeout doesn't seem to help us. Is there any evidence of that?

0xmichalis commented 8 years ago

@smarterclayton in https://github.com/openshift/origin/issues/11685 the deployer pod succeeds around 5 minutes after it started. The deployment's pod was scaled up from the start but didn't transition to Ready until ~5 minutes later, so we end up failing the test.

smarterclayton commented 8 years ago

Why did it take 5 min? That's completely unexpected.

jwhonce commented 8 years ago

@rhvgoyal Please take a look at this. Thanks.

rhvgoyal commented 8 years ago

This is so high-level that I have no idea. So far none of the data suggests that it is a storage issue. If this is a problem, please narrow it down.

ncdc commented 8 years ago

@rhvgoyal we have a doc (I need to get a link to it) showing that the devicemapper timings are slower when you go from docker 1.9 to 1.10. I apologize for not having this data handy right now. I believe @soltysh has it somewhere. We'll get it to you as soon as we can.

soltysh commented 8 years ago

It's https://docs.google.com/a/redhat.com/document/d/1AaNZTggal-OUjgJah7FV4mWYNS-in5cHTBLf0BDIkh4/edit?usp=sharing

soltysh commented 8 years ago

I'll try to run 1.12 tests and add them to that document. @rhvgoyal where can I get 1.12 binaries/package for F24?

ncdc commented 8 years ago

I'm running a combination of http://koji.fedoraproject.org/koji/taskinfo?taskID=16262294 and http://koji.fedoraproject.org/koji/buildinfo?buildID=812817 (but you probably could update the latter to 1.12.3-2).

I have these packages installed:

smarterclayton commented 7 years ago

Given the rate of flakes, we need to do something here.

smarterclayton commented 7 years ago

Probably increasing a few timeouts.

soltysh commented 7 years ago

Agreed, I'm seeing this in half of the failures in #11916. The only question is what can we do? I doubt switching to OverlayFS is an option?

smarterclayton commented 7 years ago

It's not.

soltysh commented 7 years ago

Increasing a few timeouts seems reasonable. I'll dig into the failures and see which ones are the most frequent and where it makes sense.

soltysh commented 7 years ago

This is not a 1.4 blocker, because the longer times are not introduced by Origin code but rather by the newer Docker version. The document linked above contains a detailed log of the tests performed with different Docker versions against Origin 1.4 and 1.3, and with different storage drivers (overlayfs and devicemapper).

ncdc commented 7 years ago

@jwhonce @rhvgoyal bump - please let us know if there's anything we can do to help you debug this

soltysh commented 7 years ago

I've updated the doc with Docker 1.12 tests against Origin 1.4 and 1.5 (devicemapper only). It looks like Origin 1.4 with Docker 1.12 is about a second slower than with 1.10, and Origin 1.5 is about another second slower still (the time in parentheses is the median):

| Operation | Origin 1.4 + Docker 1.10 | Origin 1.4 + Docker 1.12 | Origin 1.5 + Docker 1.10 | Origin 1.5 + Docker 1.12 |
| --- | --- | --- | --- | --- |
| CreateContainer | 10.6s (12.9s) | 11.3s (14.1s) | 11.7s (14.8s) | 12.1s (15.3s) |
| runContainer | 12.8s (15.2s) | 11.7s (14.5s) | 12.4s (15.4s) | 12.5s (15.7s) |

EDIT: added origin 1.5 + docker 1.10 times

soltysh commented 7 years ago

The doc was updated as well.