openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

Some containers take multiple minutes to start, resulting in timeouts and test failures #11016

Closed 0xmichalis closed 6 years ago

0xmichalis commented 8 years ago

Our deployment test suite has started flaking more frequently because deployer pods need more time to become ready. We don't have much useful data, since all of our flakes surface as timeouts in time-bounded tests.

See: https://github.com/openshift/origin/pull/11001 and related flakes: https://github.com/openshift/origin/issues/10951 https://github.com/openshift/origin/issues/11008 https://github.com/openshift/origin/issues/10989

cc: @derekwaynecarr @smarterclayton @mfojtik

timothysc commented 7 years ago

/cc @mffiedler ...

bparees commented 7 years ago

still seeing issues from this: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_conformance/9556/

(https://github.com/openshift/origin/issues/9076 was duped to this issue)

bparees commented 7 years ago

@kargakis are we closing in on a solution to this?

bparees commented 7 years ago

sorry, question should be directed to @jwhonce and @soltysh

bparees commented 7 years ago

and if not i'd like to reopen https://github.com/openshift/origin/issues/11649 (which was closed as dupe of this) so a workaround can be put in place because it's cropping up a lot lately.

soltysh commented 7 years ago

@bparees for now, increasing timeouts for serial re-appearances is the only solution :/
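
For reference, that workaround is just a wider polling bound around the readiness check. A minimal sketch using the k8s.io/apimachinery wait helper; checkDeployerReady is a hypothetical stand-in for whatever check the failing test performs, and the 10-minute bound is an arbitrary example, not the value the suite actually uses:

```go
// Sketch: widen the readiness wait so a slow docker daemon does not
// immediately become a test failure. Not the actual origin test code.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// checkDeployerReady is a hypothetical placeholder for the real readiness
// check (e.g. querying the API server for the deployer pod's Ready condition).
func checkDeployerReady() (bool, error) {
	return false, nil // placeholder: always "not ready yet"
}

func main() {
	// Poll every 5s, give up after 10 minutes instead of the usual shorter bound.
	if err := wait.PollImmediate(5*time.Second, 10*time.Minute, checkDeployerReady); err != nil {
		fmt.Println("deployer pod never became ready:", err)
	}
}
```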

bparees commented 7 years ago

@soltysh ok, then I don't think we should be duping specific test case flakes to this issue; we need to fix each flake. I'm going to reopen #11649.

soltysh commented 7 years ago

Yes, please, it'll help us find those cases.

smarterclayton commented 7 years ago

Latency metrics for node 172.18.13.129
Dec 21 15:28:31.743: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.99 Latency:2m1.879512s}
Dec 21 15:28:31.743: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.9 Latency:1m37.203206s}
Dec 21 15:28:31.743: INFO: {Operation: Method:pleg_relist_latency_microseconds Quantile:0.99 Latency:50.928686s}
Dec 21 15:28:31.743: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:48.959779s}
Dec 21 15:28:31.743: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:48.823987s}
Dec 21 15:28:31.743: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:46.16127s}
Dec 21 15:28:31.743: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.99 Latency:45.374806s}
Dec 21 15:28:31.743: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.9 Latency:45.374806s}
Dec 21 15:28:31.743: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:41.035892s}
Dec 21 15:28:31.743: INFO: {Operation: Method:pod_worker_start_latency_microseconds Quantile:0.99 Latency:40.465985s}
Dec 21 15:28:31.743: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.9 Latency:39.089519s}
Dec 21 15:28:31.743: INFO: {Operation:createPodInfraContainer Method:container_manager_latency_microseconds Quantile:0.99 Latency:36.979561s}
Dec 21 15:28:31.743: INFO: {Operation:list_containers Method:docker_operations_latency_microseconds Quantile:0.99 Latency:31.90826s}
Dec 21 15:28:31.743: INFO: {Operation:GetPods Method:container_manager_latency_microseconds Quantile:0.99 Latency:31.907478s}
Dec 21 15:28:31.743: INFO: {Operation:inspect_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:25.352457s}
Dec 21 15:28:31.743: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:13.037115s}

Wow.
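
For anyone trying to reproduce those numbers outside the e2e run: the same summaries can be scraped straight off the kubelet. A minimal sketch, assuming the read-only metrics port (10255) is reachable from wherever this runs; the node IP is the one from the log above:

```go
// Sketch: dump the kubelet latency summaries shown above. Assumes the
// kubelet read-only port (10255) is open; metric names may carry a
// kubelet_ prefix depending on version, so match by substring.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://172.18.13.129:10255/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	interesting := []string{
		"pod_start_latency_microseconds",
		"docker_operations_latency_microseconds",
		"pleg_relist_latency_microseconds",
	}
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		for _, metric := range interesting {
			if strings.Contains(line, metric) {
				fmt.Println(line)
				break
			}
		}
	}
}
```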

smarterclayton commented 7 years ago

This is a docker problem. I'm seeing 1/4 conformance runs fail because of this, and start latencies in excess of 2-3 minutes during those runs. Something is broken in Docker, possibly related to the parallel builds hang.

Raising this to P0 - @jwhonce, can you make sure the appropriate investigations are taking place? This would be a production blocker for many use cases (a loaded node takes many minutes to start some pods).

smarterclayton commented 7 years ago

There was at least one container that took 4 minutes to start:

Latency metrics for node ci-pr65-ig-n-x0v5.c.openshift-gce-devel-ci.internal
Jan  6 00:09:38.454: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.99 Latency:4m0.426184s}
Jan  6 00:09:38.454: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.9 Latency:2m33.107112s}
Jan  6 00:09:38.454: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:2m15.985041s}
Jan  6 00:09:38.454: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.9 Latency:2m3.638843s}
Jan  6 00:09:38.454: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.99 Latency:2m3.638843s}
Jan  6 00:09:38.454: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.99 Latency:1m36.1392s}
Jan  6 00:09:38.454: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.9 Latency:1m26.463562s}
Jan  6 00:09:38.454: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:49.979557s}
Jan  6 00:09:38.454: INFO: {Operation:createPodInfraContainer Method:container_manager_latency_microseconds Quantile:0.99 Latency:47.685318s}
Jan  6 00:09:38.454: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:37.203829s}
Jan  6 00:09:38.454: INFO: {Operation:create_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:31.249116s}
Jan  6 00:09:38.454: INFO: {Operation: Method:pleg_relist_latency_microseconds Quantile:0.99 Latency:30.389475s}
Jan  6 00:09:38.454: INFO: {Operation: Method:pod_worker_start_latency_microseconds Quantile:0.99 Latency:24.973853s}
Jan  6 00:09:38.454: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:21.276118s}
Jan  6 00:09:38.454: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:19.079327s}
Jan  6 00:09:38.454: INFO: {Operation:createPodInfraContainer Method:container_manager_latency_microseconds Quantile:0.9 Latency:17.403305s}
Jan  6 00:09:38.454: INFO: {Operation:create_container Method:docker_operations_latency_microseconds Quantile:0.9 Latency:12.668864s}
Jan  6 00:09:38.454: INFO: {Operation: Method:pod_worker_start_latency_microseconds Quantile:0.9 Latency:12.609544s}
Jan  6 00:09:38.454: INFO: {Operation:remove_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:11.227787s}
Jan  6 00:09:38.454: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.5 Latency:10.568633s}
Jan  6 00:09:38.454: INFO: {Operation:list_containers Method:docker_operations_latency_microseconds Quantile:0.99 Latency:10.550709s}

jwhonce commented 7 years ago

@ncdc Do you have a guess at the number of containers being tracked by the docker daemon when you start seeing the excessive latencies? $(docker info |grep 'Containers:') Thanks.

smarterclayton commented 7 years ago

Running containers are printed in the junit failures. It's generally between 10-30 containers running, including pause containers, with some CPU spikes. In the ones I was seeing, it was a 2 core system so I have to assume that docker was under CPU pressure (probably not memory).

soltysh commented 7 years ago

@jwhonce running oc run latency-test --image gcr.io/google_containers/pause:2.0 --replicas=30 will give you those failing containers; it's from the linked doc (https://docs.google.com/document/d/1AaNZTggal-OUjgJah7FV4mWYNS-in5cHTBLf0BDIkh4/edit).
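
A rough way to drive that reproducer and put a number on the start latency; sketch only, assuming oc is logged in, on PATH, and that oc run labels its pods with run=latency-test:

```go
// Sketch: start 30 pause pods via `oc run` and report how long it takes
// for all of them to reach Running. Assumes the run=latency-test label
// that `oc run` applies to the pods it creates.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func main() {
	start := time.Now()
	out, err := exec.Command("oc", "run", "latency-test",
		"--image", "gcr.io/google_containers/pause:2.0", "--replicas=30").CombinedOutput()
	if err != nil {
		panic(fmt.Sprintf("oc run failed: %v\n%s", err, out))
	}

	for {
		phasesOut, err := exec.Command("oc", "get", "pods", "-l", "run=latency-test",
			"-o", "jsonpath={.items[*].status.phase}").Output()
		if err != nil {
			panic(err)
		}
		running := 0
		for _, phase := range strings.Fields(string(phasesOut)) {
			if phase == "Running" {
				running++
			}
		}
		fmt.Printf("%d/30 running after %v\n", running, time.Since(start).Round(time.Second))
		if running == 30 {
			return
		}
		time.Sleep(10 * time.Second)
	}
}
```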

smarterclayton commented 7 years ago

My concern now is that we've moved from "things are 50% slower" to "something is hanging / pausing and timing out", where the former applies to simple containers and the latter to complex workloads running concurrently (like parallel builds).

smarterclayton commented 7 years ago

I'm finally gathering metrics data from the GCE cluster runs (so that we have prometheus data that can at least tell us if we cross certain thresholds). I'm not sure, however, that the default 0.99 quantile is going to capture outliers - we don't really know whether the 3m is an outlier (the entire system goes into IO inversion and one container stalls for 3 minutes) or whether the latency is better distributed. Hoping to get that with the latest metrics: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_gce/332/ (has links to the .metrics files from each node, which are the prometheus captures).

I did increase the GCE SSD sizes, which gives us a small bump in IOPS quota, but we're still seeing some throttles.
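
A hedged sketch of reading the 0.99 series back out of a Prometheus server that has those captures loaded; the server URL is a placeholder, and because pod_start_latency_microseconds is a pre-aggregated summary this can only report the exported quantiles - it cannot recover individual outliers, which is exactly the limitation noted above:

```go
// Sketch: query a Prometheus server (placeholder URL) for the 99th
// percentile pod start latency per node, converted to seconds. The metric
// name matches the e2e output above; depending on version it may carry a
// kubelet_ prefix.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	base := "http://prometheus.example.com:9090" // assumption: where the captures were loaded
	params := url.Values{}
	params.Set("query", `pod_start_latency_microseconds{quantile="0.99"} / 1e6`)

	resp, err := http.Get(base + "/api/v1/query?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // JSON result: one series per node / label set
}
```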

timothysc commented 7 years ago

@vikaslaad and @mffiedler have started a new experiment today and we should have data soon-ish.

bparees commented 7 years ago

https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/48/

smarterclayton commented 7 years ago

This may be https://github.com/openshift/origin/issues/12784 instead

smarterclayton commented 7 years ago

Even though 2m is an example of this problem

smarterclayton commented 7 years ago

Here's an example of where we waited a very long time and there were no running pods - this would have to be an incredibly slow layer creation to be the docker lock issue:

https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_gce/870/testReport/junit/(root)/Extended/deploymentconfigs_should_respect_image_stream_tag_reference_policy__Conformance_/

I don't see more than a few pods running on any node.

smarterclayton commented 7 years ago

I did a run on overlayFS tonight. I still see:

Mar  6 22:42:09.205: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.5 Latency:2m2.536474s}
Mar  6 22:42:09.205: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.99 Latency:2m2.536474s}
Mar  6 22:42:09.205: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.9 Latency:2m2.536474s}
Mar  6 22:42:09.205: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.99 Latency:51.995394s}
Mar  6 22:42:09.205: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:30.130964s}
Mar  6 22:42:09.205: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.99 Latency:25.470511s}
Mar  6 22:42:09.205: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.9 Latency:23.547013s}
Mar  6 22:42:09.205: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:18.910722s}
Mar  6 22:42:09.205: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.9 Latency:17.537977s}

I'm still seeing IO limits being poked (consistent with devicemapper, actually, although that could just be the tests running faster or docker pulls still being the bulk of the workload).

@ncdc @sjenning @derekwaynecarr @jwhonce I should be able to give you repeatable clusters with overlay tomorrow - what do we need to do to verify that this is the CreateLayer path previously fingered as the culprit, vs. something more fundamental (say, the Docker daemon being saturated and the kubelet just giving up)? This is looking more like a kubelet / docker interaction.
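
One way to separate "layer/container creation is slow" from "the daemon as a whole is wedged", without the kubelet in the loop, is to time the docker operations individually while the node is under load. A standalone sketch; the image and iteration count are arbitrary, and this is not the verification procedure the team settled on:

```go
// Sketch: time `docker create`, `docker start`, and `docker rm` separately
// so slow container/layer creation can be told apart from a generally
// saturated daemon. Assumes the docker CLI is on PATH and the image is
// already pulled.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func timed(name string, args ...string) (string, time.Duration) {
	start := time.Now()
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		fmt.Printf("%s %v failed: %v\n%s", name, args, err, out)
	}
	return strings.TrimSpace(string(out)), time.Since(start)
}

func main() {
	const image = "gcr.io/google_containers/pause:2.0"
	for i := 0; i < 10; i++ {
		id, createDur := timed("docker", "create", image)
		_, startDur := timed("docker", "start", id)
		_, rmDur := timed("docker", "rm", "-f", id)
		fmt.Printf("run %d: create=%v start=%v rm=%v\n", i, createDur, startDur, rmDur)
	}
}
```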

derekwaynecarr commented 7 years ago

xref: https://bugzilla.redhat.com/show_bug.cgi?id=1422511

we are hitting cases where docker image listing takes > 1min to complete; the machine is not actually memory starved when this occurs.

liggitt commented 7 years ago

This p0 has been open for 10 months, and inactive for 4. Is it actionable? If not, is it really a p0?

jwhonce commented 7 years ago

@liggitt I leave the priority to you to set/change. As for news on the issue, I finally have a reproducer that uses only docker and am working with the performance team to further isolate the cause of the slowdowns.

jwhonce commented 7 years ago

Ops are updating their monitoring: https://trello.com/c/cB9lX5sb. The suspected cause is volume BurstBalance being exhausted.

xThomo commented 7 years ago

@jwhonce could you give more details on how the issue was reproduced, and is there any update on this?

jwhonce commented 7 years ago

@xThomo The reproducer ran docker run in multiple threads, which showed that the BurstBalance on the volumes was being exhausted. The Ops team is now implementing monitoring of that metric to determine whether that is the root cause of these issues, or whether there are other gating factors.
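
For context, a minimal standalone approximation of that style of reproducer (not the actual one): hammer the daemon with concurrent docker runs and print per-run wall time. Worker count, run count, and the busybox image are arbitrary choices; on EBS-backed volumes the BurstBalance metric in CloudWatch should visibly drain while this runs.

```go
// Sketch: launch `docker run` from several goroutines and print per-run
// latency. An approximation of the reproducer described above, not the
// actual one. Assumes docker is on PATH and busybox is already pulled.
package main

import (
	"fmt"
	"os/exec"
	"sync"
	"time"
)

func main() {
	const (
		workers = 8         // arbitrary
		runs    = 20        // arbitrary
		image   = "busybox" // any small image with a quick-exiting command works
	)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < runs; i++ {
				start := time.Now()
				// --rm so each run also exercises container teardown.
				out, err := exec.Command("docker", "run", "--rm", image, "true").CombinedOutput()
				if err != nil {
					fmt.Printf("worker %d run %d failed: %v\n%s", w, i, err, out)
					continue
				}
				fmt.Printf("worker %d run %d took %v\n", w, i, time.Since(start))
			}
		}(w)
	}
	wg.Wait()
}
```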

smarterclayton commented 6 years ago

All other issues are separate now. Closing.