Closed: 0xmichalis closed this issue 6 years ago
/cc @mffiedler ...
still seeing issues from this: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_conformance/9556/
(https://github.com/openshift/origin/issues/9076 was duped to this issue)
@kargakis are we closing in on a solution to this?
sorry, question should be directed to @jwhonce and @soltysh
and if not, I'd like to reopen https://github.com/openshift/origin/issues/11649 (which was closed as a dupe of this) so a workaround can be put in place, because it's cropping up a lot lately.
@bparees for now, increasing timeouts for the serial re-appearances is the only solution :/
@soltysh ok, then I don't think we should be duping specific test case flakes to this issue; we need to fix each flake. I'm going to reopen #11649.
Yes, please, it'll help us find those cases.
Latency metrics for node 172.18.13.129
Dec 21 15:28:31.743: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.99 Latency:2m1.879512s}
Dec 21 15:28:31.743: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.9 Latency:1m37.203206s}
Dec 21 15:28:31.743: INFO: {Operation: Method:pleg_relist_latency_microseconds Quantile:0.99 Latency:50.928686s}
Dec 21 15:28:31.743: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:48.959779s}
Dec 21 15:28:31.743: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:48.823987s}
Dec 21 15:28:31.743: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:46.16127s}
Dec 21 15:28:31.743: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.99 Latency:45.374806s}
Dec 21 15:28:31.743: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.9 Latency:45.374806s}
Dec 21 15:28:31.743: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:41.035892s}
Dec 21 15:28:31.743: INFO: {Operation: Method:pod_worker_start_latency_microseconds Quantile:0.99 Latency:40.465985s}
Dec 21 15:28:31.743: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.9 Latency:39.089519s}
Dec 21 15:28:31.743: INFO: {Operation:createPodInfraContainer Method:container_manager_latency_microseconds Quantile:0.99 Latency:36.979561s}
Dec 21 15:28:31.743: INFO: {Operation:list_containers Method:docker_operations_latency_microseconds Quantile:0.99 Latency:31.90826s}
Dec 21 15:28:31.743: INFO: {Operation:GetPods Method:container_manager_latency_microseconds Quantile:0.99 Latency:31.907478s}
Dec 21 15:28:31.743: INFO: {Operation:inspect_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:25.352457s}
Dec 21 15:28:31.743: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:13.037115s}
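For reference, those summaries come from the kubelet's Prometheus metrics; a minimal sketch of pulling them straight off a node (assuming the read-only port 10255 is enabled in this CI setup, otherwise the authenticated :10250/metrics endpoint would be needed):

```sh
# Dump the raw latency summaries from a node's kubelet metrics endpoint.
# The node IP and the read-only port 10255 are assumptions about this environment.
NODE=172.18.13.129
curl -s "http://${NODE}:10255/metrics" \
  | grep -E 'pod_start_latency_microseconds|docker_operations_latency_microseconds'
```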
Wow.
This is a docker problem. I'm seeing 1/4 conformance runs fail because of this, and start latencies in excess of 2-3 minutes during those runs. Something is broken in Docker, possibly related to the parallel builds hang.
Raising this to P0 - @jwhonce, please make sure the appropriate investigations are taking place. This would be a production blocker for many use cases (a loaded node takes many minutes to start some pods).
There was at least one container that took 4 minutes to start:
Latency metrics for node ci-pr65-ig-n-x0v5.c.openshift-gce-devel-ci.internal
Jan 6 00:09:38.454: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.99 Latency:4m0.426184s}
Jan 6 00:09:38.454: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.9 Latency:2m33.107112s}
Jan 6 00:09:38.454: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:2m15.985041s}
Jan 6 00:09:38.454: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.9 Latency:2m3.638843s}
Jan 6 00:09:38.454: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.99 Latency:2m3.638843s}
Jan 6 00:09:38.454: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.99 Latency:1m36.1392s}
Jan 6 00:09:38.454: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.9 Latency:1m26.463562s}
Jan 6 00:09:38.454: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:49.979557s}
Jan 6 00:09:38.454: INFO: {Operation:createPodInfraContainer Method:container_manager_latency_microseconds Quantile:0.99 Latency:47.685318s}
Jan 6 00:09:38.454: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:37.203829s}
Jan 6 00:09:38.454: INFO: {Operation:create_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:31.249116s}
Jan 6 00:09:38.454: INFO: {Operation: Method:pleg_relist_latency_microseconds Quantile:0.99 Latency:30.389475s}
Jan 6 00:09:38.454: INFO: {Operation: Method:pod_worker_start_latency_microseconds Quantile:0.99 Latency:24.973853s}
Jan 6 00:09:38.454: INFO: {Operation:runContainerInPod Method:container_manager_latency_microseconds Quantile:0.9 Latency:21.276118s}
Jan 6 00:09:38.454: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:19.079327s}
Jan 6 00:09:38.454: INFO: {Operation:createPodInfraContainer Method:container_manager_latency_microseconds Quantile:0.9 Latency:17.403305s}
Jan 6 00:09:38.454: INFO: {Operation:create_container Method:docker_operations_latency_microseconds Quantile:0.9 Latency:12.668864s}
Jan 6 00:09:38.454: INFO: {Operation: Method:pod_worker_start_latency_microseconds Quantile:0.9 Latency:12.609544s}
Jan 6 00:09:38.454: INFO: {Operation:remove_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:11.227787s}
Jan 6 00:09:38.454: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.5 Latency:10.568633s}
Jan 6 00:09:38.454: INFO: {Operation:list_containers Method:docker_operations_latency_microseconds Quantile:0.99 Latency:10.550709s}
@ncdc Do you have a guess at the number of containers being tracked by the docker daemon when you start seeing the excessive latencies? $(docker info |grep 'Containers:') Thanks.
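Something along these lines on the node would capture that over time so it can be lined up with the latency spikes (the interval and log path are just placeholders):

```sh
# Sample the daemon's container count every 30s with a timestamp, so the
# counts can be correlated with the latency spikes in the e2e output.
# The interval and output file are arbitrary.
while true; do
  echo "$(date -u +%FT%TZ) $(docker info 2>/dev/null | grep 'Containers:')" >> /tmp/docker-container-counts.log
  sleep 30
done
```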
Running containers are printed in the junit failures. It's generally between 10 and 30 containers running, including pause containers, with some CPU spikes. In the ones I was seeing, it was a 2-core system, so I have to assume that docker was under CPU pressure (probably not memory).
@jwhonce running oc run latency-test --image gcr.io/google_containers/pause:2.0 --replicas=30
will give you those failing containers; it's from the linked doc (https://docs.google.com/document/d/1AaNZTggal-OUjgJah7FV4mWYNS-in5cHTBLf0BDIkh4/edit).
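If it helps, a rough way to put client-side numbers on it (assuming oc run applies its default run=latency-test label to the pods it creates):

```sh
# Time how long it takes all 30 pause pods to reach Running.
# Assumes the pods carry the default run=latency-test label from oc run.
oc run latency-test --image gcr.io/google_containers/pause:2.0 --replicas=30
start=$(date +%s)
until [ "$(oc get pods -l run=latency-test --no-headers | grep -c ' Running ')" -eq 30 ]; do
  sleep 5
done
echo "all 30 pods Running after $(( $(date +%s) - start ))s"
```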
My concern now is that we've moved from "things are 50% slower" to "something is hanging / pausing and timing out", where the former applies to simple containers and the latter to complex workloads running concurrently (like parallel builds).
I'm finally gathering metrics data from the GCE cluster runs (so that we have Prometheus data that can at least tell us if we cross certain thresholds). I'm not sure, however, that the default 0.99 quantile is going to capture outliers - we don't really know whether the 3m is an outlier (the entire system goes into IO inversion and one container stops for 3 minutes) or whether it's better distributed. Hoping to get that with the latest metrics: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_gce/332/ (has links to the .metrics files from each node, which are the Prometheus captures).
I did increase the GCE SSD sizes, which gives us a small bump in IOPS quota, but we're still seeing some throttles.
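If the captured .metrics files are plain Prometheus text exposition, a quick way to see whether the high quantiles are isolated outliers or broadly distributed (the file name is a placeholder for the files linked from the job):

```sh
# Pull all pod-start-latency quantiles out of one node capture; a big jump
# between the 0.9 and 0.99 values points at isolated outliers rather than
# a general slowdown.
grep 'pod_start_latency_microseconds{' node.metrics | grep 'quantile=' | sort
```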
@vikaslaad and @mffiedler have started a new experiment today and we should have data soon-ish.
This may be https://github.com/openshift/origin/issues/12784 instead
In reply to Ben Parees: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/48/
Even though 2m is an example of this problem
Here's an example of where we waited a very long time and there were no running pods - this would have to be an incredibly slow layer creation to be the docker lock issue:
I don't see more than a few pods running on any node.
I did a run on overlayFS tonight. I still see:
Mar 6 22:42:09.205: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.5 Latency:2m2.536474s}
Mar 6 22:42:09.205: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.99 Latency:2m2.536474s}
Mar 6 22:42:09.205: INFO: {Operation:sync Method:pod_worker_latency_microseconds Quantile:0.9 Latency:2m2.536474s}
Mar 6 22:42:09.205: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.99 Latency:51.995394s}
Mar 6 22:42:09.205: INFO: {Operation:stop_container Method:docker_operations_latency_microseconds Quantile:0.99 Latency:30.130964s}
Mar 6 22:42:09.205: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.99 Latency:25.470511s}
Mar 6 22:42:09.205: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.9 Latency:23.547013s}
Mar 6 22:42:09.205: INFO: {Operation:SyncPod Method:container_manager_latency_microseconds Quantile:0.99 Latency:18.910722s}
Mar 6 22:42:09.205: INFO: {Operation:pull_image Method:docker_operations_latency_microseconds Quantile:0.9 Latency:17.537977s}
I'm still seeing IO limits being poked (consistent with devicemapper, actually, although that could just be the tests running faster or docker pulls still being the bulk of the workload).
@ncdc @sjenning @derekwaynecarr @jwhonce I should be able to give you repeatable clusters with overlay tomorrow - what do we need to do to verify that this is the CreateLayer path previously fingered as the culprit, vs something more fundamental (say, the Docker daemon being saturated and the kubelet just giving up)? This is looking more like a kubelet / docker interaction.
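One thing that might help separate "CreateLayer is slow" from "the daemon is saturated / blocked on a lock": grab a goroutine dump from the daemon while a node is in the bad state, and time a trivial container start under the same load. A sketch, assuming our docker version still dumps stacks to the journal on SIGUSR1 and that busybox is around as a throwaway image:

```sh
# Ask dockerd for a goroutine dump while the node is slow, then look for
# goroutines parked in storage-driver or lock-acquisition paths.
# Assumes the daemon logs the dump to the journal on SIGUSR1.
kill -USR1 "$(pidof dockerd)"
sleep 2
journalctl -u docker --since '5 minutes ago' | grep -i goroutine | head

# For comparison, time a trivial container start under the same load; if even
# this takes tens of seconds, the bottleneck is the daemon, not layer creation.
time docker run --rm busybox true
```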
xref: https://bugzilla.redhat.com/show_bug.cgi?id=1422511
We are hitting cases where listing docker images takes > 1 min to complete; the machine is not actually memory-starved when this occurs.
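In case it helps correlate with the other latencies, a trivial loop for keeping an eye on that while the tests run (the interval is arbitrary):

```sh
# Log how long a plain image listing takes every 30s; anything over a few
# seconds points at the same daemon-side contention.
while true; do
  start=$(date +%s%N)
  docker images > /dev/null
  echo "$(date -u +%FT%TZ) docker images took $(( ($(date +%s%N) - start) / 1000000 ))ms"
  sleep 30
done
```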
This p0 has been open for 10 months, and inactive for 4. Is it actionable? If not, is it really a p0?
@liggitt I leave the priority to you to set/change. As for news on the issue, I finally have a reproducer that uses only docker, and I am working with the performance team to further isolate the cause of the slowdowns.
Ops are updating their monitoring: https://trello.com/c/cB9lX5sb. The suspected cause is volume BurstBalance being exhausted.
@jwhonce could you give more details on how the issue was reproduced, and is there any update on this?
@xThomo The reproducer ran docker run in multiple threads, which showed that the BurstBalance on the volumes was exhausted. The Ops team is now implementing monitoring of that metric to determine whether that is the root cause of these issues, or whether there are other gating factors.
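For anyone who wants to poke at this themselves, a rough sketch of that style of reproducer plus the CloudWatch query for the volume metric (the image, thread/iteration counts, and volume ID are placeholders; BurstBalance only applies to gp2 EBS volumes):

```sh
# Drive the daemon with 20 parallel streams of short-lived containers,
# similar in spirit to the multi-threaded docker run reproducer described above.
for i in $(seq 1 20); do
  ( for j in $(seq 1 50); do docker run --rm busybox true; done ) &
done
wait

# Watch whether the volume's burst credits drain while the load runs.
# vol-0123456789abcdef0 is a placeholder volume ID.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time "$(date -u -d '1 hour ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)" \
  --period 300 --statistics Average
```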
All other issues are separate now. Closing.
Our deployment test suite has started flaking more frequently due to deployer pods needing more time to become ready. We don't have any valuable data since all of our flakes are due to our tests being time-bounded.
See: https://github.com/openshift/origin/pull/11001 and related flakes: https://github.com/openshift/origin/issues/10951 https://github.com/openshift/origin/issues/11008 https://github.com/openshift/origin/issues/10989
cc: @derekwaynecarr @smarterclayton @mfojtik