Integration tests sometimes hang and hit 2 hour timeout

openshift / origin

Integration tests sometimes hang and hit 2 hour timeout #15093

Closed by bparees 7 years ago

bparees commented 7 years ago

As seen here, the integration tests hung and got timed out after 2 hours (this is why we introduced the timeouts):

https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_integration/4213

++ Building go targets for linux/amd64: test/integration/integration.test
hack/build-go.sh took 562 seconds

[INFO] hack/test-integration.sh exited with code 0 after 00h 33m 30s
hack/test-end-to-end.sh
++ Docker is installed, running hack/test-end-to-end-docker.sh instead.

[INFO] hack/test-end-to-end-docker.sh exited with code 0 after 01h 17m 08s
++ export status=FAILURE
++ status=FAILURE
+ set +o xtrace
########## FINISHED STAGE: FAILURE: RUN INTEGRATION TESTS [02h 00m 28s] ##########

test-end-to-end-docker.sh took 8 minutes in a clean run but 1h 17m here, so the problem seems to be there.

0xmichalis commented 7 years ago

Also, I am in favor of splitting those two tests into different jobs. I am more worried about the times our tests are taking than the actual number of the jobs we are running. As an example K8S is running at least 8 presubmits today.

0xmichalis commented 7 years ago

cc: @stevekuznetsov

bparees commented 7 years ago

> Also, I am in favor of splitting those two tests into different jobs. I am more worried about the times our tests are taking than the actual number of the jobs we are running.

well again that test normally runs in 8 minutes, it's not taking a huge amount of time.

0xmichalis commented 7 years ago

> well again that test normally runs in 8 minutes, it's not taking a huge amount of time.

That doesn't justify shoehorning it into another job. We would get a clearer signal if this were a separate job, as opposed to waiting 2+ hours. Even the timeout option is not granular enough in this case. Also, short-running jobs > long-running jobs.
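To illustrate what a more granular timeout could look like, here is a minimal sketch that gives each stage its own time budget, so a hang is attributed to one stage instead of surfacing as a single 2-hour job timeout. The `run_stage` helper and the budgets in the usage comment are hypothetical and are not how the CI job is actually wired up.

```python
import subprocess
import sys


def run_stage(name, cmd, timeout_s):
    """Run one stage with its own timeout.

    Returns (name, status), where status is "ok", "failed", or "timeout".
    Hypothetical helper for illustration only.
    """
    try:
        # subprocess.run kills the child process if the timeout expires,
        # then raises TimeoutExpired.
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return (name, "timeout")
    return (name, "ok" if result.returncode == 0 else "failed")


# Example usage (budgets are assumptions, not values from the real job):
#   run_stage("integration", ["hack/test-integration.sh"], 45 * 60)
#   run_stage("end-to-end", ["hack/test-end-to-end.sh"], 20 * 60)
```

Note that the timeout only kills the direct child process; anything the script itself spawned may need separate cleanup.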

stevekuznetsov commented 7 years ago

Right now the justification is that AWS EC2 charges by the hour, so 8min == 1h. Our costs would more than double if we split things out as I would like to. As we move forward with @csrwng's Pod-based jobs I am confident we can break things up into very bite-size pieces, to the point of each verify step being a job, etc. We're on GCE there, which is billed per-minute, so our quantization error is much smaller.

0xmichalis commented 7 years ago

The integration job today runs for 1½ hours on clean runs (I hate that we don't have a graph with all these metrics), which means that if we could break the integration test down to less than an hour, we would see no difference from today billing-wise.
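The billing arithmetic in this exchange can be sketched as follows. The per-hour versus per-minute rounding comes from the comments above; the individual runtimes for the split jobs are made-up illustrative numbers, and the sketch ignores the extra time a split-out job would spend rebuilding a release.

```python
import math


def billed_minutes(runtime_min, granularity_min):
    """Round a runtime up to the next billing increment."""
    return math.ceil(runtime_min / granularity_min) * granularity_min


def total_billed(runtimes_min, granularity_min):
    """Total billed minutes for a set of jobs, each on its own instance."""
    return sum(billed_minutes(r, granularity_min) for r in runtimes_min)


HOURLY = 60      # billed by the hour
PER_MINUTE = 1   # billed by the minute

combined = [90]            # one job of about 1.5 hours
split_naive = [82, 8]      # hypothetical: only the 8-minute test is split out
split_balanced = [50, 40]  # hypothetical: both halves kept under an hour

for label, jobs in [("combined", combined),
                    ("split_naive", split_naive),
                    ("split_balanced", split_balanced)]:
    print(label,
          "hourly:", total_billed(jobs, HOURLY),
          "per-minute:", total_billed(jobs, PER_MINUTE))
```

Under hourly billing, the combined job and the balanced split both come to 120 billed minutes, while splitting out only the 8-minute test comes to 180. Under per-minute billing all three come to 90.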

stevekuznetsov commented 7 years ago

Yep, we could spend time doing that. We have just been conservative about it in the past for cost reasons, and we did not take the time to switch them out afterward, just due to priorities. If you want to make the switch in aos-cd-jobs, that sounds fine by me. The new job won't be 8min, though, as we will need to rebuild a release, but it should be <1h.

smarterclayton commented 7 years ago

It's more useful to make install_update faster than it is to split out these jobs, because install_update is the slowest job in the queue.

On Fri, Jul 7, 2017 at 1:38 PM, Steve Kuznetsov notifications@github.com wrote:

Yep, we could spend time doing that. We have just been conservative about it in the past for cost reasons, and we did not take the time to switch them out afterward, just due to priorities. If you want to make the switch in aos-cd-jobs, that sounds fine by me. The new job won't be 8min, though, as we will need to rebuild a release, but it should be <1h.

0xmichalis commented 7 years ago

Agreed about making install_update faster. Opened https://github.com/openshift/aos-cd-jobs/issues/409, https://github.com/openshift/aos-cd-jobs/issues/408, and https://github.com/openshift/aos-cd-jobs/issues/407

smarterclayton commented 7 years ago

Integration is slow because we have terrible code running in it. Spawning a separate issue.

smarterclayton commented 7 years ago

David split all this.