openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

client: etcd cluster is unavailable or misconfigured #7243

Closed: smarterclayton closed this issue 7 years ago

smarterclayton commented 8 years ago

https://ci.openshift.redhat.com/jenkins/job/origin_extended/617/consoleFull

Is this the disk stall thing?

[INFO] OpenShift server health checks done at: Thu Feb 11 20:37:15 EST 2016
[INFO] Installing the registry
deploymentconfig "docker-registry" created
service "docker-registry" created
[INFO] Waiting for command to finish: '[[ "$(oc get endpoints docker-registry --output-version=v1 --template="{{ if .subsets }}{{ len .subsets }}{{ else }}0{{ end }}" --config=${ADMIN_KUBECONFIG} || echo "0")" != "0" ]]'...
Error from server: endpoints "docker-registry" not found
[INFO] Success running command: '[[ "$(oc get endpoints docker-registry --output-version=v1 --template="{{ if .subsets }}{{ len .subsets }}{{ else }}0{{ end }}" --config=${ADMIN_KUBECONFIG} || echo "0")" != "0" ]]' after 7 seconds
[INFO] Installing the router
serviceaccount "router" created
securitycontextconstraints "privileged" replaced
[INFO] Generating router TLS certificate
password for stats user admin has been set to KrLgnvWlAb
deploymentconfig "router" created
service "router" created
[INFO] Creating image streams
imagestream "nodejs" created
imagestream "perl" created
imagestream "php" created
imagestream "python" created
imagestream "wildfly" created
imagestream "mysql" created
imagestream "postgresql" created
imagestream "mongodb" created
imagestream "jenkins" created
Error from server: client: etcd cluster is unavailable or misconfigured
!!! Error in test/extended/core.sh:74
    'oc create -n openshift -f examples/image-streams/image-streams-centos7.json --config="${ADMIN_KUBECONFIG}"' exited with status 1
Call stack:
    1: test/extended/core.sh:74 main(...)
Exiting with status 1

liggitt commented 8 years ago

Hrmm, I don't remember seeing that error. I'm used to "timeout" and "failed to propose" issues

mfojtik commented 8 years ago

do we health check etcd when starting the server?

deads2k commented 8 years ago

Hrmm, I don't remember seeing that error. I'm used to "timeout" and "failed to propose" issues

Could be a new error with the new client. Remember DANGER in the rebase?

liggitt commented 8 years ago

I think you're right. The extended Jenkins tests aren't doing anything to work around the disk issue

smarterclayton commented 8 years ago

The Mystikal song Danger?

deads2k commented 8 years ago

Well, we can see the error now, and it looks to be the same "has no leader" case, marked in the source with a comment: // TODO: make sure this is a no leader response. I have no good ideas for how to solve this problem.

deads2k commented 8 years ago

Since I have the link handy: this used to be caused by slowness here: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/client/client.go#L294-L296. We'd probably be well served to enable and gather Prometheus metrics during test-integration to see whether that's still the spot with odd spikes.
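
For context, a rough sketch of what gathering those metrics during a run could look like, assuming the embedded etcd exposes its Prometheus endpoint on the client port that shows up in the logs later in this thread (127.0.0.1:34001); the grep pattern, interval, and output path are placeholders, and metric names differ across etcd versions:

# Hypothetical polling loop: scrape the embedded etcd's /metrics endpoint
# every few seconds during a test-integration run and keep the fsync-related
# latency lines, prefixed with a timestamp.
while sleep 5; do
    curl -ks https://127.0.0.1:34001/metrics | grep -i fsync | sed "s/^/$(date +%T) /"
done >> /tmp/etcd-metrics.log &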

deads2k commented 8 years ago

Sorry, this link: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/wal/wal.go#L398-L408

smarterclayton commented 8 years ago

There are three options I'm considering

  1. The known EBS sync delay is causing this - testable by (a) whether a retry makes the problem go away, (b) logs and SAR
  2. An unknown, smaller EBS sync delay is causing this (due to random network delays in production) - testable by whether a retry makes the problem go away, SAR, and possibly looking at the IOPS budget
  3. A leader stability issue, or another internal issue that only occurs in the first few seconds of an etcd instance - testable by whether a retry makes the problem go away, and by the etcd logs

Right now, integration tests are triggering this early in the server start process. I haven't yet seen one during the actual test.

smarterclayton commented 8 years ago

Other things to test - putting etcd on memory storage. I don't expect we need more than 1-20M of data in etcd.
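
As a sketch of that memory-backed setup (assuming the test harness places etcd's scratch data under TMPDIR, which a later comment in this thread suggests it does; mount point and size are arbitrary placeholders):

# Back the scratch space with tmpfs so etcd's WAL and snapshot writes never
# touch EBS.
sudo mkdir -p /mnt/etcd-ram
sudo mount -t tmpfs -o size=512m tmpfs /mnt/etcd-ram
export TMPDIR=/mnt/etcd-ram
OS_OUTPUT_GOPATH=1 hack/test-integration.sh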

deads2k commented 8 years ago

Other things to test - putting etcd on memory storage. I don't expect we need more than 1-20M of data in etcd.

We do that for the merge queue and it rarely (never) hits this problem.

smarterclayton commented 8 years ago

I think I've seen the merge queue fail on this problem today; will check.

smarterclayton commented 8 years ago

https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5104/consoleFull#72516821056bdfb7fe4b05b79524e5734

smarterclayton commented 8 years ago

But this is storage/testing, not the "start etcd on a ramdisk". It's using TMPDIR.

smarterclayton commented 8 years ago

The health-check logs look like this (from the Swagger generation failing because the API server did not come up in time):

I0225 11:34:15.728218    6100 panic.go:17] Process will terminate as soon as a panic occurs.
I0225 11:34:15.739334    6100 start_allinone.go:256] Starting profiling endpoint at http://127.0.0.1:6060/debug/pprof/
W0225 11:34:16.017350    6100 start_master.go:267] assetConfig.loggingPublicURL: Invalid value: "": required to view aggregated container logs in the console
W0225 11:34:16.017384    6100 start_master.go:267] assetConfig.metricsPublicURL: Invalid value: "": required to view cluster metrics in the console
I0225 11:34:16.727810    6100 plugins.go:71] No cloud provider specified.
I0225 11:34:16.864896    6100 interface.go:248] Default route transits interface "eth0"
I0225 11:34:16.865029    6100 interface.go:93] Interface eth0 is up
I0225 11:34:16.865078    6100 interface.go:138] Interface "eth0" has 2 addresses :[172.18.15.14/20 fe80::cfe:19ff:fe80:a44d/64].
I0225 11:34:16.865097    6100 interface.go:105] Checking addr  172.18.15.14/20.
I0225 11:34:16.865104    6100 interface.go:114] IP found 172.18.15.14
I0225 11:34:16.865113    6100 interface.go:144] valid IPv4 address for interface "eth0" found as 172.18.15.14.
I0225 11:34:16.865119    6100 interface.go:254] Choosing IP 172.18.15.14 
I0225 11:34:16.865133    6100 master_config.go:222] Will report 172.18.15.14 as public IP address.
I0225 11:34:16.865153    6100 start_master.go:378] Starting master on 127.0.0.1:38443 (v1.1.3-284-g6be4a29)
I0225 11:34:16.865158    6100 start_master.go:379] Public master address is https://127.0.0.1:38443
I0225 11:34:16.865171    6100 start_master.go:383] Using images from "openshift/origin-<component>:latest"
I0225 11:34:17.006597    6100 server.go:64] etcd: peerTLS: cert = /tmp/openshift/generate-swagger-spec/openshift.local.config/master/etcd.server.crt, key = /tmp/openshift/generate-swagger-spec/openshift.local.config/master/etcd.server.key, ca = /tmp/openshift/generate-swagger-spec/openshift.local.config/master/ca.crt, trusted-ca = , client-cert-auth = false
I0225 11:34:17.148984    6100 server.go:75] etcd: listening for peers on https://127.0.0.1:37001
I0225 11:34:17.149013    6100 server.go:86] etcd: clientTLS: cert = /tmp/openshift/generate-swagger-spec/openshift.local.config/master/etcd.server.crt, key = /tmp/openshift/generate-swagger-spec/openshift.local.config/master/etcd.server.key, ca = /tmp/openshift/generate-swagger-spec/openshift.local.config/master/ca.crt, trusted-ca = , client-cert-auth = false
I0225 11:34:17.289804    6100 server.go:100] etcd: listening for client requests on https://127.0.0.1:34001

# 15s here

I0225 11:34:33.477671    6100 run.go:61] Started etcd at 127.0.0.1:34001
I0225 11:34:33.649050    6100 run_components.go:189] Using default project node label selector: 
I0225 11:34:34.064036    6100 genericapiserver.go:337] Setting GenericAPIServer service IP to "172.30.0.1" (read-write).

# 8s here

I0225 11:34:42.467293    6100 trace.go:57] Trace "etcdHelper::Create *api.Namespace" (started 2016-02-25 11:34:39.927644263 -0500 EST):
[127.066µs] [127.066µs] Object encoded
[133.519µs] [6.453µs] Version checked
[2.539418694s] [2.539285175s] Object created
[2.53960844s] [189.746µs] END
W0225 11:34:42.573052    6100 controller.go:262] Resetting endpoints for master service "kubernetes" to &{{ } {kubernetes  default    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[]} [{[{172.18.15.14 <nil>}] [] [{https 38443 TCP} {dns 53 UDP} {dns-tcp 53 TCP}]}]}
I0225 11:34:42.788239    6100 plugin.go:27] Route plugin initialized with suffix=router.default.svc.cluster.local
I0225 11:34:43.016096    6100 master.go:258] Started Kubernetes API at 127.0.0.1:38443/api/v1
I0225 11:34:43.016129    6100 master.go:258] Started Kubernetes API Extensions at 127.0.0.1:38443/apis/extensions/v1beta1
I0225 11:34:43.016136    6100 master.go:258] Started Origin API at 127.0.0.1:38443/oapi/v1
I0225 11:34:43.016141    6100 master.go:258] Started OAuth2 API at 127.0.0.1:38443/oauth
I0225 11:34:43.016146    6100 master.go:258] Started Login endpoint at 127.0.0.1:38443/login
I0225 11:34:43.016151    6100 master.go:258] Started Web Console 127.0.0.1:38443/console/
I0225 11:34:43.016166    6100 master.go:258] Started Swagger Schema API at 127.0.0.1:38443/swaggerapi/
I0225 11:34:43.231929    6100 ensure.go:177] No cluster policy found.  Creating bootstrap policy based on: /tmp/openshift/generate-swagger-spec/openshift.local.config/master/policy.json
I0225 11:34:43.232028    6100 decoder.go:141] decoding stream as JSON

This is immediately after an 8 GB write. We had to wait almost 15 seconds for etcd to start (touching WAL files), 8 seconds for some of the early init code, and 2 seconds for the namespace to be created. This looks like something else - not disk, not leader stability, but a very long CPU delay.

smarterclayton commented 8 years ago

Or arbitrary network delay.

smarterclayton commented 8 years ago

Started up an instance, did a clone of master, ran

OS_OUTPUT_GOPATH=1 TIMES=1000 hack/test-integration.sh TestImageStreamList

The integration test fails roughly every 20th iteration. This is not a disk issue, because no extra disk space is being consumed (/tmp stays at about 100k of contents).

smarterclayton commented 8 years ago

CPU load is roughly one core, averaged.

smarterclayton commented 8 years ago

Actually, seeing this fail ~1/5 times.

smarterclayton commented 8 years ago

Got debug trace:

D0225 20:21:24.765587 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [GET] /v2/keys/kubernetes.io/secrets?recursive=true&wait=true&waitIndex=111 remote:127.0.0.1:51108
E0225 20:21:24.766202 20443 client.go:601] 2016/02/25 20:21:24 etcdhttp: got unexpected response error (etcdserver: server stopped)
D0225 20:21:24.766984 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [GET] /v2/keys/kubernetes.io/secrets?recursive=true&wait=true&waitIndex=111 remote:127.0.0.1:51132
D0225 20:21:24.768134 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [GET] /v2/keys/openshift.io/ranges/uids?quorum=false&recursive=false&sorted=false remote:127.0.0.1:51078
D0225 20:21:24.768225 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [GET] /v2/keys/kubernetes.io/secrets?recursive=true&wait=true&waitIndex=111 remote:127.0.0.1:51094
E0225 20:21:24.768438   20443 errors.go:63] apiserver received an error that is not an unversioned.Status: error #0: client: etcd member http://127.0.0.1:35240 has no leader: 500 
E0225 20:21:24.768438   20443 errors.go:63] apiserver received an error that is not an unversioned.Status: error #0: client: etcd member http://127.0.0.1:35240 has no leader: 500 
E0225 20:21:24.768618   20443 serviceaccounts_controller.go:221] error #0: client: etcd member http://127.0.0.1:35240 has no leader: 500 
D0225 20:21:24.769874 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [PUT] /v2/keys/kubernetes.io/secrets/openshift-infra/build-controller-token-s8uy2?prevExist=false remote:127.0.0.1:51135
E0225 20:21:24.769943 20443 client.go:601] 2016/02/25 20:21:24 etcdhttp: got unexpected response error (etcdserver: server stopped)
D0225 20:21:24.770010 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [PUT] /v2/keys/openshift.io/ranges/uids?prevExist=false remote:127.0.0.1:51078
E0225 20:21:24.770058 20443 client.go:601] 2016/02/25 20:21:24 etcdhttp: got unexpected response error (etcdserver: server stopped)
D0225 20:21:24.770126 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [GET] /v2/keys/kubernetes.io/serviceaccounts/openshift-infra/build-controller?quorum=false&recursive=false&sorted=false remote:127.0.0.1:51113
E0225 20:21:24.770237   20443 errors.go:63] apiserver received an error that is not an unversioned.Status: error #0: client: etcd member http://127.0.0.1:35240 has no leader: 500 
E0225 20:21:24.770386   20443 create_dockercfg_secrets.go:107] error #0: client: etcd member http://127.0.0.1:35240 has no leader: 500 
D0225 20:21:24.770545 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [PUT] /v2/keys/kubernetes.io/serviceaccounts/default/builder?prevExist=false remote:127.0.0.1:51135
E0225 20:21:24.770604 20443 client.go:601] 2016/02/25 20:21:24 etcdhttp: got unexpected response error (etcdserver: server stopped)
F0225 20:21:24.770710   20443 run_components.go:411] Unable to initialize namespaces: unable to persist the updated namespace UID allocations: error #0: client: etcd member http://127.0.0.1:35240 has no leader: 500

smarterclayton commented 8 years ago

And another:

D0225 20:23:25.883343 20822 http.go:84] 2016/02/25 20:23:25 etcdhttp: [GET] /v2/keys/kubernetes.io/serviceaccounts/openshift-infra/build-controller?quorum=false&recursive=false&sorted=false remote:127.0.0.1:53233
D0225 20:23:25.883699 20822 http.go:84] 2016/02/25 20:23:25 etcdhttp: [PUT] /v2/keys/kubernetes.io/serviceaccounts/openshift-infra/build-controller?prevIndex=80&prevValue=%7B%22kind%22%3A%22ServiceAccount%22%2C%22apiVersion%22%3A%22v1%22%2C%22metadata%22%3A%7B%22name%22%3A%22build-controller%22%2C%22namespace%22%3A%22openshift-infra%22%2C%22uid%22%3A%229f0c5c5b-dbfd-11e5-ab9a-0e71621efd09%22%2C%22creationTimestamp%22%3A%222016-02-25T20%3A23%3A24Z%22%7D%7D%0A remote:127.0.0.1:53233
E0225 20:23:25.884595 20822 client.go:601] 2016/02/25 20:23:25 etcdhttp: got unexpected response error (etcdserver: server stopped)
E0225 20:23:25.884668 20822 client.go:601] 2016/02/25 20:23:25 etcdhttp: got unexpected response error (etcdserver: server stopped)
E0225 20:23:25.884729 20822 client.go:601] 2016/02/25 20:23:25 etcdhttp: got unexpected response error (etcdserver: server stopped)
E0225 20:23:25.884786 20822 client.go:601] 2016/02/25 20:23:25 etcdhttp: got unexpected response error (etcdserver: server stopped)
E0225 20:23:25.884896   20822 errors.go:63] apiserver received an error that is not an unversioned.Status: error #0: client: etcd member http://127.0.0.1:57579 has no leader: 500 
E0225 20:23:25.884946   20822 errors.go:63] apiserver received an error that is not an unversioned.Status: error #0: client: etcd member http://127.0.0.1:57579 has no leader: 500 
F0225 20:23:25.885023   20822 run_components.go:411] Unable to initialize namespaces: unable to persist the updated namespace UID allocations: error #0: client: etcd member http://127.0.0.1:57579 has no leader: 500 
E0225 20:23:25.904958   20822 errors.go:63] apiserver received an error that is not an unversioned.Status: error #0: client: etcd member http://127.0.0.1:57579 has no leader: 500

deads2k commented 8 years ago

"etcdserver: server stopped" means "propose took too long", which is commonly a result of "sync too long". See https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L565.

You'd think etcd errors would mean what they say, wouldn't you?

smarterclayton commented 8 years ago

I only see s.done getting closed on server.Stop; where are you seeing the propose timeout result in it being closed?

deads2k commented 8 years ago

I only see s.done getting closed on server.Stop; where are you seeing the propose timeout result in it being closed?

Crossed versions in my environment. Retracted.

smarterclayton commented 8 years ago

It looks like it's closed in a defer when run exits. Maybe it is panicking? Or otherwise falling out of the loop? I'll dig into it later this evening; at least it is reproducible.

smarterclayton commented 8 years ago

I believe I've fixed this.

smarterclayton commented 8 years ago

Reopening to track what looks like the original issue, the timeout. Going to run the PR through testing repeatedly to exercise the failure.

smarterclayton commented 8 years ago

D0226 14:08:53.618717 32130 http.go:84] 2016/02/26 14:08:53 etcdhttp: [GET] /v2/keys/kubernetes.io/ranges/serviceips?quorum=false&recursive=false&sorted=false remote:127.0.0.1:49881
D0226 14:08:53.619004 32130 http.go:84] 2016/02/26 14:08:53 etcdhttp: [PUT] /v2/keys/kubernetes.io/ranges/serviceips?prevExist=false remote:127.0.0.1:49881
E0226 14:08:58.819189 32130 client.go:601] 2016/02/26 14:08:58 etcdhttp: got unexpected response error (etcdserver: request timed out)
F0226 14:08:58.819467   32130 controller.go:86] Unable to perform initial IP allocation check: unable to persist the updated service IP allocations: error #0: client: etcd member http://127.0.0.1:45797 has no leader
D0226 14:08:53.35119 32170 http.go:84] 2016/02/26 14:08:53 etcdhttp: [GET] /v2/keys/kubernetes.io/ranges/servicenodeports?quorum=false&recursive=false&sorted=false remote:127.0.0.1:59043
D0226 14:08:53.35347 32170 http.go:84] 2016/02/26 14:08:53 etcdhttp: [GET] /v2/keys/kubernetes.io/services/specs?quorum=false&recursive=true&sorted=true remote:127.0.0.1:59043
D0226 14:08:53.35572 32170 http.go:84] 2016/02/26 14:08:53 etcdhttp: [GET] /v2/keys/kubernetes.io/ranges/servicenodeports?quorum=false&recursive=false&sorted=false remote:127.0.0.1:59043
D0226 14:08:53.35834 32170 http.go:84] 2016/02/26 14:08:53 etcdhttp: [PUT] /v2/keys/kubernetes.io/ranges/servicenodeports?prevExist=false remote:127.0.0.1:59043
E0226 14:08:58.236041 32170 client.go:601] 2016/02/26 14:08:58 etcdhttp: got unexpected response error (etcdserver: request timed out)
F0226 14:08:58.236267   32170 controller.go:90] Unable to perform initial service nodePort check: unable to persist the updated port allocations: error #0: client: etcd member http://127.0.0.1:43067 has no leader

Request timed out now.

smarterclayton commented 8 years ago

From the profile it looks like we had a ~9s iowait stall at 14:08 (see the attached cpu.pdf and iops.pdf).

smarterclayton commented 8 years ago

Hey look, EBS docs:

As with standard EBS volumes, there is up to a 50 percent reduction in IOPS when you first access the data. Performance is restored after the data is accessed once. For maximum performance consistency with new volumes, we recommend that you read or write to all the blocks on your volume before you use it for a workload.

Well... that explains that...

smarterclayton commented 8 years ago

https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/12002/artifact/origin/artifacts/test-integration/logs/cpu.pdf

iowait is ~10-15s. I'm going to increase the integration test request timeout for etcd to 20s

liggitt commented 8 years ago

It doesn't concern you to run our tests with such a fundamental parameter changed so much from the "real world"?

smarterclayton commented 8 years ago

It's an arbitrary timeout in a unit test whose point is to prove correctness of code, and that timeout is apparently far shorter than an actual real-world stall that has no impact on production workloads. I can't think of a case in which we could suffer from turning it up: it would only mean etcd is taking longer than some fixed amount of time to commit a very simple transaction, and at worst we would trigger a higher-level timeout.

So... no.

deads2k commented 8 years ago

It doesn't concern you to run our tests with such a fundamental parameter changed so much from the "real world"?

It bothers me. Do you want to try pre-allocating the whole thing? I don't recall what we had it set to before.

smarterclayton commented 8 years ago

Beyond raw feelings, can you articulate how lengthening this particular timeout is going to affect the reliability of our tests? Even a couple of justifications would help.

deads2k commented 8 years ago

This is an indication that we have a significant performance problem in a pretty normal environment. If our default settings don't allow reliable behavior in stock environments, we can hardly expect reliable behavior when users run it. It means we need to:

  1. Fix the default settings. If you think the timeout should always be higher, make it higher.
  2. Reliably identify a specific cause. If we could definitively identify it with a Prometheus metric or the like, we'd be able to troubleshoot the problem in the wild and have a reliable signal for re-kicking the individual test.

smarterclayton commented 8 years ago

We have zero indication that this is anything other than the normal behavior of the EBS system until you write to all blocks, coupled with an arbitrarily low timeout, in an environment where we have an unrealistic (non-real-world) expectation of zero failed calls.

If you expect zero failures in a test, then by definition an arbitrary timeout is the source of the problem.

I don't know how long writing to the entire disk takes, but it's worth finding out. It also needs to be tolerant of whatever disk size is in use. Even if we did that, I'd still increase the timeout, because there is no guarantee of bounded delay on a network-attached disk.

smarterclayton commented 8 years ago

Not sure we have to write, just read: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initialize.html
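
The initialization that document describes is a single full sequential read of the device; a minimal sketch, with /dev/xvdb standing in as a hypothetical name for the volume backing the test workspace:

# Read every block once so EBS faults them in from the snapshot; subsequent
# reads then run at the volume's normal latency. Device name is an assumption.
sudo dd if=/dev/xvdb of=/dev/null bs=1M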

smarterclayton commented 8 years ago

It takes 3 minutes to read the entire disk. We'd have to do this for each instance we want to test on, to avoid the read stalls. This is not write-specific.

smarterclayton commented 8 years ago

Nope, 37 minutes

smarterclayton commented 8 years ago

The best fio seems to be able to do is 40 minutes (reading), because we're limited to 100 IOPS. Writing 8 GB isn't enough. It might be possible to do random reads and writes over a probabilistic portion of the disk (randomly touch enough blocks), but my fio-fu does not extend that far. I'm going to merge the timeout increase in the meantime, and we can roll it back if someone figures out a way to make this faster.
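
For reference, a time-bounded random-read warm can be expressed in fio roughly as below; it only touches whichever blocks the random pattern happens to hit, so it is the probabilistic variant described above, and the device name and runtime are assumptions:

# Read random 128k blocks for ten minutes rather than the whole device end to
# end. Blocks that are never hit still pay the first-access penalty later.
sudo fio --name=ebs-warm --filename=/dev/xvdb --rw=randread \
    --bs=128k --iodepth=32 --ioengine=libaio --direct=1 \
    --time_based --runtime=600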

smarterclayton commented 8 years ago

BAHAHAHA https://travis-ci.org/openshift/origin/jobs/113777883

liggitt commented 8 years ago

...

smarterclayton commented 8 years ago

I'm wondering whether ansible zeros the disk now...

smarterclayton commented 8 years ago

It looks like there are no stalls now in the tests (that I can see) - peaks of 40-60% iowait for up to a second or two, but that's it. It kills me to spend 12 minutes on this at the beginning of a build.

deads2k commented 8 years ago

It looks like there are no stalls now in the tests (that I can see) - peaks of 40-60% iowait for up to a second or two, but that's it. It kills me to spend 12 minutes on this at the beginning of a build.

OK, so now we have both the long etcd timeout and the pre-read? Why not create a smaller block device and use that? I wouldn't expect us to use more than 100 megs or so in etcd.

smarterclayton commented 8 years ago

Yeah, we don't need to run our tests on the base volume (anything based on a snapshot has this behavior, and it's reads, not writes); we only need to run the tests on some drive. We could use memory (fast), a block device (slow), or a combination. We can shrink the main drive, use local disk (if available), or split into two devices. But whatever device we use can't be based on a snapshot. Once we make that change, we can revert both this and the long wait.
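
A sketch of the split-device option, assuming the instance has a spare local or freshly created (non-snapshot) volume available as /dev/xvdb; the tests then point their scratch space at a device with no lazily loaded blocks:

# Format and mount a drive that is not backed by a snapshot, then direct the
# test scratch space at it. Device name and mount point are assumptions.
sudo mkfs.ext4 /dev/xvdb
sudo mkdir -p /mnt/test-scratch
sudo mount /dev/xvdb /mnt/test-scratch
sudo chown "$(whoami)" /mnt/test-scratch
export TMPDIR=/mnt/test-scratch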

smarterclayton commented 7 years ago

Fixed.