Closed — smarterclayton closed this issue 7 years ago
Hrmm, I don't remember seeing that error. I'm used to timeout and failed to propose issues
do we health check etcd when starting the server?
> Hrmm, I don't remember seeing that error. I'm used to timeout and failed to propose issues
Could be a new error with the new client. Remember DANGER in the rebase?
I think you're right. The extended jenkins tests aren't doing anything for the disk workaround
The Mystikal song Danger?
Well, we can see the error now and it looks to be the same "has no leader" with a comment: // TODO: make sure this is a no leader response. I have no good ideas for how to solve this problem.
Since I have the link handy. This used to be caused by slowness here: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/client/client.go#L294-L296. We'd probably be well served to enable and gather prometheus metrics during test-integration to see if that's still the spot with weird spikes.
There are three options I'm considering
Right now, integration tests are triggering this early in the server start process. I haven't yet seen one during the actual test.
On Thu, Feb 25, 2016 at 10:29 AM, David Eads notifications@github.com wrote:
> Sorry, this link: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/wal/wal.go#L398-L408
> — Reply to this email directly or view it on GitHub https://github.com/openshift/origin/issues/7243#issuecomment-188840879.
Other things to test - putting etcd on memory storage. I don't expect we need more than 1-20M of data in etcd.
On Thu, Feb 25, 2016 at 10:50 AM, Clayton Coleman ccoleman@redhat.com wrote:
> There are three options I'm considering
> - The known EBS sync delay is causing this - testable by a) retry causing the problem to go away, b) logs and SAR
> - An unknown, smaller EBS sync delay causing this (due to random network delays in production) - testable by retry causing the problem to go away, SAR, and possibly looking at IOP budget
> - A leader stability issue, or other internal issue that only occurs in the first few seconds of an etcd instance - testable by retry causing the problem to go away, logs in etcd
> Right now, integration tests are triggering this early in the server start process. I haven't yet seen one during the actual test.
> Other things to test - putting etcd on memory storage. I don't expect we need more than 1-20M of data in etcd.
We do that for the merge queue and it rarely (never) hits this problem.
I think I've seen merge queue fail today on this problem, will check.
On Thu, Feb 25, 2016 at 10:54 AM, David Eads notifications@github.com wrote:
> Other things to test - putting etcd on memory storage. I don't expect we need more than 1-20M of data in etcd.
> We do that for the merge queue and it rarely (never) hits this problem.
But this is storage/testing, not the "start etcd on a ramdisk". It's using TMPDIR.
The health check logs look like this (for the swagger generation failing due to the api server not coming up in time):
I0225 11:34:15.728218 6100 panic.go:17] Process will terminate as soon as a panic occurs.
I0225 11:34:15.739334 6100 start_allinone.go:256] Starting profiling endpoint at http://127.0.0.1:6060/debug/pprof/
W0225 11:34:16.017350 6100 start_master.go:267] assetConfig.loggingPublicURL: Invalid value: "": required to view aggregated container logs in the console
W0225 11:34:16.017384 6100 start_master.go:267] assetConfig.metricsPublicURL: Invalid value: "": required to view cluster metrics in the console
I0225 11:34:16.727810 6100 plugins.go:71] No cloud provider specified.
I0225 11:34:16.864896 6100 interface.go:248] Default route transits interface "eth0"
I0225 11:34:16.865029 6100 interface.go:93] Interface eth0 is up
I0225 11:34:16.865078 6100 interface.go:138] Interface "eth0" has 2 addresses :[172.18.15.14/20 fe80::cfe:19ff:fe80:a44d/64].
I0225 11:34:16.865097 6100 interface.go:105] Checking addr 172.18.15.14/20.
I0225 11:34:16.865104 6100 interface.go:114] IP found 172.18.15.14
I0225 11:34:16.865113 6100 interface.go:144] valid IPv4 address for interface "eth0" found as 172.18.15.14.
I0225 11:34:16.865119 6100 interface.go:254] Choosing IP 172.18.15.14
I0225 11:34:16.865133 6100 master_config.go:222] Will report 172.18.15.14 as public IP address.
I0225 11:34:16.865153 6100 start_master.go:378] Starting master on 127.0.0.1:38443 (v1.1.3-284-g6be4a29)
I0225 11:34:16.865158 6100 start_master.go:379] Public master address is https://127.0.0.1:38443
I0225 11:34:16.865171 6100 start_master.go:383] Using images from "openshift/origin-<component>:latest"
I0225 11:34:17.006597 6100 server.go:64] etcd: peerTLS: cert = /tmp/openshift/generate-swagger-spec/openshift.local.config/master/etcd.server.crt, key = /tmp/openshift/generate-swagger-spec/openshift.local.config/master/etcd.server.key, ca = /tmp/openshift/generate-swagger-spec/openshift.local.config/master/ca.crt, trusted-ca = , client-cert-auth = false
I0225 11:34:17.148984 6100 server.go:75] etcd: listening for peers on https://127.0.0.1:37001
I0225 11:34:17.149013 6100 server.go:86] etcd: clientTLS: cert = /tmp/openshift/generate-swagger-spec/openshift.local.config/master/etcd.server.crt, key = /tmp/openshift/generate-swagger-spec/openshift.local.config/master/etcd.server.key, ca = /tmp/openshift/generate-swagger-spec/openshift.local.config/master/ca.crt, trusted-ca = , client-cert-auth = false
I0225 11:34:17.289804 6100 server.go:100] etcd: listening for client requests on https://127.0.0.1:34001
# 15s here
I0225 11:34:33.477671 6100 run.go:61] Started etcd at 127.0.0.1:34001
I0225 11:34:33.649050 6100 run_components.go:189] Using default project node label selector:
I0225 11:34:34.064036 6100 genericapiserver.go:337] Setting GenericAPIServer service IP to "172.30.0.1" (read-write).
# 8s here
I0225 11:34:42.467293 6100 trace.go:57] Trace "etcdHelper::Create *api.Namespace" (started 2016-02-25 11:34:39.927644263 -0500 EST):
[127.066µs] [127.066µs] Object encoded
[133.519µs] [6.453µs] Version checked
[2.539418694s] [2.539285175s] Object created
[2.53960844s] [189.746µs] END
W0225 11:34:42.573052 6100 controller.go:262] Resetting endpoints for master service "kubernetes" to &{{ } {kubernetes default 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[]} [{[{172.18.15.14 <nil>}] [] [{https 38443 TCP} {dns 53 UDP} {dns-tcp 53 TCP}]}]}
I0225 11:34:42.788239 6100 plugin.go:27] Route plugin initialized with suffix=router.default.svc.cluster.local
I0225 11:34:43.016096 6100 master.go:258] Started Kubernetes API at 127.0.0.1:38443/api/v1
I0225 11:34:43.016129 6100 master.go:258] Started Kubernetes API Extensions at 127.0.0.1:38443/apis/extensions/v1beta1
I0225 11:34:43.016136 6100 master.go:258] Started Origin API at 127.0.0.1:38443/oapi/v1
I0225 11:34:43.016141 6100 master.go:258] Started OAuth2 API at 127.0.0.1:38443/oauth
I0225 11:34:43.016146 6100 master.go:258] Started Login endpoint at 127.0.0.1:38443/login
I0225 11:34:43.016151 6100 master.go:258] Started Web Console 127.0.0.1:38443/console/
I0225 11:34:43.016166 6100 master.go:258] Started Swagger Schema API at 127.0.0.1:38443/swaggerapi/
I0225 11:34:43.231929 6100 ensure.go:177] No cluster policy found. Creating bootstrap policy based on: /tmp/openshift/generate-swagger-spec/openshift.local.config/master/policy.json
I0225 11:34:43.232028 6100 decoder.go:141] decoding stream as JSON
This is immediately after an 8GB write. We had to wait almost 15 seconds for etcd to start (touching WAL files), 8 seconds for some of the early init code, and 2s for the namespace to be created. This looks like something else - not disk, not leader stability, but a very long CPU delay.
Or arbitrary network delay.
Started up an instance, did a clone of master, and ran:
OS_OUTPUT_GOPATH=1 TIMES=1000 hack/test-integration.sh TestImageStreamList
The integration test fails every ~20th iteration. This is not a disk issue - because no extra disk is being allocated (/tmp stays at about 100k of contents).
CPU load is roughly one core, averaged.
Actually, seeing this fail ~1/5 times.
Got debug trace:
D0225 20:21:24.765587 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [GET] /v2/keys/kubernetes.io/secrets?recursive=true&wait=true&waitIndex=111 remote:127.0.0.1:51108
E0225 20:21:24.766202 20443 client.go:601] 2016/02/25 20:21:24 etcdhttp: got unexpected response error (etcdserver: server stopped)
D0225 20:21:24.766984 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [GET] /v2/keys/kubernetes.io/secrets?recursive=true&wait=true&waitIndex=111 remote:127.0.0.1:51132
D0225 20:21:24.768134 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [GET] /v2/keys/openshift.io/ranges/uids?quorum=false&recursive=false&sorted=false remote:127.0.0.1:51078
D0225 20:21:24.768225 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [GET] /v2/keys/kubernetes.io/secrets?recursive=true&wait=true&waitIndex=111 remote:127.0.0.1:51094
E0225 20:21:24.768438 20443 errors.go:63] apiserver received an error that is not an unversioned.Status: error #0: client: etcd member http://127.0.0.1:35240 has no leader: 500
E0225 20:21:24.768438 20443 errors.go:63] apiserver received an error that is not an unversioned.Status: error #0: client: etcd member http://127.0.0.1:35240 has no leader: 500
E0225 20:21:24.768618 20443 serviceaccounts_controller.go:221] error #0: client: etcd member http://127.0.0.1:35240 has no leader: 500
D0225 20:21:24.769874 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [PUT] /v2/keys/kubernetes.io/secrets/openshift-infra/build-controller-token-s8uy2?prevExist=false remote:127.0.0.1:51135
E0225 20:21:24.769943 20443 client.go:601] 2016/02/25 20:21:24 etcdhttp: got unexpected response error (etcdserver: server stopped)
D0225 20:21:24.770010 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [PUT] /v2/keys/openshift.io/ranges/uids?prevExist=false remote:127.0.0.1:51078
E0225 20:21:24.770058 20443 client.go:601] 2016/02/25 20:21:24 etcdhttp: got unexpected response error (etcdserver: server stopped)
D0225 20:21:24.770126 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [GET] /v2/keys/kubernetes.io/serviceaccounts/openshift-infra/build-controller?quorum=false&recursive=false&sorted=false remote:127.0.0.1:51113
E0225 20:21:24.770237 20443 errors.go:63] apiserver received an error that is not an unversioned.Status: error #0: client: etcd member http://127.0.0.1:35240 has no leader: 500
E0225 20:21:24.770386 20443 create_dockercfg_secrets.go:107] error #0: client: etcd member http://127.0.0.1:35240 has no leader: 500
D0225 20:21:24.770545 20443 http.go:84] 2016/02/25 20:21:24 etcdhttp: [PUT] /v2/keys/kubernetes.io/serviceaccounts/default/builder?prevExist=false remote:127.0.0.1:51135
E0225 20:21:24.770604 20443 client.go:601] 2016/02/25 20:21:24 etcdhttp: got unexpected response error (etcdserver: server stopped)
F0225 20:21:24.770710 20443 run_components.go:411] Unable to initialize namespaces: unable to persist the updated namespace UID allocations: error #0: client: etcd member http://127.0.0.1:35240 has no leader: 500
And another:
D0225 20:23:25.883343 20822 http.go:84] 2016/02/25 20:23:25 etcdhttp: [GET] /v2/keys/kubernetes.io/serviceaccounts/openshift-infra/build-controller?quorum=false&recursive=false&sorted=false remote:127.0.0.1:53233
D0225 20:23:25.883699 20822 http.go:84] 2016/02/25 20:23:25 etcdhttp: [PUT] /v2/keys/kubernetes.io/serviceaccounts/openshift-infra/build-controller?prevIndex=80&prevValue=%7B%22kind%22%3A%22ServiceAccount%22%2C%22apiVersion%22%3A%22v1%22%2C%22metadata%22%3A%7B%22name%22%3A%22build-controller%22%2C%22namespace%22%3A%22openshift-infra%22%2C%22uid%22%3A%229f0c5c5b-dbfd-11e5-ab9a-0e71621efd09%22%2C%22creationTimestamp%22%3A%222016-02-25T20%3A23%3A24Z%22%7D%7D%0A remote:127.0.0.1:53233
E0225 20:23:25.884595 20822 client.go:601] 2016/02/25 20:23:25 etcdhttp: got unexpected response error (etcdserver: server stopped)
E0225 20:23:25.884668 20822 client.go:601] 2016/02/25 20:23:25 etcdhttp: got unexpected response error (etcdserver: server stopped)
E0225 20:23:25.884729 20822 client.go:601] 2016/02/25 20:23:25 etcdhttp: got unexpected response error (etcdserver: server stopped)
E0225 20:23:25.884786 20822 client.go:601] 2016/02/25 20:23:25 etcdhttp: got unexpected response error (etcdserver: server stopped)
E0225 20:23:25.884896 20822 errors.go:63] apiserver received an error that is not an unversioned.Status: error #0: client: etcd member http://127.0.0.1:57579 has no leader: 500
E0225 20:23:25.884946 20822 errors.go:63] apiserver received an error that is not an unversioned.Status: error #0: client: etcd member http://127.0.0.1:57579 has no leader: 500
F0225 20:23:25.885023 20822 run_components.go:411] Unable to initialize namespaces: unable to persist the updated namespace UID allocations: error #0: client: etcd member http://127.0.0.1:57579 has no leader: 500
E0225 20:23:25.904958 20822 errors.go:63] apiserver received an error that is not an unversioned.Status: error #0: client: etcd member http://127.0.0.1:57579 has no leader: 500
"etcdserver: server stopped" means "propose took too long", which is commonly a result of "sync too long". See https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L565.
You wouldn't think that etcd errors mean what you think they mean, would you?
I only see s.done getting closed on server.Stop, where are you seeing the propose time out result in it being closed?
> I only see s.done getting closed on server.Stop, where are you seeing the propose time out result in it being closed?
Crossed versions in my environment. Retracted.
It looks like it's closed in a defer when run exits. Maybe it is panicking? Or otherwise falling out of the loop? Will dig into it later this evening, at least it is recreatable.
I believe I've fixed this.
Reopening to track what looks like the original issue, the timeout. Going to test the try PR repeatedly to exercise the failure.
D0226 14:08:53.618717 32130 http.go:84] 2016/02/26 14:08:53 etcdhttp: [GET] /v2/keys/kubernetes.io/ranges/serviceips?quorum=false&recursive=false&sorted=false remote:127.0.0.1:49881
D0226 14:08:53.619004 32130 http.go:84] 2016/02/26 14:08:53 etcdhttp: [PUT] /v2/keys/kubernetes.io/ranges/serviceips?prevExist=false remote:127.0.0.1:49881
E0226 14:08:58.819189 32130 client.go:601] 2016/02/26 14:08:58 etcdhttp: got unexpected response error (etcdserver: request timed out)
F0226 14:08:58.819467 32130 controller.go:86] Unable to perform initial IP allocation check: unable to persist the updated service IP allocations: error #0: client: etcd member http://127.0.0.1:45797 has no leader
D0226 14:08:53.35119 32170 http.go:84] 2016/02/26 14:08:53 etcdhttp: [GET] /v2/keys/kubernetes.io/ranges/servicenodeports?quorum=false&recursive=false&sorted=false remote:127.0.0.1:59043
D0226 14:08:53.35347 32170 http.go:84] 2016/02/26 14:08:53 etcdhttp: [GET] /v2/keys/kubernetes.io/services/specs?quorum=false&recursive=true&sorted=true remote:127.0.0.1:59043
D0226 14:08:53.35572 32170 http.go:84] 2016/02/26 14:08:53 etcdhttp: [GET] /v2/keys/kubernetes.io/ranges/servicenodeports?quorum=false&recursive=false&sorted=false remote:127.0.0.1:59043
D0226 14:08:53.35834 32170 http.go:84] 2016/02/26 14:08:53 etcdhttp: [PUT] /v2/keys/kubernetes.io/ranges/servicenodeports?prevExist=false remote:127.0.0.1:59043
E0226 14:08:58.236041 32170 client.go:601] 2016/02/26 14:08:58 etcdhttp: got unexpected response error (etcdserver: request timed out)
F0226 14:08:58.236267 32170 controller.go:90] Unable to perform initial service nodePort check: unable to persist the updated port allocations: error #0: client: etcd member http://127.0.0.1:43067 has no leader
Request timed out now.
Hey look, EBS docs:
> As with standard EBS volumes, there is up to a 50 percent reduction in IOPS when you first access the data. Performance is restored after the data is accessed once. For maximum performance consistency with new volumes, we recommend that you read or write to all the blocks on your volume before you use it for a workload.
Well... that explains that...
iowait is ~10-15s. I'm going to increase the integration test request timeout for etcd to 20s.
It doesn't concern you to run our tests with such a fundamental parameter changed so much from the "real world"?
It's an arbitrary timeout in a unit test which is about proving correctness of code, which apparently is << an actual real world stall that has no impact on production workloads. I can't think of a case in which we could suffer from turning this up, because it would mean etcd is taking longer than a certain amount of time to commit a very simple transaction, and at worst we would trigger a higher level timeout.
So... no.
> It doesn't concern you to run our tests with such a fundamental parameter changed so much from the "real world"?
It bothers me. Do you want to try pre-allocating the whole thing? I don't recall what we had it set to before.
More than just raw feelings, can you articulate how the lengthening of this particular timeout is going to affect the reliability of our tests? Even a couple of justifications.
On Mon, Feb 29, 2016 at 7:10 AM, David Eads notifications@github.com wrote:
> It doesn't concern you to run our tests with such a fundamental parameter changed so much from the "real world"?
> It bothers me. Do you want to try pre-allocating the whole thing? I don't recall what we had it set to before.
This is an indication that we have a significant performance problem in a pretty normal environment. If our default settings don't allow reliable behavior on stock environments, we can hardly expect reliable behavior when users run it. This is an indication that we need to:
We have 0 indication this is anything other than normal behavior of the EBS system until you write to all blocks, coupled with an arbitrarily low timeout, in an environment where we have unrealistic (non real world) expectations of zero failed calls.
If you expect zero failures in a test, then by definition an arbitrary timeout is the source of the problem.
I don't know how long writing to the entire disk takes, but it's worth finding out. It needs to be tolerant of picking the correct disk size. Even if we did that, I'd still increase the timeout because there is no guarantee of bounded delay on network-attached disk.
Not sure we have to write, just read: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initialize.html
It takes 3 minutes to read the entire disk. We have to do this for each instance we want to test to avoid read stalls. This is not write specific.
Nope, 37 minutes
Best fio seems to be able to do is 40 minutes (reading) because we're limited to 100 IOPS. Writing 8GB isn't enough. It might be possible to do random reads and writes for a probabilistic part of the disk (randomly write enough blocks) but my fio-fu does not extend that far. I'm going to merge the timeout increase in the meantime and we can roll it back if someone can figure out a way to make this faster.
...
I'm wondering whether ansible zeros the disk now...
On Mar 4, 2016, at 8:28 PM, Jordan Liggitt notifications@github.com wrote:
> ...
It looks like there are no stalls now in the tests (that I can see) - peaks of 40-60% iowait for up to a second or two, but that's it. It kills me to spend 12 minutes on this at the beginning of a build.
> It looks like there are no stalls now in the tests (that I can see) - peaks of 40-60% iowait for up to a second or two, but that's it. It kills me to spend 12 minutes on this at the beginning of a build.
Ok, so now we have both the long etcd timeout and the pre-read? Why not create a smaller block device and use that? I wouldn't expect us to use more than 100 megs or so in etcd.
Yeah, we don't need to run our tests on the base volume (anything that has snapshots has this behavior, it's reads and not writes), we only need to run our tests on a drive. We could do memory (fast), block (slow), or a combination. We can shrink the size of the main drive, use local disk (if available), or split into two devices. But whatever device we use can't be based on a snapshot. Once we make that change, we can revert both this and the long wait.
On Mon, Mar 7, 2016 at 7:43 AM, David Eads notifications@github.com wrote:
> It looks like there are no stalls now in the tests (that I can see) - peaks of 40-60% iowait for up to a second or two, but that's it. It kills me to spend 12 minutes on this at the beginning of a build.
> Ok, so now we have both the long etcd timeout and the pre-read? Why not create a smaller block device and use that? I wouldn't expect us to use more than 100 megs or so in etcd.
Fixed.
https://ci.openshift.redhat.com/jenkins/job/origin_extended/617/consoleFull
Is this the disk stall thing?