improbable-eng / etcd-cluster-operator

A controller to deploy and manage etcd clusters inside of Kubernetes
MIT License
128 stars 35 forks source link

Set resource requests and limits on Etcd peer containers #125

Closed wallrj closed 4 years ago

wallrj commented 4 years ago

Fixes: #115

wallrj commented 4 years ago

E2E tests failing with this error, which might be caused by my allocating too little memory:

==== START logs for container etcd of pod teste2e-parallel-samplecluster/my-cluster-0-f4vr4 ====
2020-01-08 10:41:03.799604 I | pkg/flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://my-cluster-0.my-cluster.teste2e-parallel-samplecluster.svc:2379
2020-01-08 10:41:03.799668 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
2020-01-08 10:41:03.799696 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://my-cluster-0.my-cluster.teste2e-parallel-samplecluster.svc:2380
2020-01-08 10:41:03.799707 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=my-cluster-0=http://my-cluster-0.my-cluster.teste2e-parallel-samplecluster.svc:2380,my-cluster-1=http://my-cluster-1.my-cluster.teste2e-parallel-samplecluster.svc:2380,my-cluster-2=http://my-cluster-2.my-cluster.teste2e-parallel-samplecluster.svc:2380
2020-01-08 10:41:03.799714 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
2020-01-08 10:41:03.799720 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=my-cluster
2020-01-08 10:41:03.799729 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379
2020-01-08 10:41:03.799736 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
2020-01-08 10:41:03.799751 I | pkg/flags: recognized and used environment variable ETCD_NAME=my-cluster-0
2020-01-08 10:41:03.799837 I | etcdmain: etcd Version: 3.2.27
2020-01-08 10:41:03.799844 I | etcdmain: Git SHA: bdd97d5ff
2020-01-08 10:41:03.799856 I | etcdmain: Go Version: go1.8.7
2020-01-08 10:41:03.799860 I | etcdmain: Go OS/Arch: linux/amd64
2020-01-08 10:41:03.799865 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 2
2020-01-08 10:41:03.799924 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-01-08 10:41:03.799991 I | embed: listening for peers on http://0.0.0.0:2380
2020-01-08 10:41:03.800023 I | embed: listening for client requests on 0.0.0.0:2379
2020-01-08 10:41:03.801395 I | etcdserver: name = my-cluster-0
2020-01-08 10:41:03.801412 I | etcdserver: data dir = /var/lib/etcd
2020-01-08 10:41:03.801417 I | etcdserver: member dir = /var/lib/etcd/member
2020-01-08 10:41:03.801421 I | etcdserver: heartbeat = 100ms
2020-01-08 10:41:03.801427 I | etcdserver: election = 1000ms
2020-01-08 10:41:03.801431 I | etcdserver: snapshot count = 100000
2020-01-08 10:41:03.801443 I | etcdserver: advertise client URLs = http://my-cluster-0.my-cluster.teste2e-parallel-samplecluster.svc:2379
2020-01-08 10:41:04.133382 C | etcdserver: read wal error (cannot allocate memory) and cannot be repaired
==== END logs for container etcd of pod teste2e-parallel-samplecluster/my-cluster-0-f4vr4 ====

https://841-210613496-gh.circle-artifacts.com/0/tmp/etcd-e2e/teste2e-parallel-samplecluster/teste2e-parallel-samplecluster/my-cluster-0-f4vr4/logs.txt

But weirdly, at the same time, on my-cluster-1, I see this error:

2020-01-08 10:41:01.950426 I | etcdserver/membership: added member 20d8126a621b3638 [http://my-cluster-5.my-cluster.teste2e-parallel-samplecluster.svc:2380] to cluster 1ae6192c6d6cc02f
2020-01-08 10:41:01.950453 I | rafthttp: starting peer 20d8126a621b3638...
2020-01-08 10:41:01.950477 I | rafthttp: started HTTP pipelining with peer 20d8126a621b3638
2020-01-08 10:41:01.951503 I | rafthttp: started peer 20d8126a621b3638
2020-01-08 10:41:01.951811 I | rafthttp: added peer 20d8126a621b3638
2020-01-08 10:41:02.049842 I | rafthttp: started streaming with peer 20d8126a621b3638 (writer)
2020-01-08 10:41:02.049889 I | rafthttp: started streaming with peer 20d8126a621b3638 (writer)
2020-01-08 10:41:02.050096 I | rafthttp: started streaming with peer 20d8126a621b3638 (stream MsgApp v2 reader)
2020-01-08 10:41:02.050627 I | rafthttp: started streaming with peer 20d8126a621b3638 (stream Message reader)
2020-01-08 10:41:03.599703 I | raft: c1586ccb976e37c7 is starting a new election at term 23
2020-01-08 10:41:03.599936 I | raft: c1586ccb976e37c7 became candidate at term 24
2020-01-08 10:41:03.599966 I | raft: c1586ccb976e37c7 received MsgVoteResp from c1586ccb976e37c7 at term 24
2020-01-08 10:41:03.599978 I | raft: c1586ccb976e37c7 [logterm: 13, index: 19] sent MsgVote request to 20d8126a621b3638 at term 24
2020-01-08 10:41:03.599990 I | raft: c1586ccb976e37c7 [logterm: 13, index: 19] sent MsgVote request to 51938e7f648adbc2 at term 24
2020-01-08 10:41:03.600001 I | raft: c1586ccb976e37c7 [logterm: 13, index: 19] sent MsgVote request to 8eca1236bfa86a0a at term 24
2020-01-08 10:41:03.600014 I | raft: c1586ccb976e37c7 [logterm: 13, index: 19] sent MsgVote request to 54a9a667fad5f561 at term 24
2020-01-08 10:41:04.699658 I | raft: c1586ccb976e37c7 is starting a new election at term 24
2020-01-08 10:41:04.699715 I | raft: c1586ccb976e37c7 became candidate at term 25
2020-01-08 10:41:04.699729 I | raft: c1586ccb976e37c7 received MsgVoteResp from c1586ccb976e37c7 at term 25
2020-01-08 10:41:04.699739 I | raft: c1586ccb976e37c7 [logterm: 13, index: 19] sent MsgVote request to 20d8126a621b3638 at term 25
2020-01-08 10:41:04.699748 I | raft: c1586ccb976e37c7 [logterm: 13, index: 19] sent MsgVote request to 51938e7f648adbc2 at term 25
2020-01-08 10:41:04.699761 I | raft: c1586ccb976e37c7 [logterm: 13, index: 19] sent MsgVote request to 8eca1236bfa86a0a at term 25
2020-01-08 10:41:04.699769 I | raft: c1586ccb976e37c7 [logterm: 13, index: 19] sent MsgVote request to 54a9a667fad5f561 at term 25
2020-01-08 10:41:05.729952 I | rafthttp: peer 54a9a667fad5f561 became active
2020-01-08 10:41:05.729989 I | rafthttp: established a TCP streaming connection with peer 54a9a667fad5f561 (stream Message reader)
2020-01-08 10:41:05.730829 I | rafthttp: established a TCP streaming connection with peer 54a9a667fad5f561 (stream MsgApp v2 reader)
2020-01-08 10:41:06.099608 I | raft: c1586ccb976e37c7 is starting a new election at term 25
2020-01-08 10:41:06.099642 I | raft: c1586ccb976e37c7 became candidate at term 26
2020-01-08 10:41:06.099652 I | raft: c1586ccb976e37c7 received MsgVoteResp from c1586ccb976e37c7 at term 26
2020-01-08 10:41:06.099662 I | raft: c1586ccb976e37c7 [logterm: 13, index: 19] sent MsgVote request to 51938e7f648adbc2 at term 26
2020-01-08 10:41:06.099671 I | raft: c1586ccb976e37c7 [logterm: 13, index: 19] sent MsgVote request to 8eca1236bfa86a0a at term 26
2020-01-08 10:41:06.099680 I | raft: c1586ccb976e37c7 [logterm: 13, index: 19] sent MsgVote request to 54a9a667fad5f561 at term 26
2020-01-08 10:41:06.099688 I | raft: c1586ccb976e37c7 [logterm: 13, index: 19] sent MsgVote request to 20d8126a621b3638 at term 26
2020-01-08 10:41:06.102582 I | raft: c1586ccb976e37c7 received MsgVoteResp from 54a9a667fad5f561 at term 26
2020-01-08 10:41:06.102655 I | raft: c1586ccb976e37c7 [quorum:3] has received 2 MsgVoteResp votes and 0 vote rejections
2020-01-08 10:41:06.832924 W | rafthttp: health check for peer 51938e7f648adbc2 could not connect: <nil> (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-01-08 10:41:06.839102 W | rafthttp: health check for peer 8eca1236bfa86a0a could not connect: <nil> (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-01-08 10:41:06.858292 W | rafthttp: health check for peer 51938e7f648adbc2 could not connect: <nil> (prober "ROUND_TRIPPER_SNAPSHOT")
2020-01-08 10:41:06.925322 W | rafthttp: health check for peer 8eca1236bfa86a0a could not connect: <nil> (prober "ROUND_TRIPPER_SNAPSHOT")
2020-01-08 10:41:07.051490 W | rafthttp: health check for peer 20d8126a621b3638 could not connect: dial tcp: lookup my-cluster-5.my-cluster.teste2e-parallel-samplecluster.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-01-08 10:41:07.051544 W | rafthttp: health check for peer 20d8126a621b3638 could not connect: dial tcp: lookup my-cluster-5.my-cluster.teste2e-parallel-samplecluster.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_SNAPSHOT")
2020-01-08 10:41:07.199734 I | raft: c1586ccb976e37c7 is starting a new election at term 26

https://841-210613496-gh.circle-artifacts.com/0/tmp/etcd-e2e/teste2e-parallel-samplecluster/teste2e-parallel-samplecluster/my-cluster-1-q8jvw/logs.txt

Which talks about my-cluster-5 (the sixth peer name), but this is a 5-node cluster....so that seems like a bug.

wallrj commented 4 years ago

Here are the controller logs:

2020-01-08T10:40:19.794Z    ERROR   controller-runtime.controller   Reconciler error    {"controller": "etcdcluster", "request": "teste2e-parallel-samplecluster/my-cluster", "error": "failed to add new member: unable to add member to etcd cluster: client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member http://my-cluster.teste2e-parallel-samplecluster.svc:2379 has no leader\n", "errorCauses": [{"error": "failed to add new member: unable to add member to etcd cluster: client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member http://my-cluster.teste2e-parallel-samplecluster.svc:2379 has no leader\n"}]}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
    /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
2020-01-08T10:40:22.587Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "etcdpeer", "request": "teste2e-parallel-samplecluster/my-cluster-4"}
2020-01-08T10:40:24.946Z    DEBUG   controllers.EtcdCluster Too few members for expected replica count, adding new member.  {"cluster": "teste2e-parallel-samplecluster/my-cluster", "expected-replicas": 5, "actual-members": 4, "peer-name": "my-cluster-5", "peer-url": "http://my-cluster-5.my-cluster.teste2e-parallel-samplecluster.svc:2380"}
2020-01-08T10:40:24.956Z    DEBUG   controller-runtime.manager.events   Normal  {"object": {"kind":"EtcdCluster","namespace":"teste2e-parallel-samplecluster","name":"my-cluster","uid":"1ba50642-439b-462b-8eaa-709102b77312","apiVersion":"etcd.improbable.io/v1alpha1","resourceVersion":"1174"}, "reason": "MemberAdded", "message": "Added a new member with name \"my-cluster-5\""}
2020-01-08T10:40:35.027Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/mutate-etcd-improbable-io-v1alpha1-etcdpeer", "UID": "ead5bacf-0cb9-438b-801c-c8a8001cf0a3", "kind": "etcd.improbable.io/v1alpha1, Kind=EtcdPeer", "resource": {"group":"etcd.improbable.io","version":"v1alpha1","resource":"etcdpeers"}}
2020-01-08T10:40:35.028Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/mutate-etcd-improbable-io-v1alpha1-etcdpeer", "UID": "ead5bacf-0cb9-438b-801c-c8a8001cf0a3", "allowed": true, "result": {}, "resultError": "got runtime.Object without object metadata: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:,Message:,Reason:,Details:nil,Code:200,}"}
2020-01-08T10:40:35.031Z    DEBUG   controller-runtime.webhook.webhooks received request    {"webhook": "/validate-etcd-improbable-io-v1alpha1-etcdpeer", "UID": "1cbbfe9e-305d-434a-95f4-1f866de080ad", "kind": "etcd.improbable.io/v1alpha1, Kind=EtcdPeer", "resource": {"group":"etcd.improbable.io","version":"v1alpha1","resource":"etcdpeers"}}
2020-01-08T10:40:35.031Z    DEBUG   controller-runtime.webhook.webhooks wrote response  {"webhook": "/validate-etcd-improbable-io-v1alpha1-etcdpeer", "UID": "1cbbfe9e-305d-434a-95f4-1f866de080ad", "allowed": true, "result": {}, "resultError": "got runtime.Object without object metadata: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:,Message:,Reason:,Details:nil,Code:200,}"}
2020-01-08T10:40:35.041Z    DEBUG   controllers.EtcdCluster Found member in etcd's API that has no EtcdPeer resource representation, added one. {"cluster": "teste2e-parallel-samplecluster/my-cluster", "member-name": "my-cluster-5"}
2020-01-08T10:40:35.056Z    DEBUG   controller-runtime.manager.events   Normal  {"object": {"kind":"EtcdCluster","namespace":"teste2e-parallel-samplecluster","name":"my-cluster","uid":"1ba50642-439b-462b-8eaa-709102b77312","apiVersion":"etcd.improbable.io/v1alpha1","resourceVersion":"1358"}, "reason": "PeerCreated", "message": "Created a new EtcdPeer with name \"my-cluster-5\""}
2020-01-08T10:40:35.063Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "etcdpeer", "request": "teste2e-parallel-samplecluster/my-cluster-5"}
2020-01-08T10:40:35.064Z    DEBUG   controllers.EtcdPeer    Replica set does not exist, creating    {"peer": "teste2e-parallel-samplecluster/my-cluster-5", "replica-set": "my-cluster-5"}
2020-01-08T10:40:35.074Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "etcdpeer", "request": "teste2e-parallel-samplecluster/my-cluster-5"}
2020-01-08T10:40:35.075Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "etcdpeer", "request": "teste2e-parallel-samplecluster/my-cluster-5"}
2020-01-08T10:40:35.124Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "etcdpeer", "request": "teste2e-parallel-samplecluster/my-cluster-5"}
2020-01-08T10:40:35.125Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "etcdpeer", "request": "teste2e-parallel-samplecluster/my-cluster-5"}
2020-01-08T10:40:47.868Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "etcdpeer", "request": "teste2e-parallel-samplecluster/my-cluster-3"}
2020-01-08T10:40:49.840Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "etcdpeer", "request": "teste2e-parallel-samplecluster/my-cluster-4"}
2020-01-08T10:40:51.861Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "etcdpeer", "request": "teste2e-parallel-samplecluster/my-cluster-2"}
2020-01-08T10:41:01.942Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "etcdpeer", "request": "teste2e-parallel-samplecluster/my-cluster-4"}
2020-01-08T10:41:05.023Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "etcdpeer", "request": "teste2e-parallel-samplecluster/my-cluster-0"}
2020-01-08T10:41:05.981Z    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "etcdpeer", "request": "teste2e-parallel-samplecluster/my-cluster-3"}

https://841-210613496-gh.circle-artifacts.com/0/tmp/etcd-e2e/teste2e-parallel-samplecluster/eco-system/eco-controller-manager-798b645f4c-khpq9/logs.txt

And they too show the controller adding a member called my-cluster-5.

wallrj commented 4 years ago

make test failed due to flakey test https://github.com/improbable-eng/etcd-cluster-operator/issues/132. Retrying