Closed wallrj closed 6 years ago
/test e2e
This appears to have failed due to:
I1114 21:33:00.155] E1114 21:31:46.183817 8 leaderelection.go:258] Failed to update lock: the server was unable to return a response in the time allotted, but may still be processing the request (put endpoints navigator-controller)
I'm starting to think we should either invert the lookup to check for error messages we do care about, or make this a non-failing check (e.g. just print messages that don't match the whitelist as an informational warning).
We really should only care about whether the tests pass. What do you think?
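For illustration, the two options could look something like this (a hedged sketch, not the actual implementation; the file names and helper names are made up, and the whitelist is assumed to be a file of grep patterns):

```shell
# List error-level log lines that don't match any whitelisted pattern.
# (All paths and names here are illustrative.)
unexpected_errors() {
  local log_file="$1" whitelist_file="$2"
  grep '^E' "${log_file}" | grep -v -f "${whitelist_file}" || true
}

# Current behaviour: the check fails if any unexpected error is found.
check_no_unexpected_errors() {
  [ -z "$(unexpected_errors "$1" "$2")" ]
}

# Non-failing alternative: print unexpected errors for information only,
# and always succeed.
report_unexpected_errors() {
  local unexpected
  unexpected="$(unexpected_errors "$1" "$2")"
  if [ -n "${unexpected}" ]; then
    echo "INFO: unexpected error lines (not failing the build):"
    echo "${unexpected}"
  fi
}
```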
/retest /lgtm /approve
/test all [submit-queue is verifying that this PR is safe to merge]
I find it useful to fail on these errors while I'm working on the Cassandra stuff.
It's interesting how often leader election seems to fail.
And since there is only a single navigator controller, that suggests something is wrong with our test cluster.
The second error looks like a bug to me: https://github.com/kubernetes/kubernetes/issues/42701 which is supposed to be fixed. Perhaps it suggests that we need to be using a newer version of client-go?
I1114 21:33:00.155] E1114 21:31:46.183817 8 leaderelection.go:258] Failed to update lock: the server was unable to return a response in the time allotted, but may still be processing the request (put endpoints navigator-controller)
I1114 21:33:00.156] E1114 21:31:46.186329 8 event.go:260] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'nav-e2e-navigator-controller-1608401658-tztrs-external-navigator-controller stopped leading'
If you prefer to remove the unexpected log error check in master that's fine, but I'll keep it enabled in the feature branch until it gets merged to master.
I don't know how you'd invert the search: we'd need to know, up front, all of the important errors. How would we have known to look specifically for E1102 12:26:09.653605 1 event.go:200] Server rejected event
in https://github.com/jetstack/navigator/issues/95?
The thing is, those errors are transient. It's okay if they happen, so long as the request to acquire the leader lock eventually succeeds. That's the whole point of Kubernetes having retries built in, and we already time out after a certain period if something fails, so we will already fail if something isn't "as expected".
The same goes for your final example error: if the server rejects an event, the event recorder retries sending it later. A test that checks the event is eventually recorded, with a suitable timeout, should pick this error up.
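As a hedged sketch, such a check could poll `kubectl get events` until the expected reason appears or a timeout expires (the namespace/reason values are illustrative, and `KUBECTL` is made overridable purely so the helper can be exercised without a cluster):

```shell
# Allow the kubectl binary to be substituted (e.g. with a stub in tests).
KUBECTL="${KUBECTL:-kubectl}"

# Poll the events in a namespace until one matching the expected reason
# appears, or the timeout (in seconds) expires.
wait_for_event() {
  local namespace="$1" reason="$2" timeout="${3:-300}" sleep_time="${4:-10}"
  local deadline=$(( $(date +%s) + timeout ))
  while [ "$(date +%s)" -le "${deadline}" ]; do
    if ${KUBECTL} get events --namespace "${namespace}" 2>/dev/null \
        | grep -q "${reason}"; then
      return 0
    fi
    sleep "${sleep_time}"
  done
  return 1
}
```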
Well, I don't want to labour the point, but
cannot patch events in the namespace "navigator-e2e-database1" (will not retry!)
But if it becomes more trouble than it's worth, I'm happy to remove the check from master.
That latest test failure was caused by #133
/test e2e v1.8
/test e2e v1.7
Looking at that most recent failure, it occurred at 00:11:52.077 according to the logs from the apiserver:
I1115 00:11:52.077] E1115 00:09:28.579761 5 leaderelection.go:224] error retrieving resource lock kube-system/navigator-controller: the server was unable to return a response in the time allotted, but may still be processing the request (get endpoints navigator-controller)
If you look at the logs from the build at around that time, we can see:
I1115 00:07:20.745] NAMESPACE: test-cassandracluster-1510704418-17903
I1115 00:07:20.745] STATUS: DEPLOYED
I1115 00:07:20.745]
I1115 00:07:20.745] RESOURCES:
I1115 00:07:20.746] ==> v1alpha1/CassandraCluster
I1115 00:07:20.746] NAME KIND
I1115 00:07:20.746] cassandra-1510704418-17903-cassandra CassandraCluster.v1alpha1.navigator.jetstack.io
I1115 00:07:20.746]
I1115 00:07:20.746]
W1115 00:07:20.857] + retry TIMEOUT=300 kube_service_responding test-cassandracluster-1510704418-17903 cass-cassandra-1510704418-17903-cassandra 9042
W1115 00:07:20.857] + local TIMEOUT=60
W1115 00:07:20.857] + local SLEEP=10
W1115 00:07:20.857] + :
W1115 00:07:20.857] + case "${1}" in
W1115 00:07:20.857] + local TIMEOUT=300
W1115 00:07:20.857] + shift
W1115 00:07:20.857] + :
W1115 00:07:20.857] + case "${1}" in
W1115 00:07:20.857] + break
W1115 00:07:20.857] ++ date +%s
W1115 00:07:20.857] + local start_time=1510704440
W1115 00:07:20.857] + local end_time=1510704740
W1115 00:07:20.857] + true
W1115 00:07:20.857] + kube_service_responding test-cassandracluster-1510704418-17903 cass-cassandra-1510704418-17903-cassandra 9042
W1115 00:07:20.858] + local namespace=test-cassandracluster-1510704418-17903
W1115 00:07:20.858] + local host=cass-cassandra-1510704418-17903-cassandra
W1115 00:07:20.858] + local port=9042
W1115 00:07:20.858] + kubectl run --namespace=test-cassandracluster-1510704418-17903 kube-service-responding-20946 --stdin=true --rm --restart=Never --image alpine -- nc -w 5 -v cass-cassandra-1510704418-17903-cassandra 9042
W1115 00:10:24.891] ++ date +%s
W1115 00:10:24.897] + local current_time=1510704624
W1115 00:10:24.897] + local remaining_time=116
W1115 00:10:24.897] + [[ 116 -lt 0 ]]
W1115 00:10:24.897] + local sleep_time=10
There is a ~3m gap between the cassandracluster being created and the first test pod doing anything. The test then proceeds to pass.
I'd guess that the machine was under load for a few minutes whilst Cassandra started up (~2m, perhaps due to a lack of memory), and that once it had, everything started to stabilise.
As part of #131, I increased the amount of memory given to the minikube VM to 4GB. However, I still don't think we should be failing here: all of the other tests passed, which means our software behaved as expected (despite taking longer).
(Worth noting that the first 5 or so attempts at running the test pod after the one shown in the logs above also failed, so my theory is that an apiserver timeout/things taking a long time caused the hang.) That test passed at W1115 00:11:26.818.
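For reference, the `retry` helper seen in the trace above can be reconstructed roughly as follows (a hedged sketch inferred from the `set -x` output; the real implementation may differ in detail):

```shell
# Hedged reconstruction of the traced retry helper: re-run a command until
# it succeeds or a deadline passes, e.g. `retry TIMEOUT=300 some_cmd args`.
retry() {
  local TIMEOUT=60
  local SLEEP=10
  # Consume leading VAR=value overrides before the command itself.
  while :; do
    case "${1}" in
      TIMEOUT=*) local TIMEOUT="${1#TIMEOUT=}"; shift;;
      SLEEP=*)   local SLEEP="${1#SLEEP=}"; shift;;
      *) break;;
    esac
  done
  local start_time
  start_time="$(date +%s)"
  local end_time=$(( start_time + TIMEOUT ))
  while true; do
    # Run the wrapped command; stop as soon as it succeeds.
    if "${@}"; then
      return 0
    fi
    local current_time
    current_time="$(date +%s)"
    local remaining_time=$(( end_time - current_time ))
    if [[ "${remaining_time}" -lt 0 ]]; then
      return 1
    fi
    local sleep_time="${SLEEP}"
    sleep "${sleep_time}"
  done
}
```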
/test e2e v1.7
/lgtm cancel //PR changed after LGTM, removing LGTM. @munnerz @wallrj
/test all
/lgtm
/test all
/lgtm cancel //PR changed after LGTM, removing LGTM. @munnerz @wallrj
/test e2e
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: munnerz
The full list of commands accepted by this bot can be found here.
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue.
Release note: