Closed wallrj closed 6 years ago
/test e2e
This appears to have failed due to:
I1114 21:33:00.155] E1114 21:31:46.183817 8 leaderelection.go:258] Failed to update lock: the server was unable to return a response in the time allotted, but may still be processing the request (put endpoints navigator-controller)
I'm starting to think we should either invert the lookup to check for error messages we do care about, or make this a non-failing check (e.g. just print messages that don't match the whitelist as an informational warning).
We really should only care about whether the tests pass. What do you think?
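For illustration, the two options could look something like this (a hedged sketch, not the actual implementation; the file names and helper names are made up, and the whitelist is assumed to be a file of grep patterns):

```shell
# List error-level log lines that don't match any whitelisted pattern.
# (All paths and names here are illustrative.)
unexpected_errors() {
  local log_file="$1" whitelist_file="$2"
  grep '^E' "${log_file}" | grep -v -f "${whitelist_file}" || true
}

# Current behaviour: the check fails if any unexpected error is found.
check_no_unexpected_errors() {
  [ -z "$(unexpected_errors "$1" "$2")" ]
}

# Non-failing alternative: print unexpected errors for information only,
# and always succeed.
report_unexpected_errors() {
  local unexpected
  unexpected="$(unexpected_errors "$1" "$2")"
  if [ -n "${unexpected}" ]; then
    echo "INFO: unexpected error lines (not failing the build):"
    echo "${unexpected}"
  fi
}
```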
/retest /lgtm /approve
/test all [submit-queue is verifying that this PR is safe to merge]
I find it useful to fail on these errors while I'm working on the Cassandra stuff.
It's interesting how often leader election seems to fail.
And since there is only a single navigator controller, that suggests something is wrong with our test cluster.
The second error looks like a bug to me: https://github.com/kubernetes/kubernetes/issues/42701 which is supposed to be fixed. Perhaps it suggests that we need to be using a newer version of client-go?
I1114 21:33:00.155] E1114 21:31:46.183817 8 leaderelection.go:258] Failed to update lock: the server was unable to return a response in the time allotted, but may still be processing the request (put endpoints navigator-controller)
I1114 21:33:00.156] E1114 21:31:46.186329 8 event.go:260] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'nav-e2e-navigator-controller-1608401658-tztrs-external-navigator-controller stopped leading'
If you prefer to remove the unexpected log error check in master that's fine, but I'll keep it enabled in the feature branch until it gets merged to master.
I don't know how you'd invert the search: we'd need to know, up front, all of the important errors. How would we have known to look specifically for E1102 12:26:09.653605 1 event.go:200] Server rejected event
in https://github.com/jetstack/navigator/issues/95?
The thing is, those errors are transient. It's okay if they happen, so long as the request to acquire the leader lock eventually succeeds. That's the whole point of Kubernetes having retries built in, and we already time out after a certain period if something fails, so we will already fail if something isn't "as expected".
The same goes for your final example error: if the server rejects an event, the event recorder retries sending it later. A test that checks the event is eventually recorded, with a suitable timeout, should pick this error up.
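As a hedged sketch, such a check could poll `kubectl get events` until the expected reason appears or a timeout expires (the namespace/reason values are illustrative, and `KUBECTL` is made overridable purely so the helper can be exercised without a cluster):

```shell
# Allow the kubectl binary to be substituted (e.g. with a stub in tests).
KUBECTL="${KUBECTL:-kubectl}"

# Poll the events in a namespace until one matching the expected reason
# appears, or the timeout (in seconds) expires.
wait_for_event() {
  local namespace="$1" reason="$2" timeout="${3:-300}" sleep_time="${4:-10}"
  local deadline=$(( $(date +%s) + timeout ))
  while [ "$(date +%s)" -le "${deadline}" ]; do
    if ${KUBECTL} get events --namespace "${namespace}" 2>/dev/null \
        | grep -q "${reason}"; then
      return 0
    fi
    sleep "${sleep_time}"
  done
  return 1
}
```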
Well, I don't want to labour the point, but
cannot patch events in the namespace "navigator-e2e-database1" (will not retry!)
But if it becomes more trouble than it's worth, I'm happy to remove the check from master.
That latest test failure was caused by #133
/test e2e v1.8
/test e2e v1.7
Looking at that most recent failure, it occurred at 00:11:52.077 according to the logs from the apiserver:
I1115 00:11:52.077] E1115 00:09:28.579761 5 leaderelection.go:224] error retrieving resource lock kube-system/navigator-controller: the server was unable to return a response in the time allotted, but may still be processing the request (get endpoints navigator-controller)
If you look at the logs from the build at around that time, we can see:
I1115 00:07:20.745] NAMESPACE: test-cassandracluster-1510704418-17903
I1115 00:07:20.745] STATUS: DEPLOYED
I1115 00:07:20.745]
I1115 00:07:20.745] RESOURCES:
I1115 00:07:20.746] ==> v1alpha1/CassandraCluster
I1115 00:07:20.746] NAME KIND
I1115 00:07:20.746] cassandra-1510704418-17903-cassandra CassandraCluster.v1alpha1.navigator.jetstack.io
I1115 00:07:20.746]
I1115 00:07:20.746]
W1115 00:07:20.857] + retry TIMEOUT=300 kube_service_responding test-cassandracluster-1510704418-17903 cass-cassandra-1510704418-17903-cassandra 9042
W1115 00:07:20.857] + local TIMEOUT=60
W1115 00:07:20.857] + local SLEEP=10
W1115 00:07:20.857] + :
W1115 00:07:20.857] + case "${1}" in
W1115 00:07:20.857] + local TIMEOUT=300
W1115 00:07:20.857] + shift
W1115 00:07:20.857] + :
W1115 00:07:20.857] + case "${1}" in
W1115 00:07:20.857] + break
W1115 00:07:20.857] ++ date +%s
W1115 00:07:20.857] + local start_time=1510704440
W1115 00:07:20.857] + local end_time=1510704740
W1115 00:07:20.857] + true
W1115 00:07:20.857] + kube_service_responding test-cassandracluster-1510704418-17903 cass-cassandra-1510704418-17903-cassandra 9042
W1115 00:07:20.858] + local namespace=test-cassandracluster-1510704418-17903
W1115 00:07:20.858] + local host=cass-cassandra-1510704418-17903-cassandra
W1115 00:07:20.858] + local port=9042
W1115 00:07:20.858] + kubectl run --namespace=test-cassandracluster-1510704418-17903 kube-service-responding-20946 --stdin=true --rm --restart=Never --image alpine -- nc -w 5 -v cass-cassandra-1510704418-17903-cassandra 9042
W1115 00:10:24.891] ++ date +%s
W1115 00:10:24.897] + local current_time=1510704624
W1115 00:10:24.897] + local remaining_time=116
W1115 00:10:24.897] + [[ 116 -lt 0 ]]
W1115 00:10:24.897] + local sleep_time=10
There is a ~3m gap between the cassandracluster being created and the first test pod doing anything. The test then proceeds to pass.
I'd guess that the machine was under load for a few minutes whilst Cassandra started up (~2m, perhaps due to a lack of memory), and that once it had, everything started to stabilise.
As part of #131, I increased the amount of memory given to the minikube VM to 4GB. However, I still don't think we should be failing here: all of the other tests passed, which means our software behaved as expected (despite taking longer).
(Worth noting that the first 5 or so attempts at running the test pod after the one shown in the logs above also failed, so my theory is that an apiserver timeout/things taking a long time caused the hang.) That test passed at W1115 00:11:26.818.
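For reference, the `retry` helper seen in the trace above can be reconstructed roughly as follows (a hedged sketch inferred from the `set -x` output; the real implementation may differ in detail):

```shell
# Hedged reconstruction of the traced retry helper: re-run a command until
# it succeeds or a deadline passes, e.g. `retry TIMEOUT=300 some_cmd args`.
retry() {
  local TIMEOUT=60
  local SLEEP=10
  # Consume leading VAR=value overrides before the command itself.
  while :; do
    case "${1}" in
      TIMEOUT=*) local TIMEOUT="${1#TIMEOUT=}"; shift;;
      SLEEP=*)   local SLEEP="${1#SLEEP=}"; shift;;
      *) break;;
    esac
  done
  local start_time
  start_time="$(date +%s)"
  local end_time=$(( start_time + TIMEOUT ))
  while true; do
    # Run the wrapped command; stop as soon as it succeeds.
    if "${@}"; then
      return 0
    fi
    local current_time
    current_time="$(date +%s)"
    local remaining_time=$(( end_time - current_time ))
    if [[ "${remaining_time}" -lt 0 ]]; then
      return 1
    fi
    local sleep_time="${SLEEP}"
    sleep "${sleep_time}"
  done
}
```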
/test e2e v1.7
/lgtm cancel //PR changed after LGTM, removing LGTM. @munnerz @wallrj
/test all
/lgtm
/test all
/lgtm cancel //PR changed after LGTM, removing LGTM. @munnerz @wallrj
/test e2e
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: munnerz
The full list of commands accepted by this bot can be found here.
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue.
Release note: