jetstack / navigator

Managed Database-as-a-Service (DBaaS) on Kubernetes
Apache License 2.0
271 stars 31 forks source link

Intermittent Cassandra readiness probe test failure #160

Closed wallrj closed 6 years ago

wallrj commented 6 years ago

E2E tests failed in https://github.com/jetstack/navigator/pull/155

https://jetstack-build-infra.appspot.com/build/jetstack-logs/pr-logs/pull/jetstack_navigator/155/navigator-e2e-v1-8/210/?log#log

# Node 0 cassandra process stopped
W1130 10:23:51.403] + kube_simulate_unresponsive_process test-cassandracluster-1512036443-25797 cass-cassandra-1512036443-25797-cassandra-ringnodes-0 cassandra
W1130 10:23:51.404] + local namespace=test-cassandracluster-1512036443-25797
W1130 10:23:51.404] + local pod=cass-cassandra-1512036443-25797-cassandra-ringnodes-0
W1130 10:23:51.404] + local container=cassandra
W1130 10:23:51.404] + kubectl --namespace=test-cassandracluster-1512036443-25797 exec cass-cassandra-1512036443-25797-cassandra-ringnodes-0 --container=cassandra -- kill -SIGSTOP -- -1

# Node 1cassandra process reacts
I1130 10:25:02.092] INFO  10:24:10 InetAddress /172.17.0.8 is now DOWN
I1130 10:25:02.092] INFO  10:24:11 Handshaking version with cass-cassandra-1512036443-25797-cassandra-ringnodes-0.cass-cassandra-1512036443-25797-cassandra-seedprovider.test-cassandracluster-1512036443-25797.svc.cluster.local/172.17.0.8

# Test fails ~1min later after 4 connection attempts 
W1130 10:24:48.339] + kubectl run cql-responding-6034 --namespace=test-cassandracluster-1512036443-25797 --command=true --image=cassandra:latest --restart=Never --rm --stdin=false --attach=true -- /usr/bin/cqlsh --debug cass-cassandra-1512036443-25797-cassandra-cql 9043
W1130 10:24:50.702] If you don't see a command prompt, try pressing enter.
W1130 10:24:55.951] Connection error: ('Unable to connect to any servers', ['172.17.0.8', '10.0.0.237'])
W1130 10:24:57.218] pod test-cassandracluster-1512036443-25797/cql-responding-6034 terminated (Error)
W1130 10:24:57.224] ++ date +%s
W1130 10:24:57.225] + local current_time=1512037497
W1130 10:24:57.225] + local remaining_time=-5
W1130 10:24:57.225] + [[ -5 -lt 0 ]]
W1130 10:24:57.226] + return 1
W1130 10:24:57.226] + fail_test 'Cassandra readiness probe failed to bypass dead node'

Either the failing readiness probe doesn't cause the CQL service to update its DNS records fast enough. Or the test doesn't wait long enough for the DNS to be updated.

/kind bug