sky-uk / cqlmigrate

Cassandra schema migration library
BSD 3-Clause "New" or "Revised" License

Travis Build Test Failure #69

Closed: adamdougal closed this issue 6 years ago

adamdougal commented 7 years ago

Builds are currently intermittently failing due to:

uk.sky.cqlmigrate.ClusterHealthTest > shouldThrowExceptionIfHostIsDown FAILED
    java.lang.NullPointerException at ClusterHealthTest.java:70

https://travis-ci.org/sky-uk/cqlmigrate/builds/279515435#L1065

adamdougal commented 6 years ago

After adding extra logging, it looks like a race condition is causing this failure:

10:51:39.299 [Test worker] DEBUG uk.sky.cqlmigrate.ClusterHealthTest - Stopping Scassandra
10:51:39.325 [cluster20-reconnection-0] ERROR c.d.driver.core.ControlConnection - [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
10:51:39.325 [cluster20-reconnection-0] DEBUG com.datastax.driver.core.Host.STATES - [Control connection] next reconnection attempt in 1000 ms
10:51:39.330 [cluster20-nio-worker-3] DEBUG com.datastax.driver.core.Connection - Connection[localhost/127.0.0.1:37299-4, inFlight=0, closed=true] closing connection
10:51:39.334 [Test worker] INFO  o.scassandra.server.ServerStubRunner - Server is shut down
10:51:39.334 [Test worker] DEBUG uk.sky.cqlmigrate.ClusterHealthTest - Stopped Scassandra
10:51:39.334 [Test worker] DEBUG uk.sky.cqlmigrate.ClusterHealthTest - Checking Cassandra
10:51:39.336 [Test worker] DEBUG uk.sky.cqlmigrate.ClusterHealth - Cassandra hosts: [localhost/127.0.0.1:37299]
10:51:39.352 [Test worker] DEBUG uk.sky.cqlmigrate.ClusterHealth - All Cassandra hosts healthy
10:51:39.352 [Test worker] DEBUG uk.sky.cqlmigrate.ClusterHealthTest - Checked Cassandra
10:51:39.352 [cluster20-worker-0] DEBUG com.datastax.driver.core.Host.STATES - [localhost/127.0.0.1:37299] marking host DOWN

Even though Scassandra has been stopped and the Cassandra driver is attempting to reconnect, the driver's host state isn't updated until after we've run the health check. As there's no other way of knowing the state of a node, my proposed fix is to retry our check multiple times, or to wait before checking, as sketched below.
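
A minimal sketch of the "retry the check" idea. The `RetrySupport` helper and the assertion passed to it are hypothetical, not code from this repository; the real test would wrap its existing ClusterHealth assertion instead of the placeholder described in the comments.

```java
import java.util.concurrent.TimeUnit;

class RetrySupport {

    /**
     * Re-runs an assertion until it passes or the attempts are exhausted,
     * sleeping between attempts to give the driver time to mark the stopped
     * host DOWN.
     */
    static void retry(Runnable assertion, int maxAttempts, long delayMillis) throws InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                assertion.run();   // e.g. assert that the health check now reports the host as down
                return;            // assertion passed: the driver has caught up
            } catch (AssertionError e) {
                if (attempt == maxAttempts) {
                    throw e;       // still failing after the final attempt, surface the failure
                }
                TimeUnit.MILLISECONDS.sleep(delayMillis);
            }
        }
    }
}
```

In the test this would wrap the check that currently runs exactly once, e.g. `RetrySupport.retry(() -> checkClusterReportsHostDown(), 10, 500)` (the checked assertion name here is illustrative), so the check is repeated until the driver has marked the stopped host DOWN instead of racing the reconnection logic.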