sky-uk / cqlmigrate

Cassandra schema migration library
BSD 3-Clause "New" or "Revised" License

Allow cluster health check to be optional #35

Closed lukaszgielecinski closed 7 years ago

lukaszgielecinski commented 8 years ago

At the moment cqlmigrate requires that all nodes in all datacentres are up before it will run. We would like to make this check optional so that the migration is still attempted when a node is down, using the supplied consistency level - which we plan to set to LOCAL_QUORUM for reads and EACH_QUORUM for writes for our migrations.
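For reference, the quorum arithmetic behind this request: with a replication factor of 3, a quorum is 2 replicas, so a single node down still allows LOCAL_QUORUM and EACH_QUORUM operations to succeed. A minimal sketch of that arithmetic (the function names are illustrative, not part of cqlmigrate):

```python
def quorum(replication_factor: int) -> int:
    # A quorum is a majority of replicas: floor(RF / 2) + 1.
    return replication_factor // 2 + 1

def local_quorum_achievable(rf_in_dc: int, live_replicas_in_dc: int) -> bool:
    # LOCAL_QUORUM only counts replicas in the coordinator's datacentre.
    return live_replicas_in_dc >= quorum(rf_in_dc)

def each_quorum_achievable(rf_by_dc: dict, live_by_dc: dict) -> bool:
    # EACH_QUORUM requires a quorum in every datacentre.
    return all(live_by_dc.get(dc, 0) >= quorum(rf)
               for dc, rf in rf_by_dc.items())

# RF=3 means a quorum of 2, so one replica down is tolerated:
print(local_quorum_achievable(3, 2))  # True
print(each_quorum_achievable({"dc1": 3, "dc2": 3},
                             {"dc1": 2, "dc2": 3}))  # True
```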

adamdougal commented 7 years ago

Can this now be closed? Did https://github.com/sky-uk/cqlmigrate/pull/37 fix this?

sebbonnet commented 7 years ago

I don't think the fix in #37 addresses this issue when schema upgrades are required.

When a node is down, schema upgrades can still be performed, although Cassandra will issue a warning stating that a "schema version mismatch was detected". When the node comes back up, the schema upgrade is applied to it and the schema version is synced. Here is the message I got when updating the schema with 1 node down and a replication factor of 3 in a 4-node cluster. I wonder if doing the same via the cql driver would throw a specific exception for this.

cirrus_admin@cqlsh:platform_tests> alter table mycqlmigratetable add colc text;

Warning: schema version mismatch detected, which might be caused by DOWN nodes; if this is not the case, check the schema versions of your nodes in system.local and system.peers.
OperationTimedOut: errors={}, last_host=10.50.98.145
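The warning points at comparing `schema_version` in `system.local` and `system.peers` (the DataStax Java driver also exposes `Metadata.checkSchemaAgreement()` for this). A minimal sketch of that comparison, illustrating why a DOWN node can mask a lagging schema (function name and shapes are assumptions, not cqlmigrate's actual check):

```python
def schema_in_agreement(local_version, peer_versions):
    # Peers that are down report no schema version (None here); since only
    # reachable peers can be compared, an unreachable node that is still on
    # an older schema does not show up as a disagreement.
    reachable = [v for v in peer_versions if v is not None]
    return all(v == local_version for v in reachable)

# All reachable nodes share one version, so agreement is reported even
# though the unreachable peer may lag behind:
print(schema_in_agreement("1ef20cc1", ["1ef20cc1", "1ef20cc1", None]))  # True
```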

New schema version applied to the 3 reachable nodes:

ubuntu@ip-10-50-99-79:~$ nodetool describecluster
Cluster Information:
    Name: sandbox
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        1ef20cc1-bed4-32d3-a323-36b4b0a99e3e: [10.50.100.67, 10.50.98.145, 10.50.99.79]

        UNREACHABLE: [10.50.102.20]

Schema version synced after restarting the down node:

ubuntu@ip-10-50-99-79:~$ nodetool describecluster
Cluster Information:
    Name: sandbox
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        1ef20cc1-bed4-32d3-a323-36b4b0a99e3e: [10.50.100.67, 10.50.98.145, 10.50.102.20, 10.50.99.79]

adamdougal commented 7 years ago

@sebbonnet - in a previous comment @chbatey said "I am hesitant to allow schema changes when nodes are down. It really isn't a good idea and it'll allow people to shoot themselves in the foot."

PR #37 means we now don't check cluster health if there are no schema changes, which is a definite improvement.

Unless we disagree with @chbatey's comment, presumably any further changes for this issue are inadvisable?

sebbonnet commented 7 years ago

@adamdougal sounds like we need to discuss this further with @chbatey and find out his reasons for being "hesitant". I searched on the web but could not find anything that would prove or disprove that it's OK, so I've asked the question on the IRC channel. Granted, the test I've done above is not realistic, so I'd like to try others on a cluster that's under load and bring up a node that is on a previous schema version.

#37 is definitely an improvement as it means most app deployments / restarts will not be impacted by a node being down.

lukaszgielecinski commented 7 years ago

From our perspective it can be closed, we approached the problem with a different solution.

sebbonnet commented 7 years ago

I've asked the question a couple of times on IRC, but had no response... I've discussed this with @chbatey, who said there used to be quite a few reported / unreported bugs in this area where node schemas would not be in agreement, so if we were to try this we would need to make sure nodes can recover from schema disagreement too. Doing a schema upgrade with a node down is probably not worth the risk at this stage, especially given that the majority of deployments will not be impacted following #37.