Closed lukaszgielecinski closed 7 years ago
Can this now be closed? Did https://github.com/sky-uk/cqlmigrate/pull/37 fix this?
I don't think fix #37 addresses this issue when schema upgrades are required.
When a node is down, schema ugprades can still be performed although cassandra will issue a warning stating that a "schema version mismatch was detected". When the node comes back up the schema upgrade is applied to the node and the schema version is synced. Here is the message I got when updating the schema with 1 node down, replication factor of 3 in a 4-node cluster. I wonder if doing the same via the cqldriver would throw a specific exception for this
cirrus_admin@cqlsh:platform_tests> alter table mycqlmigratetable add colc text;
Warning: schema version mismatch detected, which might be caused by DOWN nodes; if this is not the case, check the schema versions of your nodes in system.local and system.peers.
OperationTimedOut: errors={}, last_host=10.50.98.145
New schema version applied to the 3 nodes
ubuntu@ip-10-50-99-79:~$ nodetool describecluster
Cluster Information:
Name: sandbox
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
1ef20cc1-bed4-32d3-a323-36b4b0a99e3e: [10.50.100.67, 10.50.98.145, 10.50.99.79]
UNREACHABLE: [10.50.102.20]
Schema version synced after restarting the down node:
ubuntu@ip-10-50-99-79:~$ nodetool describecluster
Cluster Information:
Name: sandbox
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
1ef20cc1-bed4-32d3-a323-36b4b0a99e3e: [10.50.100.67, 10.50.98.145, 10.50.102.20, 10.50.99.79]
@sebbonnet - in a previous comment @chbatey said "I am hesitant to allow schema changes when nodes are down. It really isn't a good idea and it'll allow people to shoot them selves in the foot."
PR #37 means we now don't check cluster health if there are no schema changes, which is a definite improvement.
Unless we disagree with @chbatey's comment, then presumably any further changes towards this issue are inadvisable?
@adamdougal sounds like we need to discuss this further with @chbatey and find out its reasons for being "hesitant". I searched on the web but could not find anything that would prove or disprove it's ok, so I've asked the question on the IRC channel. Granted the test I've done above is not realistic, so I'd like to try other on a cluster that's under load and bring up a node that is on a previous schema version.
From our perspective it can be closed, we approached the problem with a different solution.
I've asked the question a couple of times on IRC, but had no response... I've discussed this with @chbatey who said used to be quite a few bugs reported / unreported on this area where the node schemas would not be in agreement, so if we were to try we would need make sure nodes can recover from schema disagreement too. So doing schema upgrade with a node down is probably not worth the risk at this stage, especially given the fact that the majority of deployments will not need be impacted following #37
At the moment cqlmigrate requires that all nodes in all datacentres are up before it will run. We would like to make this check optional so that it will still run if a node is down so that the migration will be attempted with the supplied consistency level - which we will plan to set to LOCAL_QUORUM for read and EACH_QUORUM for write for our migrations.