sky-uk / cqlmigrate

Cassandra schema migration library
BSD 3-Clause "New" or "Revised" License

Cluster is considered unhealthy if some nodes are unreachable #48

Closed: slavaboiko closed this issue 6 years ago

slavaboiko commented 7 years ago

The ClusterHealth check reports the cluster as unhealthy if some nodes are unreachable, even if the configured consistencyLevel can still be satisfied.

Our app is healthy and able to serve clients, but cqlmigrate prevents it from starting if the cluster is considered unhealthy.
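To illustrate the difference between the two rules, here is a minimal sketch. The class and method names are hypothetical and are not cqlmigrate's or the Cassandra driver's API; the only fact relied on is that QUORUM needs floor(RF / 2) + 1 live replicas.

```java
import java.util.List;

// Hypothetical sketch, not cqlmigrate's actual ClusterHealth implementation.
final class HealthCheckSketch {

    /** Strict rule: a single unreachable node makes the whole cluster unhealthy. */
    static boolean allNodesUp(List<Boolean> nodeUp) {
        return nodeUp.stream().allMatch(up -> up);
    }

    /** Looser rule: enough replicas are alive to satisfy QUORUM for this replication factor. */
    static boolean quorumSatisfiable(int liveReplicas, int replicationFactor) {
        int quorum = replicationFactor / 2 + 1; // integer division, e.g. RF=3 gives 2
        return liveReplicas >= quorum;
    }
}
```

With three replicas and one of them down, `allNodesUp` returns false while `quorumSatisfiable(2, 3)` returns true, which is the situation described above.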

jsravn commented 7 years ago

@v-boiko It should only do that check if there are new migrations. This is necessary since schema updates in Cassandra generally require all nodes to be healthy to prevent data loss (we've had issues in the past with an old node disagreeing on schema). There has already been some discussion around this; see https://github.com/sky-uk/cqlmigrate/pull/35 and https://github.com/sky-uk/cqlmigrate/pull/37.
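Roughly, the intended flow looks like the sketch below. This is an illustration under assumptions, not the library's real code: `MigrationGateSketch`, `pending`, and `migrate` are made-up names, and the set of applied names stands in for whatever is recorded in the schema_updates table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: health only matters when there is something to apply.
final class MigrationGateSketch {

    /** Returns the available migrations that have not been applied yet, in order. */
    static List<String> pending(List<String> available, Set<String> applied) {
        List<String> result = new ArrayList<>();
        for (String name : available) {
            if (!applied.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    /** Applies pending migrations and returns how many were applied. */
    static int migrate(List<String> available, Set<String> applied, BooleanSupplier clusterHealthy) {
        List<String> todo = pending(available, applied);
        if (todo.isEmpty()) {
            return 0; // nothing new: the health check is skipped entirely
        }
        if (!clusterHealthy.getAsBoolean()) {
            throw new IllegalStateException("Cluster unhealthy, refusing to change schema");
        }
        applied.addAll(todo); // stand-in for actually executing the CQL files
        return todo.size();
    }
}
```

An app whose migrations are all applied already starts normally even with nodes down; only a startup that would change the schema is blocked.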

jsravn commented 7 years ago

I'm okay with adding an option to ignore health on schema migrate, but it is very much a do-at-your-own-risk type of thing, since it can lead to a few catastrophic situations. It is better to have the whole cluster up if possible when doing schema changes, although I realise that is probably not viable for large clusters (>100 nodes).

jsravn commented 7 years ago

https://github.com/sky-uk/cqlmigrate/issues/42 is also related - cqlmigrate will erroneously treat dead nodes as unhealthy.

slavaboiko commented 7 years ago

The approach the library takes right now is probably valid. We checked, and the required migrations were missing from the schema_updates table, so the library actually was trying to apply something. The best thing is probably just to sort out the cluster problems first.

Do you think we still need to acquire the lock then if the cluster is unhealthy?

jsravn commented 7 years ago

It isn't necessary to acquire the lock if the cluster is unhealthy, but the same thing is accomplished by ensuring we unlock if an exception is thrown.
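In other words, something along these lines. This is a sketch only: `MigrationLock` and `runLocked` are hypothetical names, not the library's actual locking classes.

```java
// Hypothetical sketch of "always unlock, even if the health check throws".
final class LockSketch {

    /** Stand-in for a cluster-wide migration lock. */
    interface MigrationLock {
        void lock();
        void unlock();
    }

    /** Runs the health check and migration under the lock, releasing it on any outcome. */
    static void runLocked(MigrationLock lock, Runnable healthCheckAndMigrate) {
        lock.lock();
        try {
            healthCheckAndMigrate.run(); // may throw if the cluster is unhealthy
        } finally {
            lock.unlock(); // runs on success and on exception, so the lock is never left held
        }
    }
}
```

The exception still propagates to the caller, so the app fails to start as before, but the lock is not left behind for the next instance to trip over.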

jsravn commented 6 years ago

I believe this is fixed in https://github.com/sky-uk/cqlmigrate/pull/53.