sky-uk / cqlmigrate

Cassandra schema migration library
BSD 3-Clause "New" or "Revised" License

Fail gracefully if lock quorum is not met at start of run #87

Closed: davidh87 closed this 4 years ago

davidh87 commented 4 years ago

We've run into this problem a few times in our dev environments. When the cluster is in a state where quorum cannot be met for the CQL updates, cqlmigrate executes as follows:

While this behaviour is great, it means that if the cluster recovers by itself (e.g. an unavailable node is restarted), the locks must be released manually before the jobs can be re-run. The recently added ability to specify lock consistency levels has effectively the same behaviour, since the create-lock call will fail but the lock will still be inserted on nodes.

We'd like to suggest the following behaviour change to help cqlmigrate recover more gracefully:

In this scenario, if the lock consistency level cannot be met, the locks read will fail, causing cqlmigrate to fail without holding a lock, so the whole process is easily retryable. Should an error occur when obtaining locks or executing CQL updates, the lock continues to be held in the same way as today.
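To make the difference concrete, here is a toy model of the idea as a minimal Python sketch. It is not cqlmigrate's code and does not use a Cassandra driver: `FakeCluster`, `QuorumNotMet`, `acquire_lock` and `run` are hypothetical names invented for illustration, and the cluster only models the two behaviours described in this issue (a failed create-lock call can still leave a lock row on nodes, while a failed read leaves nothing behind).

```python
class QuorumNotMet(Exception):
    """Raised when too few replicas are up to satisfy the consistency level."""


class FakeCluster:
    """Hypothetical stand-in for a cluster; models only what this issue describes."""

    def __init__(self, replicas, live):
        self.replicas = replicas
        self.live = live
        self.lock_rows = 0  # number of replicas currently holding a lock row

    def _has_quorum(self):
        return self.live >= self.replicas // 2 + 1

    def read_locks(self):
        # A read writes nothing, so a failure here leaves no state behind.
        if not self._has_quorum():
            raise QuorumNotMet("locks read failed")
        return self.lock_rows > 0

    def insert_lock(self):
        # Assumption taken from the issue text: the call can fail
        # while the lock row is still inserted on the live nodes.
        self.lock_rows = self.live
        if not self._has_quorum():
            raise QuorumNotMet("create-lock failed")


def acquire_lock(cluster, precheck):
    """Try to take the lock; with precheck=True, read the locks first."""
    if precheck and cluster.read_locks():  # raises before any write happens
        return False  # lock already held by another run
    cluster.insert_lock()
    return True


def run(precheck):
    """Attempt a migration against a degraded cluster; return leftover lock rows."""
    cluster = FakeCluster(replicas=3, live=1)
    try:
        acquire_lock(cluster, precheck)
    except QuorumNotMet:
        pass  # the run fails either way; what differs is the state left behind
    return cluster.lock_rows
```

In this model, `run(precheck=False)` fails and leaves a lock row that has to be cleaned up by hand, whereas `run(precheck=True)` fails at the read and leaves nothing, so the job can simply be retried once the cluster recovers.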

I'm unsure whether this would be considered a breaking change. It changes behaviour in that a run may fail with no lock held, but from our perspective that is an improved feature rather than a fundamental behaviour change.

Assuming this is acceptable, we're happy to raise a PR for the changes, but wanted to open it up for discussion first.

adamdougal commented 4 years ago

That sounds good to me. I don't think it's a breaking change that would require a major version bump. As you say, it's an improvement on existing behaviour.

davidh87 commented 4 years ago

This was addressed and released in 0.10.3.