lecaillon / Evolve

Database migration tool for .NET and .NET Core projects. Inspired by Flyway.
https://evolve-db.netlify.com
MIT License

Cassandra: Add a cluster level lock to prevent any migration while the topology of the cluster is being modified #56

Closed. Pvlerick closed this issue 6 years ago.

Pvlerick commented 6 years ago

Following a discussion with @doanduyhai, it appears that it is an extremely bad idea to make any schema changes while the topology of the cluster is being modified (more on this here: https://www.slideshare.net/doanduyhai/cassandra-nice-use-cases-and-worst-anti-patterns-no-sqlmatters-barcelona/44).

The best approach seems to be an application-level lock (implementing CassandraCluster.TryAcquireApplicationLock) that checks for the existence of a predetermined keyspace and does not execute the migration if that keyspace exists.

Using this, a system administrator could create that keyspace to prevent any Evolve migration from taking place, then make changes to the topology, and finally delete the keyspace to release the lock and allow deployments again. It would be wise to wait for all in-flight deployments to complete before touching the topology, since some migrations could still be running: the only guarantee is that no new Evolve migration is going to start.
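A minimal sketch of the keyspace-as-lock idea in CQL, assuming Cassandra 3.0 or later (where keyspace metadata lives in system_schema.keyspaces). The keyspace name evolve_migration_lock is hypothetical; Evolve's lock check would treat a row returned by the SELECT as "lock held" and skip the migration.

```cql
-- Administrator, before changing the topology: create the (hypothetical) lock keyspace.
CREATE KEYSPACE IF NOT EXISTS evolve_migration_lock
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- Evolve, before migrating: the lock is held if this query returns a row.
SELECT keyspace_name
  FROM system_schema.keyspaces
 WHERE keyspace_name = 'evolve_migration_lock';

-- Administrator, once the topology change is complete: release the lock.
DROP KEYSPACE IF EXISTS evolve_migration_lock;
```

The replication settings hardly matter for this marker keyspace, since keyspace definitions are part of the cluster-wide schema rather than replicated table data.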

@doanduyhai, if you have any comments/suggestions ;-)

doanduyhai commented 6 years ago

Instead of creating/deleting the keyspace, you can just create a keyspace (ingenico_schema_change) and a table (lock) with the following schema:

CREATE TABLE ingenico_schema_change.lock (
   keyspace_to_lock text PRIMARY KEY
);

Before each topology change, do an INSERT with a TTL on this table (the TTL should be long enough to cover the duration of the topology change).

After the topology change, issue a DELETE to remove the lock.
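The lock/unlock cycle described above might look like this in CQL. It is a sketch that assumes the lock table lives in the ingenico_schema_change keyspace, uses a hypothetical keyspace name (my_app) as the lock key, and picks an arbitrary two-hour TTL. The IF NOT EXISTS / IF EXISTS clauses are an assumption on my part; they turn the statements into lightweight transactions, which is what the consistency-level discussion later in the thread refers to.

```cql
-- Before the topology change: take the lock for one keyspace.
-- The row expires on its own after 7200 s if nobody deletes it.
INSERT INTO ingenico_schema_change.lock (keyspace_to_lock)
VALUES ('my_app')
IF NOT EXISTS
USING TTL 7200;

-- After the topology change: release the lock.
DELETE FROM ingenico_schema_change.lock
 WHERE keyspace_to_lock = 'my_app'
IF EXISTS;
```

If the INSERT comes back with [applied] = False, someone else already holds the lock for that keyspace.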

Pvlerick commented 6 years ago

@doanduyhai ok, got it, thanks. One last question: any replication factor or consistency level advice on this?

doanduyhai commented 6 years ago

Standard replication factor of 3 should be fine.
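For the replication factor of 3 suggested above, the lock keyspace could be created as follows. NetworkTopologyStrategy and the data center names dc1 and dc2 are assumptions; they must match the names your snitch reports (for example in the output of nodetool status).

```cql
-- Hypothetical data center names: replace dc1/dc2 with your own.
CREATE KEYSPACE IF NOT EXISTS ingenico_schema_change
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3};
```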

For the lock/unlock operations, since you're using lightweight transactions (LWT), the only choices of serial consistency level are LOCAL_SERIAL and SERIAL.

LOCAL_SERIAL lets your lock operations survive the crash of a whole (remote) data center, because the Paxos round only involves replicas in the local data center. SERIAL lets your lock operations span the whole cluster, because it requires a strict majority of the replicas across all data centers.
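In cqlsh, the serial consistency level used by the Paxos phase of a lightweight transaction is set separately from the regular consistency level, which applies to the commit phase. A sketch of the two options described above; pairing them with LOCAL_QUORUM and QUORUM as the regular level is an assumption, not something stated in the thread. Client drivers expose an equivalent per-statement serial consistency setting.

```cql
-- cqlsh shell commands (not CQL statements); they apply to the rest of the session.

-- Option 1: Paxos confined to the local data center.
CONSISTENCY LOCAL_QUORUM;
SERIAL CONSISTENCY LOCAL_SERIAL;

-- Option 2: Paxos needs a majority of all replicas, across every data center.
CONSISTENCY QUORUM;
SERIAL CONSISTENCY SERIAL;
```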

Pvlerick commented 6 years ago

Ok. Now, I was under the impression that this lock would be for the whole cluster: I guess what we want is to block any structural change in any keyspace of the cluster while the topology changes, so we don't even need a keyspace_to_lock column, but only one row.

In short, this table will contain a single record when a topology change is in progress (so no migration should occur) and no record when it is safe to migrate. Is that correct?

doanduyhai commented 6 years ago

so we don't even need a keyspace_to_lock column but only one

Yes, correct. In this case, to block schema changes on the whole cluster, you need to use the SERIAL consistency level instead of LOCAL_SERIAL for your LWTs.
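A sketch of the single-row, cluster-wide variant, assuming Cassandra 3.0+ and cqlsh. The table name cluster_lock, its lock_name column, the constant key 'topology_change' and the TTL are all hypothetical: Cassandra still needs a primary key, so the "single record" is simply a row with a fixed key. The lightweight transactions and the check all run under SERIAL, as recommended above.

```cql
-- One-time setup: a table that holds at most one row.
CREATE TABLE IF NOT EXISTS ingenico_schema_change.cluster_lock (
  lock_name text PRIMARY KEY
);

-- Paxos phase over a majority of all replicas in all data centers.
SERIAL CONSISTENCY SERIAL;
-- Commit phase of the LWTs (SERIAL is not allowed here).
CONSISTENCY QUORUM;

-- Administrator, before the topology change (the TTL is a safety net).
INSERT INTO ingenico_schema_change.cluster_lock (lock_name)
VALUES ('topology_change')
IF NOT EXISTS
USING TTL 7200;

-- Evolve, before migrating: a SERIAL read; any row returned means "do not migrate".
CONSISTENCY SERIAL;
SELECT lock_name
  FROM ingenico_schema_change.cluster_lock
 WHERE lock_name = 'topology_change';

-- Administrator, after the topology change: back to QUORUM for the LWT commit.
CONSISTENCY QUORUM;
DELETE FROM ingenico_schema_change.cluster_lock
 WHERE lock_name = 'topology_change'
IF EXISTS;
```

The read is done at SERIAL so that it observes any Paxos round still in progress, rather than possibly missing a lock that was just taken.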

lecaillon commented 6 years ago

@Pvlerick Is this issue still open?