scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/
Other
48 stars 33 forks source link

Await schema ageement procedure is broken #3887

Closed Michal-Leszczynski closed 1 week ago

Michal-Leszczynski commented 1 week ago

There are 2 main problems with the current await schema agreement stage.

  1. It is bugged, as SM implementation uses 2 separate queries to get all schema versions. Those queries might be routed to different hosts in the cluster resulting in skipping schema_version from one node (the one to which the first query was routed):

    const (
        peerSchemasStmt = "SELECT schema_version FROM system.peers"
        localSchemaStmt = "SELECT schema_version FROM system.local WHERE key='local'"
    )
    
    return retry.WithNotify(ctx, func() error {
        var v []string
        if err := clusterSession.Query(peerSchemasStmt, nil).SelectRelease(&v); err != nil {
            return retry.Permanent(err)
        }
        var lv string
        if err := clusterSession.Query(localSchemaStmt, nil).GetRelease(&lv); err != nil {
            return retry.Permanent(err)
        }
  2. It works by checking schema agreement from just a single node POV - so in the same way as it is implemented in gocql. This is not enough in terms of ensuring that described schema is not older than the snapshoted data (see https://github.com/scylladb/scylladb/issues/19213#issue-2343849109 for more details).

In order to fix that, we could open separate, single host session to all backed up hosts and query their local system_schema version and await schema agreement on those values. The problem is that this is a passive way to await schema agreement - so if the schema changes keep on coming, it might never be achieved.

Another idea discussed in https://github.com/scylladb/scylladb/issues/19213 is to perform read barrier. It does not ensure schema agreement, but it ensures that the schema on the node on which the barrier was performed, is not older than the data snapshoted on all other nodes in the previous stages of the backup procedure - which is enough for backup purposes. Also, it is an active approach which should not take much time or retries.

The problem with this approach is that it won't work when the cluster has less then QUORUM nodes alive, as read barrier is a raft mechanism that doesn't work under such conditions. So when read barrier fails, SM won't backup schema, but it will proceed with uploading user data. In case there was a need to restore schema, it could still be done by the snapshoted system_schema sstables in a hacky/workaround way.

Perhaps in the future there will be Scylla API to issue read barrier, or querying DESC SCHEMA will do it on its own, but for now SM will go with the read barrier approach described above.

cc: @karol-kokoszka @tzach @mykaul