scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/
Other
51 stars 34 forks source link

Backup of single dc might skip token ranges from keyspace with SimpleStrategy #3922

Closed Michal-Leszczynski closed 1 month ago

Michal-Leszczynski commented 3 months ago

Validating that all token ranges of backed up table are replicated on at least 1 of the nodes used for backup is done only when there are some down nodes in backed up data centers.

    // Validate that there are enough live nodes to back up all tokens
    if len(liveNodes) < len(nodes) { // compares live nodes and all nodes from backed up data centers
        hosts := strset.New(liveNodes.Hosts()...)
        for _, ks := range target.Units {
            for _, tab := range ks.Tables {
                r := rings[ks.Keyspace+"."+tab]
                if r.Replication != scyllaclient.LocalStrategy {
                    for _, rt := range r.ReplicaTokens {
                        if !hosts.HasAny(rt.ReplicaSet...) {
                            return nil, errors.Errorf("not enough live nodes to backup keyspace %s", ks.Keyspace)
                        }
                    }
                }
            }
        }
        }

This makes sense for NetworkTopologyStrategy, but not for SimpleStrategy.

Let's say we have 2 dcs 3 nodes each and we want to backup the whole dc1. Even when there is no user keyspace with SimpleStrategy (as it is not recommended), some of the system keyspaces (e.g. system_distributed) have the default SimpleStrategy with RF: 3. This means, that running sctool backup --dc dc1 would result in only partial backup of system_distributed keyspace (some missing rows), as the replica set consisting of nodes from dc2 wouldn't make it into the backup.

So backup should fail if some keyspace might be only partially (entirely missing some token ranges) backed up. The problem is about backward compatibility - always running this check could break previously working backups defined like sctool backup --dc dc1, which could result in some alerts and not backing up the more important user data. Fixing it from user POV is simple (exclude problematic keyspace or change its replication strategy), but it requires manual action.

Possible solutions:

@tzach what do you think about this case?

Michal-Leszczynski commented 2 months ago

@karol-kokoszka going with the spirit of https://github.com/scylladb/scylla-manager/issues/3995, I believe that we should put correctness first, and backward compatibility second, so I added this issue to the 3.3.2 milestone. It should also be backported to all supported versions.

karol-kokoszka commented 2 months ago

so I added this issue to the 3.3.2 milestone.

Sure, if you think it's an easy fix, then let's include that.