scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Restore after DC removal fails with agent error `agent [HTTP 500] Failed to load new sstables`, Scylla 6.0 (tablets enabled) #3896

Open mikliapko opened 1 week ago

mikliapko commented 1 week ago

Preconditions:

Steps:

  1. Run decommission of a node from the second DC;
  2. Run the restore operation: `sctool restore -c 402eb19b-123f-4759-9fe3-12dba4cf6a02 --restore-tables --location s3:backup-bucket --snapshot-tag sm_20240620175913UTC`
  3. Check the restore operation status (see the sketch below).
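
For reference, the status check in step 3 can be done with `sctool progress`; a rough sketch, where the task ID is a placeholder and the exact invocation may vary with the Scylla Manager version:

```sh
# Check progress/status of the restore task created in step 2.
# <restore-task-id> is a placeholder for the ID printed when the task was created.
sctool progress -c 402eb19b-123f-4759-9fe3-12dba4cf6a02 restore/<restore-task-id>
```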

Actual result: The operation status is ERROR:

Expected result: The operation status is DONE.

Environment:

Additional info:

Michal-Leszczynski commented 1 week ago

@mikliapko I would say that this looks like a Scylla related error. When doing a restore, SM successfully downloads sstables to node 127.0.81.1, but calling load&stream on it with primary_replica_only results in the following error. There is a chance that this is expected with raft topology enabled, but I don't know for sure.
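
For illustration, the load&stream call described above is roughly of the following shape; the endpoint and query parameters are assumed from Scylla's storage_service REST API and may differ between versions, and the keyspace/table names are the ones from this reproduction:

```sh
# Hedged sketch of a load&stream request with primary_replica_only, issued against
# the Scylla REST API on the node that received the downloaded sstables.
curl -s -X POST \
  "http://127.0.81.1:10000/storage_service/sstables/ks?cf=cf1&load_and_stream=true&primary_replica_only=true"
```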

@kbr-scylla does Scylla 6.0 support the scenario mentioned above?

kbr-scylla commented 1 week ago

@kbr-scylla does Scylla 6.0 support the scenario mentioned above?

I don't know. Since it was supported before, then... probably yes?

Does this restore task do something with the schema, or does it only restore data (assuming that the necessary keyspaces/tables are already created)?

If it's only for data (since you mentioned load&stream) -- then my primary suspect here would be tablets, because tablets significantly change how data is replicated across the cluster. You mentioned tablets are enabled -- I assume the keyspace you're trying to restore into has tablets enabled?
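
One quick way to answer the tablets question, assuming cqlsh access to a node and that the keyspace definition exposes a tablets option in this Scylla version (a sketch, not a definitive check):

```sh
# Look for a tablets-related option in the keyspace definition; the absence of a
# match does not necessarily prove tablets are disabled on every Scylla version.
cqlsh 127.0.81.1 -e "DESCRIBE KEYSPACE ks" | grep -i tablets
```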

If the answer to all of the above is yes -- then I would direct the question to the people dealing with load&stream and/or with tablets.

`git blame sstables_loader.cc` points to @xemul @raphaelsc @bhalevy

kbr-scylla commented 1 week ago

node2 logs show:

WARN  2024-06-21 13:02:10,008 [shard 1:strm] sstables_loader - load_and_stream: ops_uuid=4661748c-4ba8-46ac-9e50-201a51e18be0, ks=ks, table=cf1, send_phase, err=std::system_error (error system:22, bind: Invalid argument)

so at first glance it looks like a bug in the load&stream implementation

edit: node1 shows the same

WARN  2024-06-21 12:59:44,676 [shard 1:strm] sstables_loader - load_and_stream: ops_uuid=c98b7a6c-af58-4d13-8c38-f296fb950708, ks=ks, table=cf1, send_phase, err=std::system_error (error system:22, bind: Invalid argument)

kbr-scylla commented 1 week ago

There is a chance that this is expected with raft topology enabled, but I don't know for sure.

Where does your suspicion come from? Raft topology should have nothing to do with it.

You could potentially start your cluster with force_gossip_topology_changes (we left this option in for testing gossip-based topology) and try to restore into such a cluster if you suspect Raft topology is the cause.
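
A minimal sketch of what that could look like, assuming the option is accepted from scylla.yaml and is set on every node before the test cluster is first started (paths and service names depend on the deployment):

```sh
# Enable gossip-based topology changes for a fresh test cluster, then start Scylla.
echo "force_gossip_topology_changes: true" >> /etc/scylla/scylla.yaml
sudo systemctl start scylla-server
```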

bhalevy commented 1 week ago

node2 logs show:

WARN  2024-06-21 13:02:10,008 [shard 1:strm] sstables_loader - load_and_stream: ops_uuid=4661748c-4ba8-46ac-9e50-201a51e18be0, ks=ks, table=cf1, send_phase, err=std::system_error (error system:22, bind: Invalid argument)

so at first glance it looks like a bug in the load&stream implementation

edit: node1 shows the same

WARN  2024-06-21 12:59:44,676 [shard 1:strm] sstables_loader - load_and_stream: ops_uuid=c98b7a6c-af58-4d13-8c38-f296fb950708, ks=ks, table=cf1, send_phase, err=std::system_error (error system:22, bind: Invalid argument)

cc @asias @denesb

denesb commented 1 week ago

err=std::system_error (error system:22, bind: Invalid argument)

This looks like a seastar bug.

denesb commented 1 week ago

Looking at https://man7.org/linux/man-pages/man2/bind.2.html

According to the internet, errno 22 is EINVAL, which in the case of bind() means:

addrlen is wrong, or addr is not a valid address for this socket's domain.
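
To pin down which bind() call fails and with what arguments, one possible debugging step on an affected node, assuming strace is available there and its overhead is acceptable:

```sh
# Trace bind() across all Scylla threads and keep only the EINVAL failures;
# strace writes to stderr, hence the redirect. pgrep -o picks the parent PID.
sudo strace -f -e trace=bind -p "$(pgrep -o scylla)" 2>&1 | grep EINVAL
```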

Michal-Leszczynski commented 1 week ago

Where does your suspicion come from? Raft topology should have nothing to do with it.

I suspected that perhaps Raft doesn't work well when the nodes from one of the two DCs in the cluster are decommissioned, but it looks like that's not the case here. Thanks for taking a look at it!

Michal-Leszczynski commented 6 days ago

@denesb @asias should we move this issue to the scylla repo?

denesb commented 5 days ago

@denesb @asias should we move this issue to the scylla repo?

Yes.

Michal-Leszczynski commented 5 days ago

@mykaul could you please transfer this issue to the scylla repo (I don't have the permissions)?

tchaikov commented 1 hour ago

@mykaul hi Yaniv, ping?