scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Restore after DC removal fails with agent error `agent [HTTP 500] Failed to load new sstables`, Scylla 6.0 (tablets enabled) #3896

Open mikliapko opened 1 week ago

mikliapko commented 1 week ago

Preconditions:

Steps:

  1. Run decommission of a node from the second DC;
  2. Run the restore operation: `sctool restore -c 402eb19b-123f-4759-9fe3-12dba4cf6a02 --restore-tables --location s3:backup-bucket --snapshot-tag sm_20240620175913UTC`
  3. Check the restore operation status (see the sketch below).
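
For reference, the status check in step 3 can be done with `sctool progress`; a rough sketch, where the task ID is a placeholder and the exact invocation may vary with the Scylla Manager version:

```sh
# Check progress/status of the restore task created in step 2.
# <restore-task-id> is a placeholder for the ID printed when the task was created.
sctool progress -c 402eb19b-123f-4759-9fe3-12dba4cf6a02 restore/<restore-task-id>
```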

Actual result: The operation status is ERROR:

Expected result: The operation status is DONE.

Environment:

Additional info:

Michal-Leszczynski commented 1 week ago

@mikliapko I would say that this looks like a Scylla related error. When doing a restore, SM successfully downloads sstables to node 127.0.81.1, but calling load&stream on it with primary_replica_only results in the following error. There is a chance that this is expected with raft topology enabled, but I don't know for sure.
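
For illustration, the load&stream call described above is roughly of the following shape; the endpoint and query parameters are assumed from Scylla's storage_service REST API and may differ between versions, and the keyspace/table names are the ones from this reproduction:

```sh
# Hedged sketch of a load&stream request with primary_replica_only, issued against
# the Scylla REST API on the node that received the downloaded sstables.
curl -s -X POST \
  "http://127.0.81.1:10000/storage_service/sstables/ks?cf=cf1&load_and_stream=true&primary_replica_only=true"
```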

@kbr-scylla does Scylla 6.0 support the scenario mentioned above?

kbr-scylla commented 1 week ago

@kbr-scylla does Scylla 6.0 support the scenario mentioned above?

I don't know. Since it was supported before, then... probably yes?

Does this restore task do something with the schema, or does it only restore data (assuming that the necessary keyspaces/tables are already created)?

If it's only for data (since you mentioned load&stream) -- then my primary suspect here would be tablets, because tablets significantly change how data is replicated across the cluster. You mentioned tablets are enabled -- I assume the keyspace you're trying to restore into has tablets enabled?
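
One quick way to answer the tablets question, assuming cqlsh access to a node and that the keyspace definition exposes a tablets option in this Scylla version (a sketch, not a definitive check):

```sh
# Look for a tablets-related option in the keyspace definition; the absence of a
# match does not necessarily prove tablets are disabled on every Scylla version.
cqlsh 127.0.81.1 -e "DESCRIBE KEYSPACE ks" | grep -i tablets
```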

If the answer to all of the above is yes -- then I would direct the question to the people dealing with load&stream and/or with tablets.

`git blame sstables_loader.cc` points to @xemul @raphaelsc @bhalevy

kbr-scylla commented 1 week ago

node2 logs show:

WARN  2024-06-21 13:02:10,008 [shard 1:strm] sstables_loader - load_and_stream: ops_uuid=4661748c-4ba8-46ac-9e50-201a51e18be0, ks=ks, table=cf1, send_phase, err=std::system_error (error system:22, bind: Invalid argument)

so at first glance it looks like a bug in the load&stream implementation

edit: node1 shows the same

WARN  2024-06-21 12:59:44,676 [shard 1:strm] sstables_loader - load_and_stream: ops_uuid=c98b7a6c-af58-4d13-8c38-f296fb950708, ks=ks, table=cf1, send_phase, err=std::system_error (error system:22, bind: Invalid argument)

kbr-scylla commented 1 week ago

There is a chance that this is expected with raft topology enabled, but I don't know for sure.

Where does your suspicion come from? Raft topology should have nothing to do with it.

You could potentially start your cluster with force_gossip_topology_changes (we left this option in for testing gossip-based topology) and try to restore into such a cluster if you suspect Raft topology is the cause.
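
A minimal sketch of what that could look like, assuming the option is accepted from scylla.yaml and is set on every node before the test cluster is first started (paths and service names depend on the deployment):

```sh
# Enable gossip-based topology changes for a fresh test cluster, then start Scylla.
echo "force_gossip_topology_changes: true" >> /etc/scylla/scylla.yaml
sudo systemctl start scylla-server
```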

bhalevy commented 1 week ago

node2 logs show:

WARN  2024-06-21 13:02:10,008 [shard 1:strm] sstables_loader - load_and_stream: ops_uuid=4661748c-4ba8-46ac-9e50-201a51e18be0, ks=ks, table=cf1, send_phase, err=std::system_error (error system:22, bind: Invalid argument)

so at first glance it looks like a bug in the load&stream implementation

edit: node1 shows the same

WARN  2024-06-21 12:59:44,676 [shard 1:strm] sstables_loader - load_and_stream: ops_uuid=c98b7a6c-af58-4d13-8c38-f296fb950708, ks=ks, table=cf1, send_phase, err=std::system_error (error system:22, bind: Invalid argument)

cc @asias @denesb

denesb commented 1 week ago

err=std::system_error (error system:22, bind: Invalid argument)

This looks like a seastar bug.

denesb commented 1 week ago

Looking at https://man7.org/linux/man-pages/man2/bind.2.html

According to the internet, errno 22 is EINVAL, which in the case of bind() means:

addrlen is wrong, or addr is not a valid address for this socket's domain.
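
To pin down which bind() call fails and with what arguments, one possible debugging step on an affected node, assuming strace is available there and its overhead is acceptable:

```sh
# Trace bind() across all Scylla threads and keep only the EINVAL failures;
# strace writes to stderr, hence the redirect. pgrep -o picks the parent PID.
sudo strace -f -e trace=bind -p "$(pgrep -o scylla)" 2>&1 | grep EINVAL
```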

Michal-Leszczynski commented 1 week ago

Where does your suspicion come from? Raft topology should have nothing to do with it.

I suspected that perhaps Raft doesn't work well when the nodes from one of the two DCs in the cluster are decommissioned, but it looks like that's not the case here. Thanks for taking a look at it!

Michal-Leszczynski commented 6 days ago

@denesb @asias should we move this issue to the scylla repo?

denesb commented 5 days ago

@denesb @asias should we move this issue to the scylla repo?

Yes.

Michal-Leszczynski commented 5 days ago

@mykaul could you please transfer this issue to the scylla repo (I don't have the permissions)?

tchaikov commented 1 hour ago

@mykaul hi Yaniv, ping?