Open mikliapko opened 1 week ago
@mikliapko I would say that this looks like a Scylla-related error.
When doing a restore, SM successfully downloads sstables to node 127.0.81.1, but calling load&stream on it with primary_replica_only results in the following error. There is a chance that this is expected with raft topology enabled, but I don't know for sure.
@kbr-scylla does Scylla 6.0 support the scenario mentioned above?
I don't know. Since it was supported before, then... probably yes?
Does this restore task do something with schema, or only restores data (assuming that the necessary keyspace/tables are already created)?
If it's only for data (since you mentioned load&stream) -- then my primary suspect here would be tablets, because tablets significantly change how data is replicated across the cluster. You mentioned tablets are enabled -- I assume that keyspace you're trying to restore into has tablets enabled?
If the answer to all of these is yes -- I would direct the question to the people dealing with load&stream and/or with tablets.
git blame sstables_loader.cc points to @xemul @raphaelsc @bhalevy
node2 logs show:
WARN 2024-06-21 13:02:10,008 [shard 1:strm] sstables_loader - load_and_stream: ops_uuid=4661748c-4ba8-46ac-9e50-201a51e18be0, ks=ks, table=cf1, send_phase, err=std::system_error (error system:22, bind: Invalid argument)
so at first glance it looks like a bug in the load&stream implementation
edit: node1 shows the same
WARN 2024-06-21 12:59:44,676 [shard 1:strm] sstables_loader - load_and_stream: ops_uuid=c98b7a6c-af58-4d13-8c38-f296fb950708, ks=ks, table=cf1, send_phase, err=std::system_error (error system:22, bind: Invalid argument)
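To compare these warnings across nodes, the fields can be pulled out of each line with a small script. This is only a sketch: the regular expression is written against the two log lines quoted in this thread, not against any documented Scylla log format, and the helper name `parse_load_and_stream_warning` is made up for illustration.

```python
import re

# Assumption: the pattern mirrors the sstables_loader warnings quoted in this
# thread; other Scylla versions may format the line differently.
LOAD_AND_STREAM_RE = re.compile(
    r"load_and_stream: ops_uuid=(?P<ops_uuid>[0-9a-f-]+), "
    r"ks=(?P<ks>[^,]+), table=(?P<table>[^,]+), "
    r"(?P<phase>\w+), err=(?P<err>.*)$"
)


def parse_load_and_stream_warning(line):
    """Return a dict of the warning's fields, or None if the line does not match."""
    match = LOAD_AND_STREAM_RE.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "WARN 2024-06-21 13:02:10,008 [shard 1:strm] sstables_loader - "
        "load_and_stream: ops_uuid=4661748c-4ba8-46ac-9e50-201a51e18be0, "
        "ks=ks, table=cf1, send_phase, "
        "err=std::system_error (error system:22, bind: Invalid argument)"
    )
    print(parse_load_and_stream_warning(sample))
```

Grouping the parsed records by `err` and `phase` would show quickly whether every node fails in the same phase with the same error, as node1 and node2 do here.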
> There is a chance that this is expected with raft topology enabled, but I don't know for sure.
Where does your suspicion come from? Raft topology should have nothing to do with it.
You could potentially start your cluster with force_gossip_topology_changes (we left in this option for testing gossip-based topology) and try to restore into such a cluster if you suspect Raft topology is the cause.
cc @asias @denesb
err=std::system_error (error system:22, bind: Invalid argument)
This looks like a seastar bug.
Looking at https://man7.org/linux/man-pages/man2/bind.2.html:
According to the internet, 22 is EINVAL, which in the case of bind() means:
addrlen is wrong, or addr is not a valid address for this socket's domain.
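The errno mapping can be checked locally with Python's standard library. The sketch below is not a reproduction of the Scylla failure. It only confirms that 22 is EINVAL, and shows one other EINVAL case that the same man page lists for bind(): calling it on a socket that is already bound. Both helper names are hypothetical, and the second one assumes a Linux-like host with a usable loopback interface.

```python
import errno
import os
import socket


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def double_bind_errno():
    """Bind one TCP socket twice and return the errno raised by the second bind().

    Assumption: Linux-like behaviour, where re-binding an already bound
    socket fails with EINVAL. Returns None if the second call succeeds.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the kernel pick a free port
        try:
            sock.bind(("127.0.0.1", 0))
        except OSError as exc:
            return exc.errno
    return None


if __name__ == "__main__":
    print(describe_errno(22))
    print(double_bind_errno() == errno.EINVAL)
```

Which of the EINVAL conditions applies to the load&stream failure is not established in this thread; the sketch only illustrates what the error code means.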
> Where does your suspicion come from? Raft topology should have nothing to do with it.
I was suspecting that perhaps raft doesn't work well when nodes from 1 out of 2 DCs in the cluster are decommissioned, but it looks like it's not the case here. Thanks for taking a look at it!
@denesb @asias should we move this issue to the scylla repo?
Yes.
@mykaul could you please transfer this issue to the scylla repo (I don't have the permissions)?
@mykaul hi Yaniv, ping?
Preconditions:
Steps:
sctool restore -c 402eb19b-123f-4759-9fe3-12dba4cf6a02 --restore-tables --location s3:backup-bucket --snapshot-tag sm_20240620175913UTC
Actual result: The operation status is ERROR:
not restored bundles [3gh5_1dyp_0wd4g2jf2xnbid1e2x 3gh5_1dyp_0wd4g215xa2kyrkjgp]: restore batch: call load and stream: giving up after 10 attempts: agent [HTTP 500] Failed to load new sstables: std::system_error (error system:22, bind: Invalid argument)
Expected result: The operation status is DONE.
Environment:
Additional info: