scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Flaky restore behavior in a case of enospc on one of the nodes #3907

Open mikliapko opened 6 days ago

mikliapko commented 6 days ago

I run the following test:

  1. Create backup task (keyspace1) and wait for DONE status;
  2. Reach enospc on one of the nodes;
  3. Create restore task and wait for task final status

The results of this test are unstable. In total, there were three attempts:

Attempt 1: Restore task status PASS. Environment: Scylla 6.0 (tablets enabled), Manager 3.3.0-dev. CI: https://jenkins.scylladb.com/job/scylla-staging/job/mikita/job/manager-master/job/test_enospc_before_restore/5/

Attempt 2: Restore task status PASS. Environment: Scylla 6.0 (tablets disabled), Manager 3.3.0-dev. CI: https://jenkins.scylladb.com/job/scylla-staging/job/mikita/job/manager-master/job/test_enospc_before_restore/6/

Attempt 3: Restore task status ERROR. Environment: Scylla 6.0 (tablets enabled), Manager 3.3.0-dev. CI: https://jenkins.scylladb.com/job/scylla-staging/job/mikita/job/manager-master/job/test_enospc_before_restore/8/

Command "sudo sctool  -c 3970cebd-91ae-4b54-8209-60361b254972 progress restore/f58bc267-cc7b-4f52-a7e9-7d46363f9a96" finished with status 0

Restore progress
Run:        b3d0dda5-339e-11ef-b94d-026cea77bb6b
Status:     ERROR (restoring backed-up data)
Cause:      not restored bundles [3ghb_0ph8_0vi9c21pobsvuli4m0 3ghb_0ph8_12sls2wklal4ocf8ig]: create run progress: validate free disk space: not enough disk space
Start time: 26 Jun 24 09:30:18 UTC
End time:   26 Jun 24 09:36:50 UTC
Duration:   6m31s
Progress:   0% | 0%
Snapshot Tag:   sm_20240626092405UTC

╭───────────┬──────────┬────────┬─────────┬────────────┬────────╮
│ Keyspace  │ Progress │   Size │ Success │ Downloaded │ Failed │
├───────────┼──────────┼────────┼─────────┼────────────┼────────┤
│ keyspace1 │  0% | 0% │ 2.882G │       0 │          0 │      0 │
╰───────────┴──────────┴────────┴─────────┴────────────┴────────╯

Some pre-history: before testing with Scylla release 6.0, we had a test that checks that a restore performed under the described conditions results in ERROR status. After switching to Scylla 6.0 we observed that the restore passes every time we execute the test. According to @karol-kokoszka, Manager checks the available disk space on the nodes that are expected to participate in the restore; if it's below the limit, the node won't participate in the restore. It looks like only one node reported not enough space, so the restore was performed by the other (working) ones. Maybe there was some change in 6.0 so that load&stream handles this situation better than 2024.1. Thus, it was decided to rework the test to expect a successful restore when testing with Scylla 6.0, but during test validation after the fixes the restore failed again.
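For illustration only, a minimal Python sketch of that gating behavior (Scylla Manager itself is written in Go; the threshold constant, function name, and host percentages below are assumptions for the example, not the actual implementation):

```python
DISK_SPACE_FREE_MIN_PERCENT = 10  # assumed default; configurable in scylla-manager.yaml per the discussion below

def pick_restore_hosts(free_percent_by_host: dict[str, int]) -> list[str]:
    """Keep only hosts whose reported free disk space is above the threshold.

    `free_percent_by_host` stands in for the per-host "Available disk space"
    value that the Manager logs for each node. If no host qualifies, the whole
    restore fails with a disk space error.
    """
    eligible = [host for host, pct in free_percent_by_host.items()
                if pct >= DISK_SPACE_FREE_MIN_PERCENT]
    if not eligible:
        raise RuntimeError("validate free disk space: not enough disk space")
    return eligible

# Example matching this issue: the node at 0% is skipped, the other two restore.
print(pick_restore_hosts({"10.12.0.15": 0, "10.12.2.14": 54, "10.12.2.132": 61}))
```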

Michal-Leszczynski commented 5 days ago

Some time ago we changed how SM returns errors that happen in parallel. Right now SM logs all of those errors, but returns only one of them in order to avoid printing a really long, incomprehensible error message. What @karol-kokoszka wrote is true, so all hosts would need to encounter some error in order for the restore procedure to fail. We can see the failures in the logs:

"Failed to restore files on host","host":"10.12.0.15","error":"create run progress: validate free disk space: not enough disk space"
"Failed to restore files on host","host":"10.12.2.14","error":"restore batch: call load and stream: giving up after 10 attempts: agent [HTTP 500] Failed to load new sstables: std::system_error (error system:111, Connection refused)"
"Failed to restore files on host","host":"10.12.2.132","error":"restore batch: call load and stream: giving up after 10 attempts: agent [HTTP 500] Failed to load new sstables: std::system_error (error system:111, Connection refused)"

So one host failed due to the expected lack of disk space, but the other two encountered some connectivity issue. What is strange is that those other hosts had managed to successfully restore some batches of sstables in the context of the previous restore run.
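To make the "log all errors, return only one" behavior concrete, here is a rough Python analogue of that pattern (SM is Go, so this is only a sketch; the hosts and messages are taken from the logs above, and the function names are made up for illustration):

```python
import concurrent.futures
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("restore")

# Stand-in for the per-host restore work; the failure messages mirror the logs above.
FAILURES = {
    "10.12.0.15": "create run progress: validate free disk space: not enough disk space",
    "10.12.2.14": "restore batch: call load and stream: ... Connection refused",
    "10.12.2.132": "restore batch: call load and stream: ... Connection refused",
}

def restore_files_on_host(host: str) -> None:
    if host in FAILURES:
        raise RuntimeError(FAILURES[host])

def restore_on_all_hosts(hosts: list[str]) -> None:
    errors = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {pool.submit(restore_files_on_host, h): h for h in hosts}
        for fut in concurrent.futures.as_completed(futures):
            try:
                fut.result()
            except Exception as err:           # every per-host failure is logged...
                log.error("Failed to restore files on host %s: %s", futures[fut], err)
                errors.append(err)
    if errors:                                 # ...but only one is returned, which is
        raise errors[0]                        # what `sctool progress` shows as Cause.

try:
    restore_on_all_hosts(list(FAILURES))
except RuntimeError as cause:
    print("reported cause:", cause)
```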

@mikliapko why are there 2 restore runs in the logs?

Michal-Leszczynski commented 5 days ago

Also, logs of 10.12.2.14 contain the following lines:

3d5cfbfd-603c-4656-8476-67804d9a6be8 Error: AccessDeniedException: User: arn:aws:sts::797456418907:assumed-role/qa-scylla-manager-backup-role/i-08dcd5eb5ae1866d8 is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:us-east-1:797456418907:instance/i-08dcd5eb5ae1866d8 because no identity-based policy allows the ssm:UpdateInstanceInformation action
2024-06-26T08:42:07.167+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | amazon-ssm-agent[375]:     status code: 400, request id: 3d5cfbfd-603c-4656-8476-67804d9a6be8
2024-06-26T08:42:07.167+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | amazon-ssm-agent[375]: 2024-06-26 08:42:06 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 797456418907
2024-06-26T08:42:07.167+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | amazon-ssm-agent[375]:     status code: 400, request id: 1fa12ab8-420b-4ab8-ac90-8c3210d1110e
2024-06-26T08:42:07.167+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | amazon-ssm-agent[375]: 2024-06-26 08:42:06 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity. Default Host Management Err: error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 797456418907
...
2024-06-26T09:30:44.917+00:00 manager-regression-manager--db-node-4c47f6f6-1  !WARNING | scylla[6532]:  [shard 0:strm] sstables_loader - load_and_stream: ops_uuid=e8ff5b7c-ec6c-4f84-94a3-f39eba0f1648, ks=keyspace1, table=standard1, send_phase, err=std::runtime_error (send_meta_data: got error from peer node=10.12.0.15)
2024-06-26T09:30:44.917+00:00 manager-regression-manager--db-node-4c47f6f6-1  !WARNING | scylla[6532]:  [shard 0:strm] sstables_loader - send_meta_data: failed to process source from node=10.12.0.15, err=std::runtime_error (send_meta_data: got error code=-1 from node=10.12.0.15)
2024-06-26T09:30:44.917+00:00 manager-regression-manager--db-node-4c47f6f6-1  !WARNING | scylla[6532]:  [shard 0:strm] sstables_loader - load_and_stream: ops_uuid=e8ff5b7c-ec6c-4f84-94a3-f39eba0f1648, ks=keyspace1, table=standard1, finish_phase, err=std::runtime_error (send_meta_data: got error code=-1 from node=10.12.0.15)
2024-06-26T09:30:44.917+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | scylla[6532]:  [shard 0:strm] sstables_loader - load_and_stream: ops_uuid=e8ff5b7c-ec6c-4f84-94a3-f39eba0f1648, ks=keyspace1, table=standard1, target_node=10.12.0.15, num_partitions_sent=220, num_bytes_sent=116160
2024-06-26T09:30:44.917+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | scylla[6532]:  [shard 0:strm] sstables_loader - load_and_stream: finished ops_uuid=e8ff5b7c-ec6c-4f84-94a3-f39eba0f1648, ks=keyspace1, table=standard1, partitions_processed=221 partitions, bytes_processed=116218 bytes, partitions_per_second=45015.484 partitions/s, bytes_per_second=22.5758 MiB/s, duration=0.004909422 s, status=failed
2024-06-26T09:30:44.917+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | scylla[6532]:  [shard 0: gms] gossip - Got shutdown message from 10.12.0.15, received_generation=1719393818, local_generation=1719393818
2024-06-26T09:30:44.917+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | scylla[6532]:  [shard 0: gms] gossip - InetAddress b6bb9f6c-9386-422c-8a0f-9efa118ff8dc/10.12.0.15 is now DOWN, status = shutdown
2024-06-26T09:30:49.667+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | scylla[6532]:  [shard 0:main] raft_group_registry - marking Raft server b6bb9f6c-9386-422c-8a0f-9efa118ff8dc as dead for raft groups
2024-06-26T09:30:51.167+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | scylla[6532]:  [shard 0:strm] raft_group0 - gaining leadership
2024-06-26T09:30:51.167+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | scylla[6532]:  [shard 0:strm] raft_topology - start topology coordinator fiber
2024-06-26T09:30:51.167+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | scylla[6532]:  [shard 0: gms] raft_topology - updating topology state: Starting new topology coordinator d61cbece-89ae-439a-9f3c-90687367ae81
2024-06-26T09:30:51.167+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | scylla[6532]:  [shard 0: gms] raft_topology - raft topology: Refreshing table load stats for DC us-east that has 3 endpoints
2024-06-26T09:30:51.167+00:00 manager-regression-manager--db-node-4c47f6f6-1  !WARNING | scylla[6532]:  [shard 0: gms] raft_topology - Found error while refreshing load stats for tablets: seastar::rpc::closed_error (connection is closed), retrying...
2024-06-26T09:30:52.417+00:00 manager-regression-manager--db-node-4c47f6f6-1   !NOTICE | sudo[9371]: scyllaadm : PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/coredumpctl -q --json=short
2024-06-26T09:30:52.417+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | sudo[9371]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=1000)
2024-06-26T09:30:52.417+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | sudo[9371]: pam_unix(sudo:session): session closed for user root
2024-06-26T09:31:10.417+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | scylla[6532]:  [shard 1:strm] stream_session - [Stream #d37a3f48-ca9a-45d3-baa6-71006cf6ffda] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=2000640, received_partitions=2000565
2024-06-26T09:31:10.417+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | scylla[6532]:  [shard 1:strm] sstables_loader - load_and_stream: ops_uuid=d37a3f48-ca9a-45d3-baa6-71006cf6ffda, ks=keyspace1, table=standard1, target_node=10.12.2.14, num_partitions_sent=2000565, num_bytes_sent=1056298320
2024-06-26T09:31:10.417+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | scylla[6532]:  [shard 1:strm] sstables_loader - load_and_stream: finished ops_uuid=d37a3f48-ca9a-45d3-baa6-71006cf6ffda, ks=keyspace1, table=standard1, partitions_processed=2000565 partitions, bytes_processed=1056298320 bytes, partitions_per_second=78022.73 partitions/s, bytes_per_second=39.287567 MiB/s, duration=25.640799 s, status=succeeded
2024-06-26T09:31:10.417+00:00 manager-regression-manager--db-node-4c47f6f6-1  !WARNING | scylla[6532]:  [shard 0:strm] sstables_loader - Done loading new SSTables for keyspace=keyspace1, table=standard1, load_and_stream=true, primary_replica_only=true, status=failed: std::runtime_error (send_meta_data: got error code=-1 from node=10.12.0.15)
2024-06-26T09:31:10.417+00:00 manager-regression-manager--db-node-4c47f6f6-1     !INFO | scylla-manager-agent[6554]: {"L":"ERROR","T":"2024-06-26T09:31:10.120Z","N":"http","M":"POST /storage_service/sstables/keyspace1?cf=standard1&load_and_stream=true&primary_replica_only=true","from":"10.12.0.78:55708","status":500,"bytes":132,"duration":"25650ms","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:84\nmain.(*logEntry).Write\n\tgithub.com/scylladb/scylla-manager/v3/pkg/cmd/agent/log.go:53\nmain.newRouter.RequestLogger.RequestLogger.func5.1.1\n\tgithub.com/go-chi/chi/v5@v5.0.0/middleware/logger.go:54\nmain.newRouter.RequestLogger.RequestLogger.func5.1\n\tgithub.com/go-chi/chi/v5@v5.0.0/middleware/logger.go:58\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2166\ngithub.com/go-chi/chi/v5.(*Mux).ServeHTTP\n\tgithub.com/go-chi/chi/v5@v5.0.0/mux.go:87\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:3137\nnet/http.(*conn).serve\n\tnet/http/server.go:2039"}

So the root cause for restore failure might be either:

  • some temporary connectivity issue in AWS
  • the enospc node is still needed for performing load&stream - other nodes might want to send/receive some metadata from it or just send the data to it

Michal-Leszczynski commented 5 days ago

Before testing with Scylla release 6.0, we had a test that checks that a restore performed under the described conditions results in ERROR status.

@mikliapko do you have logs from those runs? It looks like something indeed changed in the load&stream implementation, because I don't see any SM-related issue here.

Also, in case it's needed: the default disk space threshold for performing restore is 10%, but it can be configured in the scylla-manager.yaml file. I'm just wondering if it would be more useful to run this test with a node which has 5% free disk space (instead of 0%), as SM shouldn't use it for the restore, but it wouldn't block the other nodes from running it.
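If the test were reworked that way, a rough sketch of filling a filesystem so that a chosen percentage stays free could look like the following (the function name, fill-file name, and paths are illustrative, not the existing SCT helper; it relies on statvfs and fallocate):

```python
import os
import subprocess

def fill_to_free_percent(path: str, target_free_percent: float,
                         fill_file: str = "enospc.fill") -> None:
    """Allocate one big file on the filesystem holding `path` so that roughly
    `target_free_percent` of its total capacity stays free afterwards."""
    st = os.statvfs(path)
    total_bytes = st.f_blocks * st.f_frsize
    free_bytes = st.f_bavail * st.f_frsize
    keep_free = int(total_bytes * target_free_percent / 100)
    to_allocate = free_bytes - keep_free
    if to_allocate <= 0:
        return  # already at or below the target free percentage
    # fallocate reserves the blocks instantly without writing data.
    subprocess.run(
        ["fallocate", "-l", str(to_allocate), os.path.join(path, fill_file)],
        check=True,
    )

# Leave ~5% free on the Scylla data filesystem instead of driving it to 0%:
# fill_to_free_percent("/var/lib/scylla", target_free_percent=5)
```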

mikliapko commented 5 days ago

So the root cause for restore failure might be either:

  • some temporary connectivity issue in AWS
  • the enospc node is still needed for performing load&stream - other nodes might want to send/receive some metadata from it or just send the data to it

Hm, in case of cause 2 we would have experienced a failing restore every time we execute the test, which is not the case.

At the same time, it can't be a temporary connectivity issue, as the test had been failing several days in a row and that was the expected behavior, before I faced a successful restore a couple of times with Scylla 6.0 only.

mikliapko commented 5 days ago

@mikliapko do you have logs from those runs? It looks like something indeed changed in the load&stream implementation, because I don't see any SM-related issue here.

What specific logs do you need? I expect everything should be in job artifacts.

mikliapko commented 5 days ago

Also, in case it's needed: the default disk space threshold for performing restore is 10%, but it can be configured in the scylla-manager.yaml file. I'm just wondering if it would be more useful to run this test with a node which has 5% free disk space (instead of 0%), as SM shouldn't use it for the restore, but it wouldn't block the other nodes from running it.

As I see it, the approach used in this test fills 90% of free space: https://github.com/scylladb/scylla-cluster-tests/blob/master/sdcm/utils/common.py#L2741

Michal-Leszczynski commented 5 days ago

As I see it, the approach used in this test fills 90% of free space.

That's strange, because SM logs report that the node had 0% free disk space:

Jun 26 09:30:24 manager-regression-manager--monitor-node-4c47f6f6-1 scylla-manager[10624]: {"L":"INFO","T":"2024-06-26T09:30:24.341Z","N":"restore","M":"Available disk space","host":"10.12.0.15","percent":0,"_trace_id":"2S0HgU2sRQqlSBbb9izh9Q"}

Michal-Leszczynski commented 5 days ago

As I see it, the approach used in this test fills 90% of free space.

Looking at the implementation, I think that it fills 90% of available, not free space, so perhaps that's why SM reports a different value.
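For reference, both numbers can be read from statvfs; a small illustrative snippet (how the SCT helper actually derives its 90% is not verified here, this only shows why two tools can report different percentages for the same filesystem, and why a fill computed from capacity rather than from currently free space can leave the node near 0% free):

```python
import os

def disk_report(path: str = "/") -> dict:
    """Free vs. available percentages for the filesystem holding `path`."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    return {
        "total_bytes": total,
        # all unused blocks, including those reserved for root
        "free_percent": round(st.f_bfree * st.f_frsize / total * 100, 1),
        # unused blocks usable by unprivileged processes (df's "Avail" column)
        "available_percent": round(st.f_bavail * st.f_frsize / total * 100, 1),
    }

print(disk_report("/var/lib/scylla/data"))  # path assumed; any mount point works
```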

Michal-Leszczynski commented 5 days ago

What specific logs do you need? I expect everything should be in job artifacts.

Some pre-history: before testing with Scylla release 6.0, we had a test that checks that a restore performed under the described conditions results in ERROR status.

The logs from the jobs tested against older Scylla versions. I think the pasted pipeline has only the 6.0 tests.

mikliapko commented 5 days ago

What specific logs do you need? I expect everything should be in job artifacts.

Some pre-history: before testing with Scylla release 6.0, we had a test that checks that a restore performed under the described conditions results in ERROR status.

The logs from the jobs tested against older Scylla versions. I think the pasted pipeline has only the 6.0 tests.

True. Here you can find the logs for Scylla 2024.1

mikliapko commented 5 days ago

As I see it, the approach used in this test fills 90% of free space.

Looking at the implementation, I think that it fills 90% of available, not free space, so perhaps that's why SM reports a different value.

But as I understand it, it still shouldn't result in 0% free space, no matter how the 90% is filled.

Michal-Leszczynski commented 15 hours ago

From the 2024.1 logs we can see the 0% free disk space as well:

{"L":"INFO","T":"2024-07-01T20:36:49.780Z","N":"restore","M":"Available disk space","host":"10.15.1.222","percent":0,"_trace_id":"LySbxyP_Tz2bS0_W9NAubw"}

So one issue is definitely about the disk usage. I checked: SM actually checks the free space of the data dir, i.e. /var/lib/scylla/data, not the general /var/lib/scylla. Perhaps changing the approach_enospc function so that it fills the data dir instead would help; a rough sketch of the idea is below.
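A minimal sketch of that idea, assuming the usual node.remoter.run/sudo helpers from scylla-cluster-tests (the function name and fill-file name below are illustrative, not the existing approach_enospc code):

```python
import time

SCYLLA_DATA_DIR = "/var/lib/scylla/data"  # the directory whose filesystem SM actually inspects

def approach_enospc_on_data_dir(node, occupy_percent: int = 90) -> None:
    """Illustrative variant of the enospc helper: create the big file inside the
    Scylla data dir, so the filesystem that SM measures is the one being filled."""
    # Bytes currently available on the data dir filesystem.
    avail = int(node.remoter.run(
        f"df --output=avail -B1 {SCYLLA_DATA_DIR} | tail -1").stdout.strip())
    occupy_bytes = avail * occupy_percent // 100
    node.remoter.sudo(
        f"fallocate -l {occupy_bytes} "
        f"{SCYLLA_DATA_DIR}/occupy_{occupy_percent}percent.{int(time.time())}")
```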

It would be interesting to see if this results in SM printing the correct disk usage in the logs.

Another problem is test flakiness. I checked if the failing execution overloads the raft topology coordinator node, but it doesn't seem to be the case.

@mikliapko could we check if applying the mentioned changes to the function that overloads the node changes anything? In case SM correctly reported 0% free disk space, I don't know whether Scylla supports load&stream in a cluster with one node completely out of disk space (in the data dir) - it would need to be verified.