scylladb / scylla-manager


Backup command failed: `create local tmp directory: mkdir /tmp/scylla-manager-agent-197909592: no space left on device` #3243

ShlomiBalalis opened this issue 1 year ago · Status: Open

ShlomiBalalis commented 1 year ago

Installation details

Manager client version: 3.0.0-0.20220523.5501e5d7f53
Manager server version: 3.0.0-0.20220523.5501e5d7f53

Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 2022.2.0~rc2-20220919.75d087a2b75a with build-id 463f1a57b82041a6c6b6441f0cbc26c8ad93091e

Cluster size: 4 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0b6ff8cdcbe0cb88a (aws: us-east-1)

Test: longevity-lwt-500G-3d-test
Test id: 4a98679f-02ad-4c38-a717-833dd12453de
Test name: enterprise-2022.2/longevity/longevity-lwt-500G-3d-test
Test config file(s):

Issue description

In this run, there were three backup tasks, with the first one being successful and the other two failing instantly upon creation. The first, successful one:

< t:2022-09-30 02:30:34,240 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool backup -c e813a803-c47d-4640-8dd1-995e2d6dea4d --location s3:manager-backup-tests-us-east-1 " finished with status 0
< t:2022-09-30 02:30:34,240 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > sctool output: backup/a90c90b5-0a20-46c1-ad98-d159f8967bb6
...
< t:2022-09-30 04:48:41,676 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool  -c e813a803-c47d-4640-8dd1-995e2d6dea4d progress backup/a90c90b5-0a20-46c1-ad98-d159f8967bb6" finished with status 0
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > sctool output: Run:               dc2246d4-4067-11ed-902a-02610c34aa9d
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Status:           DONE
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Start time:       30 Sep 22 02:30:33 UTC
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > End time: 30 Sep 22 04:48:37 UTC
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Duration: 2h18m3s
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Progress: 100%
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Snapshot Tag:     sm_20220930023035UTC
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Datacenters:      
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG >   - us-east
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > 
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > ╭─────────────┬──────────┬──────────┬──────────┬──────────────┬────────╮
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ Host        │ Progress │     Size │  Success │ Deduplicated │ Failed │
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > ├─────────────┼──────────┼──────────┼──────────┼──────────────┼────────┤
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ 10.12.0.98  │     100% │ 399.674G │ 399.674G │            0 │      0 │
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ 10.12.1.240 │     100% │ 418.456G │ 418.456G │            0 │      0 │
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ 10.12.1.58  │     100% │ 411.186G │ 411.186G │            0 │      0 │
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ 10.12.2.68  │     100% │ 391.396G │ 391.396G │            0 │      0 │
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > ╰─────────────┴──────────┴──────────┴──────────┴──────────────┴────────╯

The other two attempts, however, failed when attempting to create a directory in the /tmp of node 1:

2022-10-01 03:56:24.917: (DisruptionEvent Severity.ERROR) period_type=end event_id=dff85646-3134-44cd-8055-6afc6691d396 duration=33s: nemesis_name=MgmtBackupSpecificKeyspaces target_node=Node longevity-lwt-500G-3d-2022-2-db-node-4a98679f-4 [3.234.206.161 | 10.12.2.68] (seed: False) errors=Encountered an error on sctool command: backup -c e813a803-c47d-4640-8dd1-995e2d6dea4d --keyspace drop_table_during_repair_ks_0,lwt_builtin_function_test,drop_table_during_repair_ks_3,cqlstress_lwt_example,drop_table_during_repair_ks_1,drop_table_during_repair_ks_2,drop_table_during_repair_ks_7,drop_table_during_repair_ks_6,keyspace1,drop_table_during_repair_ks_5,drop_table_during_repair_ks_9,drop_table_during_repair_ks_4,drop_table_during_repair_ks_8  --location s3:manager-backup-tests-us-east-1 : Encountered a bad command exit code!

Command: 'sudo sctool backup -c e813a803-c47d-4640-8dd1-995e2d6dea4d --keyspace drop_table_during_repair_ks_0,lwt_builtin_function_test,drop_table_during_repair_ks_3,cqlstress_lwt_example,drop_table_during_repair_ks_1,drop_table_during_repair_ks_2,drop_table_during_repair_ks_7,drop_table_during_repair_ks_6,keyspace1,drop_table_during_repair_ks_5,drop_table_during_repair_ks_9,drop_table_during_repair_ks_4,drop_table_during_repair_ks_8  --location s3:manager-backup-tests-us-east-1 '

Exit code: 1

Stdout:

Stderr:

Error: create backup target: location is not accessible
 10.12.1.240: giving up after 2 attempts: agent [HTTP 500] create local tmp directory: mkdir /tmp/scylla-manager-agent-197909592: no space left on device - make sure the location is correct and credentials are set, to debug SSH to 10.12.1.240 and run "scylla-manager-agent check-location -L s3:manager-backup-tests-us-east-1 --debug"
Trace ID: uaRqNqggRx2F_d1G2FpUjw (grep in scylla-manager logs)
2022-10-01 11:23:55.564: (DisruptionEvent Severity.ERROR) period_type=end event_id=9e8fdf27-8818-473b-ab52-3f73a8b98c45 duration=34s: nemesis_name=MgmtBackupSpecificKeyspaces target_node=Node longevity-lwt-500G-3d-2022-2-db-node-4a98679f-4 [3.234.206.161 | 10.12.2.68] (seed: False) errors=Encountered an error on sctool command: backup -c e813a803-c47d-4640-8dd1-995e2d6dea4d --keyspace drop_table_during_repair_ks_0,lwt_builtin_function_test,drop_table_during_repair_ks_3,cqlstress_lwt_example,drop_table_during_repair_ks_1,drop_table_during_repair_ks_2,drop_table_during_repair_ks_7,drop_table_during_repair_ks_6,keyspace1,drop_table_during_repair_ks_5,drop_table_during_repair_ks_9,drop_table_during_repair_ks_4,drop_table_during_repair_ks_8  --location s3:manager-backup-tests-us-east-1 : Encountered a bad command exit code!

Command: 'sudo sctool backup -c e813a803-c47d-4640-8dd1-995e2d6dea4d --keyspace drop_table_during_repair_ks_0,lwt_builtin_function_test,drop_table_during_repair_ks_3,cqlstress_lwt_example,drop_table_during_repair_ks_1,drop_table_during_repair_ks_2,drop_table_during_repair_ks_7,drop_table_during_repair_ks_6,keyspace1,drop_table_during_repair_ks_5,drop_table_during_repair_ks_9,drop_table_during_repair_ks_4,drop_table_during_repair_ks_8  --location s3:manager-backup-tests-us-east-1 '

Exit code: 1

Stdout:

Stderr:

Error: create backup target: location is not accessible
 10.12.1.240: giving up after 2 attempts: agent [HTTP 500] create local tmp directory: mkdir /tmp/scylla-manager-agent-970583882: no space left on device - make sure the location is correct and credentials are set, to debug SSH to 10.12.1.240 and run "scylla-manager-agent check-location -L s3:manager-backup-tests-us-east-1 --debug"
Trace ID: uic6UGgITb6kpG2v6RlmNQ (grep in scylla-manager logs)

From node 1:

scyllaadm@longevity-lwt-500G-3d-2022-2-db-node-4a98679f-1:~$ scylla-manager-agent check-location -L s3:manager-backup-tests-us-east-1 --debug
{"L":"INFO","T":"2022-10-03T05:51:02.340Z","N":"rclone","M":"registered s3 provider [name=s3, region=us-east-1, chunk_size=50M, memory_pool_flush_time=5m, disable_checksum=true, no_check_bucket=true, memory_pool_use_mmap=true, provider=AWS, env_auth=true, upload_concurrency=2]"}
{"L":"INFO","T":"2022-10-03T05:51:02.340Z","N":"rclone","M":"registered gcs provider [name=gcs, memory_pool_use_mmap=true, bucket_policy_only=true, chunk_size=50M, memory_pool_flush_time=5m, allow_create_bucket=false]"}
{"L":"INFO","T":"2022-10-03T05:51:02.340Z","N":"rclone","M":"registered azure provider [name=azure, chunk_size=50M, memory_pool_flush_time=5m, use_msi=true, memory_pool_use_mmap=true, disable_checksum=true]"}
{"L":"DEBUG","T":"2022-10-03T05:51:02.341Z","N":"rclone","M":"Creating backend with remote \"s3:manager-backup-tests-us-east-1\""}
FAILED: create local tmp directory: mkdir /tmp/scylla-manager-agent-986322445: no space left on device

scyllaadm@longevity-lwt-500G-3d-2022-2-db-node-4a98679f-1:~$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        29G   29G     0 100% /

There was no other nemesis on node 1, so I can't think of any other reason why /tmp would be full. What I also don't understand is why the manager needs to create a temp directory at all, given that the snapshots are saved in /var/lib/scylla/data.

Sadly, the machines are long gone, and almost no logs were saved from the run other than our own. The manager server and agent logs, however, should be interleaved into sct.log.

Logs:

https://cloudius-jenkins-test.s3.amazonaws.com/4a98679f-02ad-4c38-a717-833dd12453de/20221002_145214/sct-runner.tar.gz

Jenkins job URL

SCT issue for reference: https://github.com/scylladb/scylla-cluster-tests/issues/5344

roydahan commented 1 year ago

Why does the manager use /tmp? Do we have any space requirement for /tmp?

@karol-kokoszka can you please look into this?

karol-kokoszka commented 1 year ago

Hey @roydahan. Scylla Manager tries to create a directory and an empty file in it as part of a permissions check. It was introduced to address https://github.com/scylladb/scylla-manager-enterprise/issues/1509

Why is it the /tmp/ directory? I don't know; most likely Rclone (the library used to transfer files between storages) needs access there. If you need a fully detailed answer, I can dig deeper.
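Roughly, the probe amounts to something like this (an illustrative shell equivalent, not the actual agent code; the temp-directory prefix just mirrors the error messages above):

# Create a temp directory under /tmp and an empty file inside it, then clean up.
# With / at 100%, the very first step already fails with "no space left on device".
tmpdir=$(mktemp -d /tmp/scylla-manager-agent-XXXXXXXXX) || exit 1
touch "$tmpdir/probe" || exit 1
rm -rf "$tmpdir"
echo "local tmp is writable"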

This is an edge case:

scyllaadm@longevity-lwt-500G-3d-2022-2-db-node-4a98679f-1:~$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        29G   29G     0 100% /

^^^ We should never reach the point where the disk is 100% utilized.

roydahan commented 1 year ago

I agree it's an edge case. I thought we had filled up /tmp with snapshots or manager-related files.

@ShlomiBalalis do we know what filled up the filesystem? How much free space did we have at the beginning of the test, and at what point did it fill up? (You should have this information in the monitor's OS dashboard.)
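For reference, the same data can also be pulled straight from the monitor's Prometheus (a sketch; <monitor-ip> is a placeholder, 9090 assumes the default Prometheus port, and the metric name assumes a recent node_exporter):

# Free bytes on the root filesystem of every node across the test window.
curl -sG 'http://<monitor-ip>:9090/api/v1/query_range' \
  --data-urlencode 'query=node_filesystem_avail_bytes{mountpoint="/"}' \
  --data-urlencode 'start=2022-09-30T00:00:00Z' \
  --data-urlencode 'end=2022-10-03T00:00:00Z' \
  --data-urlencode 'step=1h'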

mykaul commented 1 year ago

@ShlomiBalalis ? Let's understand how / got to 100% first.

karol-kokoszka commented 1 year ago

@ShlomiBalalis Is this test available somewhere in Jenkins so that it can be re-run? The only files Scylla Manager creates are the ones produced when the manager calls Scylla to take a snapshot. The only way for Scylla Manager to fail here is if the snapshots are not cleaned up, but there is a move command that copies them to the backup bucket. Is there a chance to check whether any leftovers remain after the backup?
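If a node from a reproduction is still reachable, a quick way to look for leftover snapshots (assuming the default data directory /var/lib/scylla/data mentioned above):

# Snapshots Scylla knows about, with their on-disk size.
nodetool listsnapshots

# Snapshot directories left on disk, with their sizes.
sudo find /var/lib/scylla/data -type d -name snapshots -exec du -sh {} + | sort -h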

roydahan commented 9 months ago

We had this issue again in https://argus.scylladb.com/test/8d7705a3-b3b8-4abd-b0e2-a0e6938c1e15/runs?additionalRuns[]=601ae9bb-6284-44b4-972e-56c816587eea.

Something is filling up the root FS; it's hard to tell what exactly. It happened after 3 days of running and 91 nemeses...
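For the next occurrence, something like the following should show where the root FS space went (standard GNU coreutils; -x keeps du from crossing into other mounts such as /var/lib/scylla, if that is a separate filesystem):

# Largest directories on the root filesystem, two levels deep.
sudo du -xh --max-depth=2 / 2>/dev/null | sort -h | tail -n 25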

roydahan commented 9 months ago

Restoring the monitor to try and check when and where the root FS was filled. From the nemesis failures, I can see it happened on 4 different nodes.

The newest one is longevity-mv-si-4d-2024-1-db-node-601ae9bb-14, which had only 2 nemeses run on it: disrupt_repair_streaming_err and disrupt_mgmt_restore.

roydahan commented 9 months ago

Trying to correlate the nemeses with the free bytes on the root FS (mountpoint="/"), focusing on node-1, I saw that the space drops (and is never reclaimed) in 2 scenarios:

  1. Reboots
  2. Manager operations

[screenshot: 2024-01-03 at 2:15:28]

roydahan commented 9 months ago

Another example from node-3, same behavior. The major drop there is from "rebuild_streaming_error", which involves a reboot of that node (node-3).

[screenshot: 2024-01-03 at 2:23:05]

roydahan commented 9 months ago

Another one from node-14: drops during repair_streaming_err and mgmt_repair.

[screenshot: 2024-01-03 at 2:26:53]

roydahan commented 9 months ago

I tried correlating it with almost every metric I could find in node_exporter (swap, files, memory, xfs, disk) and couldn't find anything straightforward...

One possibility is logging: maybe one of the logs (or all of them) doesn't have a rotation policy, or maybe the size limit is too high. Does the manager-agent log have a rotation policy?
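A few checks that could answer that on the next run (paths and defaults are the usual ones, not verified on these nodes; the agent normally logs to the systemd journal):

# Space used by the systemd journal and whether a cap is configured.
journalctl --disk-usage
grep -E 'SystemMaxUse|MaxFileSec' /etc/systemd/journald.conf

# Size of the plain log files and the logrotate policies covering them.
sudo du -sh /var/log/* 2>/dev/null | sort -h | tail -n 10
ls /etc/logrotate.d/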