scylladb / scylla-manager


Backup command failed: `create local tmp directory: mkdir /tmp/scylla-manager-agent-197909592: no space left on device` #3243

ShlomiBalalis opened this issue 1 year ago · Status: Open

ShlomiBalalis commented 1 year ago

Installation details

Manager client version: 3.0.0-0.20220523.5501e5d7f53
Manager server version: 3.0.0-0.20220523.5501e5d7f53

Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 2022.2.0~rc2-20220919.75d087a2b75a with build-id 463f1a57b82041a6c6b6441f0cbc26c8ad93091e

Cluster size: 4 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0b6ff8cdcbe0cb88a (aws: us-east-1)

Test: longevity-lwt-500G-3d-test
Test id: 4a98679f-02ad-4c38-a717-833dd12453de
Test name: enterprise-2022.2/longevity/longevity-lwt-500G-3d-test
Test config file(s):

Issue description

In this run, there were three backup tasks, with the first one being successful and the other two failing instantly upon creation. The first, successful one:

< t:2022-09-30 02:30:34,240 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool backup -c e813a803-c47d-4640-8dd1-995e2d6dea4d --location s3:manager-backup-tests-us-east-1 " finished with status 0
< t:2022-09-30 02:30:34,240 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > sctool output: backup/a90c90b5-0a20-46c1-ad98-d159f8967bb6
...
< t:2022-09-30 04:48:41,676 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool  -c e813a803-c47d-4640-8dd1-995e2d6dea4d progress backup/a90c90b5-0a20-46c1-ad98-d159f8967bb6" finished with status 0
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > sctool output: Run:               dc2246d4-4067-11ed-902a-02610c34aa9d
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Status:           DONE
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Start time:       30 Sep 22 02:30:33 UTC
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > End time: 30 Sep 22 04:48:37 UTC
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Duration: 2h18m3s
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Progress: 100%
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Snapshot Tag:     sm_20220930023035UTC
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Datacenters:      
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG >   - us-east
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > 
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > ╭─────────────┬──────────┬──────────┬──────────┬──────────────┬────────╮
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ Host        │ Progress │     Size │  Success │ Deduplicated │ Failed │
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > ├─────────────┼──────────┼──────────┼──────────┼──────────────┼────────┤
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ 10.12.0.98  │     100% │ 399.674G │ 399.674G │            0 │      0 │
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ 10.12.1.240 │     100% │ 418.456G │ 418.456G │            0 │      0 │
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ 10.12.1.58  │     100% │ 411.186G │ 411.186G │            0 │      0 │
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ 10.12.2.68  │     100% │ 391.396G │ 391.396G │            0 │      0 │
< t:2022-09-30 04:48:41,677 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > ╰─────────────┴──────────┴──────────┴──────────┴──────────────┴────────╯

The other two attempts, however, failed when attempting to create a directory in the /tmp of node 1:

2022-10-01 03:56:24.917: (DisruptionEvent Severity.ERROR) period_type=end event_id=dff85646-3134-44cd-8055-6afc6691d396 duration=33s: nemesis_name=MgmtBackupSpecificKeyspaces target_node=Node longevity-lwt-500G-3d-2022-2-db-node-4a98679f-4 [3.234.206.161 | 10.12.2.68] (seed: False) errors=Encountered an error on sctool command: backup -c e813a803-c47d-4640-8dd1-995e2d6dea4d --keyspace drop_table_during_repair_ks_0,lwt_builtin_function_test,drop_table_during_repair_ks_3,cqlstress_lwt_example,drop_table_during_repair_ks_1,drop_table_during_repair_ks_2,drop_table_during_repair_ks_7,drop_table_during_repair_ks_6,keyspace1,drop_table_during_repair_ks_5,drop_table_during_repair_ks_9,drop_table_during_repair_ks_4,drop_table_during_repair_ks_8  --location s3:manager-backup-tests-us-east-1 : Encountered a bad command exit code!

Command: 'sudo sctool backup -c e813a803-c47d-4640-8dd1-995e2d6dea4d --keyspace drop_table_during_repair_ks_0,lwt_builtin_function_test,drop_table_during_repair_ks_3,cqlstress_lwt_example,drop_table_during_repair_ks_1,drop_table_during_repair_ks_2,drop_table_during_repair_ks_7,drop_table_during_repair_ks_6,keyspace1,drop_table_during_repair_ks_5,drop_table_during_repair_ks_9,drop_table_during_repair_ks_4,drop_table_during_repair_ks_8  --location s3:manager-backup-tests-us-east-1 '

Exit code: 1

Stdout:

Stderr:

Error: create backup target: location is not accessible
 10.12.1.240: giving up after 2 attempts: agent [HTTP 500] create local tmp directory: mkdir /tmp/scylla-manager-agent-197909592: no space left on device - make sure the location is correct and credentials are set, to debug SSH to 10.12.1.240 and run "scylla-manager-agent check-location -L s3:manager-backup-tests-us-east-1 --debug"
Trace ID: uaRqNqggRx2F_d1G2FpUjw (grep in scylla-manager logs)
2022-10-01 11:23:55.564: (DisruptionEvent Severity.ERROR) period_type=end event_id=9e8fdf27-8818-473b-ab52-3f73a8b98c45 duration=34s: nemesis_name=MgmtBackupSpecificKeyspaces target_node=Node longevity-lwt-500G-3d-2022-2-db-node-4a98679f-4 [3.234.206.161 | 10.12.2.68] (seed: False) errors=Encountered an error on sctool command: backup -c e813a803-c47d-4640-8dd1-995e2d6dea4d --keyspace drop_table_during_repair_ks_0,lwt_builtin_function_test,drop_table_during_repair_ks_3,cqlstress_lwt_example,drop_table_during_repair_ks_1,drop_table_during_repair_ks_2,drop_table_during_repair_ks_7,drop_table_during_repair_ks_6,keyspace1,drop_table_during_repair_ks_5,drop_table_during_repair_ks_9,drop_table_during_repair_ks_4,drop_table_during_repair_ks_8  --location s3:manager-backup-tests-us-east-1 : Encountered a bad command exit code!

Command: 'sudo sctool backup -c e813a803-c47d-4640-8dd1-995e2d6dea4d --keyspace drop_table_during_repair_ks_0,lwt_builtin_function_test,drop_table_during_repair_ks_3,cqlstress_lwt_example,drop_table_during_repair_ks_1,drop_table_during_repair_ks_2,drop_table_during_repair_ks_7,drop_table_during_repair_ks_6,keyspace1,drop_table_during_repair_ks_5,drop_table_during_repair_ks_9,drop_table_during_repair_ks_4,drop_table_during_repair_ks_8  --location s3:manager-backup-tests-us-east-1 '

Exit code: 1

Stdout:

Stderr:

Error: create backup target: location is not accessible
 10.12.1.240: giving up after 2 attempts: agent [HTTP 500] create local tmp directory: mkdir /tmp/scylla-manager-agent-970583882: no space left on device - make sure the location is correct and credentials are set, to debug SSH to 10.12.1.240 and run "scylla-manager-agent check-location -L s3:manager-backup-tests-us-east-1 --debug"
Trace ID: uic6UGgITb6kpG2v6RlmNQ (grep in scylla-manager logs)

From node 1:

scyllaadm@longevity-lwt-500G-3d-2022-2-db-node-4a98679f-1:~$ scylla-manager-agent check-location -L s3:manager-backup-tests-us-east-1 --debug
{"L":"INFO","T":"2022-10-03T05:51:02.340Z","N":"rclone","M":"registered s3 provider [name=s3, region=us-east-1, chunk_size=50M, memory_pool_flush_time=5m, disable_checksum=true, no_check_bucket=true, memory_pool_use_mmap=true, provider=AWS, env_auth=true, upload_concurrency=2]"}
{"L":"INFO","T":"2022-10-03T05:51:02.340Z","N":"rclone","M":"registered gcs provider [name=gcs, memory_pool_use_mmap=true, bucket_policy_only=true, chunk_size=50M, memory_pool_flush_time=5m, allow_create_bucket=false]"}
{"L":"INFO","T":"2022-10-03T05:51:02.340Z","N":"rclone","M":"registered azure provider [name=azure, chunk_size=50M, memory_pool_flush_time=5m, use_msi=true, memory_pool_use_mmap=true, disable_checksum=true]"}
{"L":"DEBUG","T":"2022-10-03T05:51:02.341Z","N":"rclone","M":"Creating backend with remote \"s3:manager-backup-tests-us-east-1\""}
FAILED: create local tmp directory: mkdir /tmp/scylla-manager-agent-986322445: no space left on device

scyllaadm@longevity-lwt-500G-3d-2022-2-db-node-4a98679f-1:~$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        29G   29G     0 100% /

There was no other nemesis on node 1, so I can't think of any other reason why /tmp would be full. What I also don't understand is why the manager needs to create a temp directory at all, given that the snapshots are saved in /var/lib/scylla/data.

Sadly, the machines are long gone, and almost no logs were saved from the run other than our own. The manager server and agent logs, however, should be interleaved into sct.log.

Logs:

https://cloudius-jenkins-test.s3.amazonaws.com/4a98679f-02ad-4c38-a717-833dd12453de/20221002_145214/sct-runner.tar.gz

Jenkins job URL

SCT issue for reference: https://github.com/scylladb/scylla-cluster-tests/issues/5344

roydahan commented 1 year ago

Why does the manager use /tmp? Do we have any space requirement for /tmp?

@karol-kokoszka can you please look into this?

karol-kokoszka commented 1 year ago

Hey @roydahan. Scylla Manager tries to create a directory and an empty file in it as part of a permissions check. It was introduced to address https://github.com/scylladb/scylla-manager-enterprise/issues/1509

Why is it the /tmp/ directory? I don't know; most likely Rclone (the library used to transfer files between storages) needs access there. If you need a fully detailed answer, I can dig deeper.
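Roughly, the probe amounts to something like this (an illustrative shell equivalent, not the actual agent code; the temp-directory prefix just mirrors the error messages above):

# Create a temp directory under /tmp and an empty file inside it, then clean up.
# With / at 100%, the very first step already fails with "no space left on device".
tmpdir=$(mktemp -d /tmp/scylla-manager-agent-XXXXXXXXX) || exit 1
touch "$tmpdir/probe" || exit 1
rm -rf "$tmpdir"
echo "local tmp is writable"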

This is an edge case:

scyllaadm@longevity-lwt-500G-3d-2022-2-db-node-4a98679f-1:~$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        29G   29G     0 100% /

^^^ We should never reach the point where the disk is 100% utilized.

roydahan commented 1 year ago

I agree it's an edge case. I thought we had filled up /tmp with snapshots or manager-related files.

@ShlomiBalalis do we know what filled up the filesystem? How much free space did we have at the beginning of the test, and at what point did it fill up? (You should have this information in the monitor's OS dashboard.)
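For reference, the same data can also be pulled straight from the monitor's Prometheus (a sketch; <monitor-ip> is a placeholder, 9090 assumes the default Prometheus port, and the metric name assumes a recent node_exporter):

# Free bytes on the root filesystem of every node across the test window.
curl -sG 'http://<monitor-ip>:9090/api/v1/query_range' \
  --data-urlencode 'query=node_filesystem_avail_bytes{mountpoint="/"}' \
  --data-urlencode 'start=2022-09-30T00:00:00Z' \
  --data-urlencode 'end=2022-10-03T00:00:00Z' \
  --data-urlencode 'step=1h'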

mykaul commented 1 year ago

@ShlomiBalalis ? Let's understand how / got to 100% first.

karol-kokoszka commented 1 year ago

@ShlomiBalalis Is this test available somewhere in Jenkins so that it can be re-run? The only files Scylla Manager creates are the ones produced when the manager calls Scylla to take a snapshot. The only way for Scylla Manager to fail here is if the snapshots are not cleaned up, but there is a move command that copies them to the backup bucket. Is there a chance to check whether any leftovers remain after the backup?
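If a node from a reproduction is still reachable, a quick way to look for leftover snapshots (assuming the default data directory /var/lib/scylla/data mentioned above):

# Snapshots Scylla knows about, with their on-disk size.
nodetool listsnapshots

# Snapshot directories left on disk, with their sizes.
sudo find /var/lib/scylla/data -type d -name snapshots -exec du -sh {} + | sort -h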

roydahan commented 9 months ago

We had this issue again in https://argus.scylladb.com/test/8d7705a3-b3b8-4abd-b0e2-a0e6938c1e15/runs?additionalRuns[]=601ae9bb-6284-44b4-972e-56c816587eea.

Something is filling up the root FS; it's hard to tell what exactly. It happened after 3 days of running and 91 nemeses...
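For the next occurrence, something like the following should show where the root FS space went (standard GNU coreutils; -x keeps du from crossing into other mounts such as /var/lib/scylla, if that is a separate filesystem):

# Largest directories on the root filesystem, two levels deep.
sudo du -xh --max-depth=2 / 2>/dev/null | sort -h | tail -n 25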

roydahan commented 9 months ago

Restoring the monitor to try and check when and where the root FS was filled. From the nemesis failures, I can see it happened on 4 different nodes.

The newest one is longevity-mv-si-4d-2024-1-db-node-601ae9bb-14, which had only 2 nemeses run on it: disrupt_repair_streaming_err and disrupt_mgmt_restore.

roydahan commented 9 months ago

Trying to correlate the nemeses with the free bytes on the root FS (mountpoint="/"), focusing on node-1, I saw that the space drops (and is never reclaimed) in 2 scenarios:

  1. Reboots
  2. Manager operations

[screenshot: 2024-01-03 at 2:15:28]

roydahan commented 9 months ago

Another example from node-3, same behavior. The major drop there is from "rebuild_streaming_error", which involves a reboot of that node (node-3).

[screenshot: 2024-01-03 at 2:23:05]

roydahan commented 9 months ago

Another one from node-14: drops during repair_streaming_err and mgmt_repair.

[screenshot: 2024-01-03 at 2:26:53]

roydahan commented 9 months ago

I tried correlating it with almost every metric I could find in node_exporter (swap, files, memory, xfs, disk) and couldn't find anything straightforward...

One possibility is logging: maybe one of the logs (or all of them) doesn't have a rotation policy, or maybe the size limit is too high. Does the manager-agent log have a rotation policy?
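A few checks that could answer that on the next run (paths and defaults are the usual ones, not verified on these nodes; the agent normally logs to the systemd journal):

# Space used by the systemd journal and whether a cap is configured.
journalctl --disk-usage
grep -E 'SystemMaxUse|MaxFileSec' /etc/systemd/journald.conf

# Size of the plain log files and the logrotate policies covering them.
sudo du -sh /var/log/* 2>/dev/null | sort -h | tail -n 10
ls /etc/logrotate.d/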