scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

Clean snapshots of passed backup nemeses, to avoid non-intentional ENOSPC #5344

Closed: juliayakovlev closed this issue 1 year ago

juliayakovlev commented 2 years ago

https://argus.scylladb.com/test/8831dfed-1945-4e7e-a0c6-6d3f848868b4/runs?additionalRuns%5B%5D=4a98679f-02ad-4c38-a717-833dd12453de

Installation details

Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 2022.2.0~rc2-20220919.75d087a2b75a with build-id 463f1a57b82041a6c6b6441f0cbc26c8ad93091e

Cluster size: 4 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0b6ff8cdcbe0cb88a (aws: us-east-1)

Test: longevity-lwt-500G-3d-test
Test id: 4a98679f-02ad-4c38-a717-833dd12453de
Test name: enterprise-2022.2/longevity/longevity-lwt-500G-3d-test
Test config file(s):

Issue description

Management backup nemeses failed due to "no space left on device" on node1 (10.12.1.240)

Command: 'sudo sctool backup -c e813a803-c47d-4640-8dd1-995e2d6dea4d --keyspace drop_table_during_repair_ks_0,lwt_builtin_function_test,drop_table_during_repair_ks_3,cqlstress_lwt_example,drop_table_during_repair_ks_1,drop_table_during_repair_ks_2,drop_table_during_repair_ks_7,drop_table_during_repair_ks_6,keyspace1,drop_table_during_repair_ks_5,drop_table_during_repair_ks_9,drop_table_during_repair_ks_4,drop_table_during_repair_ks_8  --location s3:manager-backup-tests-us-east-1 '

Exit code: 1

Error: create backup target: location is not accessible
 10.12.1.240: giving up after 2 attempts: agent [HTTP 500] create local tmp directory: mkdir /tmp/scylla-manager-agent-197909592: no space left on device - make sure the location is correct and credentials are set, to debug SSH to 10.12.1.240 and run "scylla-manager-agent check-location -L s3:manager-backup-tests-us-east-1 --debug"
Trace ID: uaRqNqggRx2F_d1G2FpUjw (grep in scylla-manager logs)

The /tmp filesystem is 100% used:

scyllaadm@longevity-lwt-500G-3d-2022-2-db-node-4a98679f-1:~$ scylla-manager-agent check-location -L s3:manager-backup-tests-us-east-1 --debug
{"L":"INFO","T":"2022-10-03T05:51:02.340Z","N":"rclone","M":"registered s3 provider [name=s3, region=us-east-1, chunk_size=50M, memory_pool_flush_time=5m, disable_checksum=true, no_check_bucket=true, memory_pool_use_mmap=true, provider=AWS, env_auth=true, upload_concurrency=2]"}
{"L":"INFO","T":"2022-10-03T05:51:02.340Z","N":"rclone","M":"registered gcs provider [name=gcs, memory_pool_use_mmap=true, bucket_policy_only=true, chunk_size=50M, memory_pool_flush_time=5m, allow_create_bucket=false]"}
{"L":"INFO","T":"2022-10-03T05:51:02.340Z","N":"rclone","M":"registered azure provider [name=azure, chunk_size=50M, memory_pool_flush_time=5m, use_msi=true, memory_pool_use_mmap=true, disable_checksum=true]"}
{"L":"DEBUG","T":"2022-10-03T05:51:02.341Z","N":"rclone","M":"Creating backend with remote \"s3:manager-backup-tests-us-east-1\""}
FAILED: create local tmp directory: mkdir /tmp/scylla-manager-agent-986322445: no space left on device

scyllaadm@longevity-lwt-500G-3d-2022-2-db-node-4a98679f-1:~$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        29G   29G     0 100% /

Logs:

No logs captured during this run.

Jenkins job URL

fruch commented 2 years ago

@juliayakovlev sounds like a bug for the manager: why would it use tmpfs, i.e. memory and not disk? (Scylla snapshots are on disk, not in /tmp.)

Also, any leftovers should be handled by the manager itself.

From the SCT POV, it seems we have keyspaces that should have been cleared (all the during-repair keyspaces).

fgelcer commented 2 years ago

it was in the context of https://argus.scylladb.com/test/8831dfed-1945-4e7e-a0c6-6d3f848868b4/runs?additionalRuns%5B%5D=4a98679f-02ad-4c38-a717-833dd12453de

and IIUC, 2 backup nemeses failed, and in the end, we ran into ENOSPC... my suggestion was, in these cases, to remove the snapshots from the system to avoid filling the disk up...

probably the successful backups have their snapshots deleted by default, but the failed ones don't... @ShlomiBalalis, can you please describe here the behavior (for the snapshots) in both cases of success and failure?
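A minimal sketch of that suggestion, assuming SCT's usual node helpers (`cluster.nodes` and `node.run_nodetool` are assumptions about the code here, not verified against the current tree): once a backup nemesis fails, clear all leftover snapshots on every node so they cannot accumulate towards ENOSPC.

```python
# Hypothetical cleanup for a failed backup nemesis; cluster.nodes and
# node.run_nodetool are assumed SCT-style helpers, not confirmed names.

def clear_leftover_snapshots(cluster):
    for node in cluster.nodes:
        # "nodetool clearsnapshot" with no snapshot name removes all snapshots
        # for all keyspaces on that node (newer nodetool versions may require
        # the --all flag instead), freeing whatever the failed backup left behind.
        node.run_nodetool("clearsnapshot")
```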

fruch commented 2 years ago

> it was in the context of https://argus.scylladb.com/test/8831dfed-1945-4e7e-a0c6-6d3f848868b4/runs?additionalRuns%5B%5D=4a98679f-02ad-4c38-a717-833dd12453de
>
> and IIUC, 2 backup nemeses failed, and in the end, we ran into ENOSPC... my suggestion was, in these cases, to remove the snapshots from the system to avoid filling the disk up...
>
> probably the successful backups have their snapshots deleted by default, but the failed ones don't... @ShlomiBalalis, can you please describe here the behavior (for the snapshots) in both cases of success and failure?

If we are doing snapshots into /tmp, it will always be a problem, since we have less memory than disk space.
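For what it's worth, whether /tmp is memory-backed (tmpfs) or shares the root disk can be confirmed directly on the node; the df output above shows /dev/root, i.e. the 29G root filesystem. A quick illustrative check (not from the original run):

```python
import subprocess

# Print the source device and filesystem type of the mount backing /tmp;
# "tmpfs" would mean memory-backed, "/dev/root" means it shares the root disk.
print(subprocess.check_output(
    ["findmnt", "--target", "/tmp", "--output", "SOURCE,FSTYPE"],
    text=True,
))
```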