ShlomiBalalis opened this issue 2 years ago (Open)
Why does the manager use /tmp? Do we have any space requirement for /tmp?
@karol-kokoszka can you please look into this?
Hey @roydahan. Scylla-manager tries to create a directory and an empty file in it as part of a permissions check. It was introduced to address https://github.com/scylladb/scylla-manager-enterprise/issues/1509.
Why is it the /tmp/ directory? I don't know. Most likely Rclone (the library used to transfer files between storages) needs access there. If you need a fully detailed answer, I can dig deeper.
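For context, the check itself is trivial; in Python terms it amounts to something like the sketch below (the real implementation is Go code inside scylla-manager, and the probe directory name here is made up for illustration):

```python
import os
import tempfile

def check_temp_writable(base="/tmp"):
    """Create a directory under `base` and an empty file in it, then clean up.

    Raises OSError (EACCES, ENOSPC, ...) if the filesystem is not writable or is full.
    """
    probe_dir = os.path.join(base, "manager-permission-probe")  # hypothetical name
    os.makedirs(probe_dir, exist_ok=True)
    try:
        with tempfile.NamedTemporaryFile(dir=probe_dir):
            pass  # the empty file is created and removed when the context exits
    finally:
        os.rmdir(probe_dir)

check_temp_writable()
```

So with / (and therefore /tmp) at 100%, a probe like this is exactly the step that fails.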
This is the edge case:
```
scyllaadm@longevity-lwt-500G-3d-2022-2-db-node-4a98679f-1:~$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        29G   29G     0 100% /
```
^^^ we should never reach the point where the disk is utilized at 100%.
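If we want the test framework to fail fast instead of running into this, a pre-check along these lines could run before backup-heavy nemeses (a sketch; the threshold and paths are assumptions):

```python
import shutil

def assert_free_space(path, min_free_gb=5):
    """Raise early if the filesystem backing `path` has less than `min_free_gb` GiB free."""
    free_gb = shutil.disk_usage(path).free / 2**30
    if free_gb < min_free_gb:
        raise RuntimeError(f"only {free_gb:.1f} GiB free on {path}, need {min_free_gb} GiB")

assert_free_space("/")     # the root FS that reached 100% here
assert_free_space("/tmp")  # where the manager's permission probe lands
```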
I agree it's an edge case. I thought we filled up /tmp with snapshots or manager-related files.
@ShlomiBalalis do we know what filled up the filesystem? How much free space did we have at the beginning of the test, and at what point was it filled up? (You should have this information in the monitor's OS dashboard.)
@ShlomiBalalis? Let's understand how / got to 100% first.
@ShlomiBalalis Is this test available somewhere in Jenkins so that it can be re-run? The only files scylla-manager creates are the ones created when the manager calls Scylla to take the snapshot. The only way for scylla-manager to fail here is if snapshots are not cleaned up, but there is a Move command to copy them to the backup bucket. Is there a chance to check whether there are any leftovers after the backup?
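To check for leftovers on a live node, nodetool listsnapshots gives a summary; for sizes on disk, a quick sketch like the following (assuming the default /var/lib/scylla/data layout) also works:

```python
import os

DATA_DIR = "/var/lib/scylla/data"  # default data dir; adjust if relocated

def snapshot_leftovers(data_dir=DATA_DIR):
    """Yield (snapshot_dir, size_in_bytes) for every snapshot tag left on disk."""
    for root, dirs, files in os.walk(data_dir):
        # snapshot layout: <keyspace>/<table-UUID>/snapshots/<tag>/<sstable files>
        if os.path.basename(os.path.dirname(root)) == "snapshots":
            size = sum(os.path.getsize(os.path.join(root, f)) for f in files)
            yield root, size

for path, size in sorted(snapshot_leftovers(), key=lambda x: x[1], reverse=True):
    print(f"{size / 2**20:10.1f} MiB  {path}")
```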
We had this issue again in https://argus.scylladb.com/test/8d7705a3-b3b8-4abd-b0e2-a0e6938c1e15/runs?additionalRuns[]=601ae9bb-6284-44b4-972e-56c816587eea.
Something is filling up the root FS; it is hard to tell what exactly. It happened after 3 days of running and 91 nemeses...
Restoring monitor to try and check when the root FS was filled and where. From nemesis failures, I can see it happened on 4 different nodes.
The newest one is longevity-mv-si-4d-2024-1-db-node-601ae9bb-14, which had only 2 nemeses on it: disrupt_repair_streaming_err and disrupt_mgmt_restore.
Trying to correlate the nemeses with free bytes in the root FS (mountpoint="/"), focusing on node-1, I saw that the space drops (and is never reclaimed) in 2 scenarios:
Another example from node-3, same behavior. The major drop there is from "rebuild_streaming_error", which includes a reboot of that node (node-3).
Another one from node-14, with drops during repair_streaming_err and mgmt_repair.
I tried correlating it with almost every metric I could find in node_exporter (swap, files, memory, xfs, disk) and couldn't find anything straightforward...
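For what it's worth, pulling the raw samples out of the monitor's Prometheus and lining them up with nemesis start/end times is usually easier than eyeballing panels. A sketch (the monitor address, instance label, and time window are placeholders; node_filesystem_avail_bytes is the stock node_exporter metric):

```python
import requests

PROM = "http://monitor-host:9090"  # placeholder monitor address

def root_fs_avail(instance, start, end, step="5m"):
    """Return (timestamp, avail_bytes) samples for the root FS of one node."""
    query = f'node_filesystem_avail_bytes{{instance="{instance}", mountpoint="/"}}'
    r = requests.get(f"{PROM}/api/v1/query_range",
                     params={"query": query, "start": start, "end": end, "step": step})
    r.raise_for_status()
    result = r.json()["data"]["result"]
    return result[0]["values"] if result else []

# Compare the drops against nemesis timestamps from the SCT event log.
for ts, avail in root_fs_avail("node-1:9100", start=1696000000, end=1696260000):
    print(ts, f"{float(avail) / 2**30:.1f} GiB free")
```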
One possibility is logging: maybe one of the logs (or all of them) doesn't have a rotation policy, or maybe the rotation threshold is too high. Does the manager_agent log have a rotation policy?
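Next time the node is still reachable, a du-style scan would settle the logging question quickly; a rough sketch:

```python
import os

def dir_sizes(top, min_mb=50):
    """Print directories under `top` whose direct file contents exceed `min_mb` MiB."""
    totals = {}
    for root, dirs, files in os.walk(top, onerror=lambda e: None):
        size = 0
        for name in files:
            try:
                size += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file may have vanished (rotated log, temp file)
        totals[root] = size
    for path, size in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        if size >= min_mb * 2**20:
            print(f"{size / 2**20:10.1f} MiB  {path}")

dir_sizes("/var/log")  # journald, scylla, scylla-manager-agent logs
dir_sizes("/tmp")      # the manager's permission-probe location
```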
Installation details

Manager:
Client version: 3.0.0-0.20220523.5501e5d7f53
Server version: 3.0.0-0.20220523.5501e5d7f53
Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 2022.2.0~rc2-20220919.75d087a2b75a with build-id 463f1a57b82041a6c6b6441f0cbc26c8ad93091e
Cluster size: 4 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0b6ff8cdcbe0cb88a (aws: us-east-1)
Test: longevity-lwt-500G-3d-test
Test id: 4a98679f-02ad-4c38-a717-833dd12453de
Test name: enterprise-2022.2/longevity/longevity-lwt-500G-3d-test
Test config file(s):

Issue description
In this run, there were three backup tasks, with the first one being successful and the other two failing instantly upon creation. The first, successful one:
The other two attempts, however, failed when attempting to create a directory in the /tmp of node 1. From node 1:
There was no other nemesis on node 1, so I can't imagine any other reason why the directory would be occupied. What I don't understand is: why does the manager need to create a temp directory, seeing that the snapshots are saved in /var/lib/scylla/data?
Sadly the machines are long dead, and there were nearly no logs saved from the run other than our own logs. The manager server and agent logs, however, should be interleaved into the sct.log.
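Since sct.log is the only surviving artifact, filtering the interleaved manager lines out of it for a closer look is easy enough (a sketch; the match substring is an assumption about how SCT prefixes the remote journal lines):

```python
# Extract scylla-manager / agent lines from the combined SCT log.
PATTERN = "scylla-manager"  # assumed substring; matches both server and agent lines

with open("sct.log", errors="replace") as src, open("manager-only.log", "w") as dst:
    for line in src:
        if PATTERN in line:
            dst.write(line)
```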
```
$ hydra investigate show-monitor 4a98679f-02ad-4c38-a717-833dd12453de
$ hydra investigate show-logs 4a98679f-02ad-4c38-a717-833dd12453de
```
Logs:
https://cloudius-jenkins-test.s3.amazonaws.com/4a98679f-02ad-4c38-a717-833dd12453de/20221002_145214/sct-runner.tar.gz
Jenkins job URL
SCT issue for reference: https://github.com/scylladb/scylla-cluster-tests/issues/5344