showkeyjar opened this issue 1 year ago
cc @matthewdeng what's the best way to debug object store memory usage for xgboost on ray?
@showkeyjar I think your workload has high object store usage, which triggers spilling: https://docs.ray.io/en/master/ray-core/objects/object-spilling.html
When your disk usage keeps increasing, what's the output of ray memory --stats-only?
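For reference, the linked object-spilling docs also show how to control where spilled objects are written. A minimal sketch, assuming the default filesystem spilling backend (the directory path here is only an example, not something recommended in this thread):

```python
import json
import ray

# Spill objects that do not fit in the object store to a specific directory.
# The path is an example; pick a disk with enough free space.
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/tmp/spill"}}
        )
    }
)
```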
@showkeyjar do you have a repro for this? How much training data are you loading and how much disk space are you seeing consumed?
Are you using Ray Datasets? There's an issue with xgboost-ray we are currently working on that causes the data to be loaded in a suboptimal way, resulting in too much object store usage.
Thanks for all your advice.
@rkooo567 ray memory --stats-only cannot detect any Ray instance:
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting the --address flag or RAY_ADDRESS environment variable.
@matthewdeng 1,395,642 rows of training data, 20 boosting rounds, 15 GB of disk usage.
The training code is here: https://github.com/showkeyjar/mymodel/blob/main/train_model_ray.py
@Yard1 No, I use a pandas DataFrame and convert it to a Ray Dataset.
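For context, a minimal sketch of that conversion path as described above (the file name and label column are placeholders, not taken from the linked training script):

```python
import pandas as pd
import ray
from xgboost_ray import RayDMatrix

df = pd.read_csv("train.csv")            # placeholder input file
ds = ray.data.from_pandas(df)            # pandas DataFrame -> Ray Dataset
dmatrix = RayDMatrix(ds, label="label")  # matrix later passed to xgboost_ray.train()
```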
I alleviated the problem by using a shell for-loop script to call the Python training code, but I still don't know why a Python for-loop causes the disk usage to increase.
I'm sure the disk increase happens in the /tmp/ray/ directory.
Ray is using a mechanism called object spilling, where objects that cannot fit into the in-memory object store are instead put on disk. Can you run the ray memory --stats-only command in a separate terminal window while the xgboost-ray training is in progress?
Also, are you running this on a single machine, or multiple machines?
@Yard1
======== Object references status: 2023-01-16 15:19:13.215008 ========
--- Aggregate object store stats across all nodes ---
Plasma memory usage 67279 MiB, 40 objects, 62.69% full, 43.41% needed
Objects consumed by Ray tasks: 67281 MiB.
I'm disappointed this issue has not been solved yet, but I found some new information:
Hope it helps.
Based on your output ^, it looks like spilling doesn't actually happen. I guess most of the disk usage is from Ray logs?
Is it correct that the disk usage is mostly from /tmp/ray/session_latest/logs/?
Is it correct that the disk usage is mostly from /tmp/ray/session_latest/logs/?
Yes, it creates a link from /tmp/ray/session_latest/ to /tmp/ray/session_{datetime}_XXXX_XXXX/.
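For illustration, the symlink can be resolved to the concrete session directory from Python; a small sketch, assuming the default /tmp/ray location on Linux:

```python
import os

# Resolve /tmp/ray/session_latest to the session_{datetime}_XXXX_XXXX directory it points to.
print(os.path.realpath("/tmp/ray/session_latest"))
```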
I found one problem:
If I use xgboost_ray to train multiple models on Linux, the "/tmp/ray/" directory keeps growing, and if the training data is large, the system disk space runs out quickly.
I tried to fix it with "rm -rf /tmp/ray/", but then the training process got stuck in an endless loop, waiting for a Ray actor forever.
I guess "import xgboost_ray" may do some initialization for Ray, so I added "import importlib" and tried "importlib.reload('xgboost_ray')", but it did not work.
Please check this issue.
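A minimal sketch of the multi-model loop described in this thread, with an explicit ray.shutdown() between runs so each run's Ray session is torn down; this is not confirmed here as a fix, and load_training_data() plus the training parameters are placeholders:

```python
import ray
from xgboost_ray import RayDMatrix, RayParams, train

for model_id in range(3):
    # load_training_data() is a hypothetical helper returning a pandas DataFrame.
    dmatrix = RayDMatrix(load_training_data(model_id), label="label")
    booster = train(
        {"objective": "reg:squarederror"},   # example parameters
        dmatrix,
        num_boost_round=20,
        ray_params=RayParams(num_actors=2),  # example actor count
    )
    booster.save_model(f"model_{model_id}.xgb")
    ray.shutdown()  # tear down the Ray session used for this run
```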