ray-project / xgboost_ray

Distributed XGBoost on Ray
Apache License 2.0

disk space usage problem #258

Open showkeyjar opened 1 year ago

showkeyjar commented 1 year ago

I found one problem:

When I use xgboost_ray to train multiple models on Linux, the size of the `/tmp/ray/` directory keeps growing, and if the training data is large, the system runs out of disk space quickly.

I tried to fix it with `rm -rf /tmp/ray/`, but then the training process got stuck in an endless loop, waiting for a Ray actor forever.

I guess `import xgboost_ray` may do some initialization for Ray, so I added `import importlib` and tried `importlib.reload('xgboost_ray')`, but it did not work.

Please check this issue.
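For reference, the supported way to reset Ray state in-process is `ray.shutdown()` followed by a fresh `ray.init()`, rather than reloading the module. A minimal sketch:

```python
import ray

# Tear down the current Ray session (a no-op if Ray is not running),
# then start a fresh one. Note that a new session directory is created
# under /tmp/ray/ and the old one is NOT deleted from disk.
ray.shutdown()
ray.init()
```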

rkooo567 commented 1 year ago

cc @matthewdeng what's the best way to debug object store memory usage for XGBoost on Ray?

@showkeyjar I think your workload has high object store usage, which triggers spilling: https://docs.ray.io/en/master/ray-core/objects/object-spilling.html

When your disk usage keeps increasing, what's the output of `ray memory --stats-only`?

matthewdeng commented 1 year ago

@showkeyjar do you have a repro for this? How much training data are you loading and how much disk space are you seeing consumed?

Yard1 commented 1 year ago

Are you using Ray Datasets? There's an issue with xgboost-ray that we are currently working on which causes the data to be loaded in a suboptimal manner, resulting in too much object store usage.

showkeyjar commented 1 year ago

Thanks for all your advice.

@rkooo567 `ray memory --stats-only` cannot detect any Ray instance: `ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting the --address flag or RAY_ADDRESS environment variable.`

@matthewdeng 1,395,642 rows of training data, 20 boost rounds, about 15 GB of disk usage.

The training code is here: https://github.com/showkeyjar/mymodel/blob/main/train_model_ray.py

@Yard1 No, I use a pandas DataFrame converted to a Ray Dataset.
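For context, that conversion looks roughly like the sketch below; the column names are placeholders, not the actual schema:

```python
import pandas as pd
import ray

# Convert an in-memory pandas DataFrame into a Ray Dataset.
df = pd.DataFrame({"feature": [1.0, 2.0, 3.0], "label": [0, 1, 0]})
ds = ray.data.from_pandas(df)
print(ds.count())  # 3
```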

showkeyjar commented 1 year ago

I alleviated the problem by using a shell for-loop script to call the Python training code, but I still don't know why a Python for loop causes the disk usage to increase.

And I'm sure that the disk increase happens in the `/tmp/ray/` directory.
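For reference, the shell for-loop workaround amounts to launching each training run in its own OS process, so every run starts and tears down its own Ray session. A rough Python equivalent, where the per-model argument is hypothetical:

```python
import subprocess

# Run the training script once per model, each in a fresh process,
# mirroring the shell for-loop workaround described above.
for model_id in range(10):
    subprocess.run(
        ["python", "train_model_ray.py", str(model_id)],
        check=True,
    )
```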

Yard1 commented 1 year ago

Ray uses a mechanism called object spilling, where objects that cannot fit into the in-memory object store are put on disk instead. Can you run the `ray memory --stats-only` command in a separate terminal window while the xgboost-ray training is in progress?
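For reference, the spill directory is configurable at init time. A minimal sketch based on the object-spilling docs linked above, with `/tmp/spill` as an example path:

```python
import json

import ray

# Direct spilled objects to a specific directory (e.g. on a larger disk).
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/tmp/spill"}}
        )
    }
)
```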

Also, are you running this on a single machine, or multiple machines?

showkeyjar commented 1 year ago

@Yard1

======== Object references status: 2023-01-16 15:19:13.215008 ========
--- Aggregate object store stats across all nodes ---
Plasma memory usage 67279 MiB, 40 objects, 62.69% full, 43.41% needed
Objects consumed by Ray tasks: 67281 MiB.

showkeyjar commented 1 year ago

I'm disappointed that this issue has not been solved yet, but I found some new information:

  1. Ray stores its temp files in a `/tmp/ray/session_{datetime}_XXXX_XXXX/` directory, so if we can get the Ray session dir, we can remove the temp files once xgboost_ray training finishes (see the sketch after this list).
  2. Ray accepts a `_temp_dir` argument on init, but it still has a bug; once that is fixed, we could point each training run at a separate temp dir.
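A minimal sketch combining both points, assuming `_temp_dir` behaves as intended; `train_fn` is a hypothetical stand-in for the actual training call:

```python
import shutil
import tempfile

import ray

def train_with_cleanup(train_fn):
    # Point Ray at a dedicated temp dir so the session files are easy to find.
    temp_dir = tempfile.mkdtemp(prefix="ray_session_")
    ray.init(_temp_dir=temp_dir)
    try:
        return train_fn()
    finally:
        ray.shutdown()
        # Reclaim the disk space once training has finished.
        shutil.rmtree(temp_dir, ignore_errors=True)
```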

Hope this helps.

rkooo567 commented 1 year ago

Based on your output above, it looks like spilling isn't actually happening. I guess most of the disk usage is from Ray logs?

rkooo567 commented 1 year ago

Is it correct that the disk usage is mostly from `/tmp/ray/session_latest/logs/`?

showkeyjar commented 1 year ago

> Is it correct that the disk usage is mostly from `/tmp/ray/session_latest/logs/`?

Yes. `/tmp/ray/session_latest/` is a symlink to `/tmp/ray/session_{datetime}_XXXX_XXXX/`.
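Since the usage turned out to be logs, one mitigation is Ray's log rotation, which (per Ray's logging docs) is controlled by environment variables set before the node starts; the values below are examples, not recommendations:

```python
import os

# Cap each Ray log file at 100 MiB and keep at most 2 rotated backups.
os.environ["RAY_ROTATION_MAX_BYTES"] = str(100 * 1024 * 1024)
os.environ["RAY_ROTATION_BACKUP_COUNT"] = "2"

import ray

ray.init()
```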