
Caching data to ram, None or disk - OOM issues #28

Closed valentinitnelav closed 1 year ago

valentinitnelav commented 2 years ago

This is to document, for our own reference, what happens on the cluster side when training a model with the different --cache options.

In train.py there is a --cache option, which can take ram, None, or disk. The help for this argument says --cache images in "ram" (default) or "disk", but there are in fact three behaviours, documented below:
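In short, the three variants differ only in this one flag; all other train.py arguments stay the same as in the job script further down:

python train.py ... --cache ram     # cache decoded images in CPU RAM
python train.py ... --cache disk    # cache decoded images as *.npy files next to the originals
python train.py ...                 # omit the flag: no image caching (the None case)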

--cache ram

When I tried to use ram with a nano model, it ran out of memory (it is unclear whether this was due to GPU RAM, 11 GB per GPU, or the total RAM of the requested node, which is 516 GB and should be plenty). The final error message from the *.err file is:

# slurmstepd: error: Detected 6 oom-kill event(s) in StepId=3159683.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

For a full example with session and error messages, see 3159683.err file (in /home/vs66tavy/Nextcloud/yolo-runs-clara/scripts_backups/PAI/cache_options_err_files).
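To narrow down which limit was hit, the job's memory use can be checked afterwards with Slurm accounting and, during a run, with nvidia-smi. A cgroup oom-kill like the one above generally points at host RAM rather than GPU memory (a GPU out-of-memory would normally surface as a CUDA "out of memory" error in the Python traceback). A sketch, assuming Slurm accounting is enabled; seff is a contrib script and may not be installed on every cluster:

# Peak host-memory use of the failed job, per step:
sacct -j 3159683 --format=JobID,State,Elapsed,ReqMem,MaxRSS

# Quick per-job efficiency summary (if seff is available):
seff 3159683

# GPU memory during a run, sampled every 5 s on the compute node:
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5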

The job script had this call to train.py:

python -m torch.distributed.launch --nproc_per_node 8 train.py \
--weights ~/PAI/detectors/yolov5/weights_v6_1/yolov5s6.pt \
--data ~/PAI/scripts/config_yolov5.yaml \
--hyp hyp.scratch-med.yaml \
--epochs 300 \
--batch-size 64 \
--imgsz 1280 \
--cache ram \
--workers 6 \
--name p1_w-s6_hyp-med_8b_300e

Do not specify --cache

If I do not specify the --cache option in train.py, then I do not get an OOM error. For example, see the 3144159.err file (in /home/vs66tavy/Nextcloud/yolo-runs-clara/scripts_backups/PAI/cache_options_err_files).

However, two labels.cache files are still created: one in the train folder and one in the val folder. This can interfere with YOLOv7, since it also creates labels.cache files in the same places. The YOLOv7 readme recommends deleting any such cache files left over from runs of a different YOLO version, see https://github.com/WongKinYiu/yolov7#training
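Before switching between YOLO versions, the stale caches can be removed along these lines (the dataset root below is an assumption; adjust it to wherever the train and val folders actually live):

# Remove label caches left over from a previous YOLOv5/YOLOv7 run:
find ~/PAI/data -type f -name "labels.cache" -delete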

--cache disk

In this case, YOLOv5 creates a new *.npy file for each image in the train & val folders. This substantially inflates the storage footprint of the dataset, but there is enough space on the cluster for now. For example, see the 3109362.err file (in /home/vs66tavy/Nextcloud/yolo-runs-clara/scripts_backups/PAI/cache_options_err_files).
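To get an idea of how much extra storage the disk cache takes, the generated files can be counted and sized roughly like this (paths are assumptions):

# Count the cached arrays in the train and val folders:
find ~/PAI/data/train ~/PAI/data/val -type f -name "*.npy" | wc -l
# Total size of the caches (GNU du):
find ~/PAI/data/train ~/PAI/data/val -type f -name "*.npy" -print0 | du -ch --files0-from=- | tail -n 1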

It is unclear whether this truly speeds up training; comparing per-epoch wall times between a --cache disk run and an uncached run would be one way to check.