Open SongPool opened 10 months ago
Yes, I'm having the same issue. Not sure much can be done. I'm getting access to a local compute cluster today to try and rectify it.
Hi, what command did you use to run the distributed training? It should be:
python -m torch.distributed.launch --nproc_per_node={NUM_GPUs} train.py ...
At the same time, you should try to reduce the number of points and the chunk size. The total number of points trained equals NUM_GPUs * args.num_pts, which means num_pts can be reduced to 1/NUM_GPUs of the total number of points desired.
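For example (hypothetical numbers, following the relation above): if you want 256 total points trained across 4 GPUs, each process only needs a quarter of that:

```python
# Hypothetical sizing, following total_pts = NUM_GPUs * args.num_pts.
num_gpus = 4
total_pts_desired = 256               # total points across all GPUs
num_pts_per_process = total_pts_desired // num_gpus
print(num_pts_per_process)            # pass this as --num_pts
```

So with 4 GPUs you would pass --num_pts 64 rather than 256, which also cuts per-GPU memory accordingly.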
I was having the same issue, and noticed only one GPU was getting used before the OOM. What fixed it for me was adding this line in train.py:
if args.distributed:
    args.local_rank = int(os.environ['LOCAL_RANK'])  # Add this line!
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    synchronize()
    ...
This is needed because in config.py, args.local_rank defaults to 0, and that rank is what trainer.py uses, so without the fix every process ends up on GPU 0. This is also explained in the torch.distributed.launch logs:
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
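A minimal, self-contained sketch of the env-var pattern that log message asks for (the helper name get_local_rank is my own, not from the repo; it falls back to 0 for single-process runs where the launcher sets no env vars):

```python
import os

def get_local_rank() -> int:
    # Newer launchers (torch.distributed.launch with --use_env, torchrun)
    # export LOCAL_RANK per process instead of passing a --local_rank
    # argument; default to 0 when running without a launcher.
    return int(os.environ.get("LOCAL_RANK", "0"))
```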
Overall, I run the following to start training on multiple GPUs:
python -m torch.distributed.launch --nproc_per_node=${WORLD_SIZE} train.py \
--config configs/default.txt --data_dir $SEQUENCE_DIR --save_dir $OUTPUT_DIR \
--distributed 1 --num_pts 64
I also set:
--num_pts 64
since I am training on 4 GPUs (I get
error: Detected 5 oom_kill events in StepId=1644314.batch. Some of the step tasks have been OOM Killed.
errors otherwise).
This computer has 10 GPUs and still runs out of memory after enabling distributed training. I have set args.distributed=1, and:
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '1'
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '7356'
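For what it's worth, WORLD_SIZE='1' tells PyTorch that only one process participates, so the other nine GPUs sit idle and that single process takes the whole load. A sketch (values are illustrative, not from the repo) of what the launcher itself would export for each worker on a 10-GPU machine:

```python
num_gpus = 10

# torch.distributed.launch / torchrun exports these per worker process:
# RANK and LOCAL_RANK differ for each process, while WORLD_SIZE is the
# total process count -- it must not stay at '1' on a multi-GPU run.
worker_envs = [
    {
        "RANK": str(i),
        "LOCAL_RANK": str(i),
        "WORLD_SIZE": str(num_gpus),
        "MASTER_ADDR": "localhost",
        "MASTER_PORT": "7356",
    }
    for i in range(num_gpus)
]
```

Letting the launcher set these (rather than hard-coding them before training) avoids exactly this mismatch.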