Open SongPool opened 10 months ago
Yes, I'm having the same issue. Not sure much can be done. I'm getting access to a local compute cluster today to try and rectify it.
Hi, what command did you use to run the distributed training? It should be:
python -m torch.distributed.launch --nproc_per_node={NUM_GPUs} train.py ...
At the same time, you should try to reduce the number of points and the chunk size. The total number of points trained equals NUM_GPUs * args.num_pts, which means num_pts can be reduced to 1/NUM_GPUs of the total number of points desired.
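For example (hypothetical numbers, following the relation above): if you want 256 total points trained across 4 GPUs, each process only needs a quarter of that:

```python
# Hypothetical sizing, following total_pts = NUM_GPUs * args.num_pts.
num_gpus = 4
total_pts_desired = 256               # total points across all GPUs
num_pts_per_process = total_pts_desired // num_gpus
print(num_pts_per_process)            # pass this as --num_pts
```

So with 4 GPUs you would pass --num_pts 64 rather than 256, which also cuts per-GPU memory accordingly.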
I was having the same issue, and noticed only one GPU was getting used before the OOM. What fixed it for me was adding this line in train.py:
if args.distributed:
    args.local_rank = int(os.environ['LOCAL_RANK'])  # Add this line!
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    synchronize()
    ...
This is needed because in config.py, args.local_rank defaults to 0, and that rank is what trainer.py uses, so without the fix every process ends up on GPU 0. This is also explained in the torch.distributed.launch logs:
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
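A minimal, self-contained sketch of the env-var pattern that log message asks for (the helper name get_local_rank is my own, not from the repo; it falls back to 0 for single-process runs where the launcher sets no env vars):

```python
import os

def get_local_rank() -> int:
    # Newer launchers (torch.distributed.launch with --use_env, torchrun)
    # export LOCAL_RANK per process instead of passing a --local_rank
    # argument; default to 0 when running without a launcher.
    return int(os.environ.get("LOCAL_RANK", "0"))
```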
Overall, I run the following to start training on multiple GPUs:
python -m torch.distributed.launch --nproc_per_node=${WORLD_SIZE} train.py \
--config configs/default.txt --data_dir $SEQUENCE_DIR --save_dir $OUTPUT_DIR \
--distributed 1 --num_pts 64
I also set:
--num_pts 64
since I am training on 4 GPUs (I get
error: Detected 5 oom_kill events in StepId=1644314.batch. Some of the step tasks have been OOM Killed.
errors otherwise).
This computer has 10 GPUs and still runs out of memory after enabling distributed training. I have set args.distributed=1, and:
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '1'
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '7356'
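For what it's worth, WORLD_SIZE='1' tells PyTorch that only one process participates, so the other nine GPUs sit idle and that single process takes the whole load. A sketch (values are illustrative, not from the repo) of what the launcher itself would export for each worker on a 10-GPU machine:

```python
num_gpus = 10

# torch.distributed.launch / torchrun exports these per worker process:
# RANK and LOCAL_RANK differ for each process, while WORLD_SIZE is the
# total process count -- it must not stay at '1' on a multi-GPU run.
worker_envs = [
    {
        "RANK": str(i),
        "LOCAL_RANK": str(i),
        "WORLD_SIZE": str(num_gpus),
        "MASTER_ADDR": "localhost",
        "MASTER_PORT": "7356",
    }
    for i in range(num_gpus)
]
```

Letting the launcher set these (rather than hard-coding them before training) avoids exactly this mismatch.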