hailanyi / VirConv

Virtual Sparse Convolution for Multimodal 3D Object Detection
https://arxiv.org/abs/2303.02314
Apache License 2.0
276 stars 39 forks

Failed in multi-GPU training #56

Open EvW1998 opened 10 months ago

EvW1998 commented 10 months ago

I can train with a single GPU, but when I try to train on multiple GPUs by running dist_train.sh, the program stops without reporting anything.

My dist_train.sh looks like this:

CUDA_VISIBLE_DEVICES=0,1 nohup python3 -m torch.distributed.launch --nproc_per_node=2 --master_port 29501 train.py --launcher pytorch > log.txt&

log.txt shows the following:

/usr/local/miniconda3/envs/pcdt/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torch.distributed.run. Note that --use_env is set by default in torch.distributed.run. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

warnings.warn(

WARNING:torch.distributed.run:***** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
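Side note: following the FutureWarning above, a launch line based on torch.distributed.run instead of the deprecated torch.distributed.launch would presumably look like this (same flags and port as mine, untested):

CUDA_VISIBLE_DEVICES=0,1 nohup python3 -m torch.distributed.run --nproc_per_node=2 --master_port 29501 train.py --launcher pytorch > log.txt &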


It feels like something is wrong with the distributed setup. Any ideas? Thanks
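In case it matters: the warning also says that --use_env is set by default in torch.distributed.run, so a script that still expects a --local_rank argument has to read os.environ['LOCAL_RANK'] instead. A minimal sketch of that fallback, assuming train.py parses --local_rank with argparse (I have not checked how this repo's train.py actually handles it):

import os
import argparse

parser = argparse.ArgumentParser()
# the legacy torch.distributed.launch passes --local_rank as an argument;
# torch.distributed.run exports it as the LOCAL_RANK environment variable instead
parser.add_argument('--local_rank', type=int, default=None)
args, _ = parser.parse_known_args()

# fall back to the environment variable when --local_rank was not passed
local_rank = args.local_rank if args.local_rank is not None else int(os.environ.get('LOCAL_RANK', 0))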

vehxianfish commented 7 months ago

Hi, @EvW1998. I ran into the same issue. Did you solve it?