Git-oNmE closed this issue 2 years ago
We do not support multi-GPU training in the current version. A single NVIDIA 3090 is enough for the training. Besides, if you use multiple GPUs, the batch size should be an integer multiple of the number of GPUs.
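To illustrate the constraint, here is a minimal sketch (not code taken from this repo) of how the per-GPU batch works out:

import torch

# Minimal sketch, not taken from this repo: with data-parallel training the
# per-step batch is split evenly across the visible GPUs, so it must divide cleanly.
num_gpus = max(torch.cuda.device_count(), 1)
batch_size = 16
assert batch_size % num_gpus == 0, "batch_size must be an integer multiple of the number of GPUs"
per_gpu_batch = batch_size // num_gpus
print(per_gpu_batch, "samples per GPU on", num_gpus, "GPUs")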
I changed my train_kitti_det.sh and set it like this:

python train_feats.py --batch_size 1 --epochs 100 --lr 0.001 --seed 1 --gpu 2 \
    --npoints 16384 --dataset kitti --voxel_size 0.3 --ckpt_dir /media/data3/hlf_data/HRegNet0/HRegNet/ckpt \
    --use_fps --use_weights --data_list ./data/kitti_list --runname "train_kitti_det0" --augment 0.5 \
    --root /media/data3/hlf_data/HRegNet0/HRegNet/data/kitti_list --wandb_dir /media/data3/hlf_data/HRegNet0/HRegNet/wandb_env --use_wandb
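(In case the --gpu flag in train_feats.py does not actually pin the device, a generic alternative is to restrict visibility before launching, e.g. CUDA_VISIBLE_DEVICES=2 python train_feats.py ...; that is the standard CUDA environment variable, not a flag from this repo's scripts.)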
GPU 2 is not occupied by the way. And I still got the same error message.
However, I printed a lot of debug messages and found that the program stopped when it reached this code: line 26.
But I copied the same code and ran it on my own computer (only a GTX 960M), and it ran successfully there.
That is all I have been able to do so far. How can I deal with this problem?
I'm sorry, I have no idea about this problem and cannot provide any help.
I have solved this problem. Details in my blog: https://blog.csdn.net/weixin_40286308/article/details/124870766
I ran the code sh scripts/train_kitti_det.sh and got an error message like this:

I looked it up on Google; people said it was because batch_size or num_workers is too large.
I set the parameter "batch_size" to 1, set the parameter "gpu" to "0,1,2" (each of them is a 3090 GPU), and set num_workers of the train_loader in train_feats.py to 1, but I still got the same message.
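For reference, the loader change in train_feats.py looks roughly like this (just a sketch; the dataset and the other arguments are placeholders, only batch_size and num_workers are the values I actually set):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Rough sketch of the loader change in train_feats.py; the TensorDataset and
# the other arguments are placeholders, only batch_size and num_workers are
# the values I actually set.
train_dataset = TensorDataset(torch.randn(8, 3))  # placeholder for the KITTI dataset
train_loader = DataLoader(train_dataset,
                          batch_size=1,
                          shuffle=True,
                          num_workers=1,
                          drop_last=True)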
Here is my train_kitti_det.sh by the way.
Is there any way to solve this? :)