Confused: set the batch_size=1, still out of memory

open-mmlab / OpenPCDet

OpenPCDet Toolbox for LiDAR-based 3D Object Detection.

Apache License 2.0


Confused: set the batch_size=1, still out of memory #140

Closed liyang0522 closed 4 years ago

liyang0522 commented 4 years ago

I have a 2080 Ti (card A) with 11 GB of memory. After editing the PV-RCNN code and training on one GPU with batch_size set to 1, I got an out-of-memory error. So I added another 2080 Ti (card B), expecting to have more memory available. I ran 'python train.py --batch_size 1', but the out-of-memory error still occurs, and the utilization of the new card B is 0%. Does that mean each graphics card handles a different batch rather than sharing the same batch? If one card can't handle the model with batch_size=1, will adding more cards not solve the problem? How can I solve this? Thank you!
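
On the multi-GPU question: in ordinary data-parallel training the batch is split by sample, so a single sample is never spread across two cards. The sketch below is a generic illustration of that splitting, not OpenPCDet's actual code; `split_batch` is a hypothetical helper and the even-split rule is an assumption.

```python
def split_batch(batch_size, num_gpus):
    """Return how many samples each GPU would receive if the batch
    is divided as evenly as possible (illustrative assumption)."""
    base, extra = divmod(batch_size, num_gpus)
    return [base + (1 if i < extra else 0) for i in range(num_gpus)]


for batch_size, num_gpus in [(1, 2), (2, 2), (4, 2)]:
    print(batch_size, num_gpus, split_batch(batch_size, num_gpus))
```

With batch_size=1 and two GPUs, one card gets the whole sample and the other gets nothing, so a second card does not reduce the memory needed per sample.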

Gus-Guo commented 4 years ago

Hi, could you take a screenshot of the errors and post it here?

liyang0522 commented 4 years ago

First, I have 2 cards: card 0 and card 1 (the code running on card 1 is someone else's). Card 0 is available now.

When I run 'python train.py --batch_size 1' on card 0, I get the out-of-memory error.

Card 1 doesn't allocate any memory for my code.

I changed grid_size from 6 to 5 and ran 'python train.py --batch_size' again. The code runs successfully and card 0 uses about 10 GB, but the results are worse. I want to keep the same parameters as the original, but once I make any modification to the original code, I get the out-of-memory error. Card 1 was put in today, 'stolen' from another machine, but two cards don't seem to solve my problem. What should I do?

sshaoshuai commented 4 years ago

Have you modified the code? Please reset it to the original and try running the following command (make sure you use spconv 1.0 specifically):

CUDA_VISIBLE_DEVICES=0 python train.py --cfg_file cfgs/kitti_models/pv_rcnn.yaml --batch_size 2 --epochs 50

If 'CUDA out of memory' still occurs, maybe you could try to rebuild your environment...
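
A note on the CUDA_VISIBLE_DEVICES=0 prefix in the command above: it restricts the process to the listed physical GPUs and renumbers them from 0 inside the process. The variable has to be set before CUDA is initialized. Below is a minimal sketch of that renumbering, assuming the value is a comma-separated list of integer indices (UUID-style values are not handled); `visible_device_map` is a hypothetical helper, not part of any library.

```python
def visible_device_map(value):
    """Map logical CUDA device index -> physical GPU index for a
    comma-separated CUDA_VISIBLE_DEVICES value of integer indices."""
    physical_ids = [int(token) for token in value.split(",") if token.strip()]
    return dict(enumerate(physical_ids))


print(visible_device_map("0"))    # {0: 0}
print(visible_device_map("1,0"))  # {0: 1, 1: 0}
```

So on a machine where card 1 is busy with someone else's job, CUDA_VISIBLE_DEVICES=0 keeps the training process on card 0 only.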

liyang0522 commented 4 years ago

I just created another new project with no modifications to the original code. The spconv version is 1.0. I ran CUDA_VISIBLE_DEVICES=0 python train.py --cfg_file cfgs/kitti_models/pv_rcnn.yaml --batch_size 2 --epochs 50, and CUDA out of memory still occurs: [screenshot: Annotation 2020-07-10 095346]

So I changed batch_size from 2 to 1, and it runs successfully, consuming about 10 GB of memory.

[screenshot: Annotation 2020-07-10 095442]

So the original code needs at least 10 GB; on my 2080 Ti with 11 GB of memory, batch_size can only be set to 1. But I want to modify the original code, and once I do, CUDA OOM occurs...

sshaoshuai commented 4 years ago

I have tested PV-RCNN on multiple machines. It generally takes about 4.5 GB of GPU memory with batch_size=1, and a 2080 Ti is enough to train pv_rcnn with batch_size=2. Have you tried creating a new environment?

liyang0522 commented 4 years ago

Finally, it works! Thank you for helping me!
I created a new environment: a new conda environment and a fresh spconv 1.0. With batch_size=1 it now takes about 5 GB, and it can also run with batch_size=2. I still don't know why a new environment solves this issue, though.
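
For anyone comparing memory use before and after rebuilding the environment, per-card usage can be read from nvidia-smi. This is a rough sketch: it relies on the `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi, the helper names are made up for illustration, and the sample numbers are invented, not measurements from this thread. It assumes every field is a plain integer in MiB.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse lines of 'index, used, total' (MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append({"index": index, "used_mib": used, "total_mib": total})
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    # Invented sample text, used so the parser can be tried without a GPU.
    sample = "0, 10240, 11019\n1, 3, 11019\n"
    for gpu in parse_gpu_memory(sample):
        print(f"GPU {gpu['index']}: {gpu['used_mib']} / {gpu['total_mib']} MiB")
```

Calling `query_gpu_memory()` while training is running gives the real numbers. Note that nvidia-smi reports memory reserved by the process, which for PyTorch includes its caching allocator and can be higher than what the tensors actually occupy.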

sshaoshuai commented 4 years ago

Congratulations!

Hub-Tian commented 4 years ago

I have a similar problem: a batch size of 1 consumes about 8 GB of memory. I rebuilt the environment, but it didn't help. I tried different devices (Titan X and 2080 Ti), and both take about 8 GB. I use the original PV-RCNN code, and spconv is on v1.0. Any idea how to solve this?

liyang0522 commented 4 years ago

What I did was delete spconv and the old PV-RCNN environment, then create a brand-new conda environment instead of rebuilding the old one. You could give that a try.

Hub-Tian commented 4 years ago

@liyang0522 Thanks very much! It worked! I recreated the conda environment and rebuilt OpenPCDet. Now it takes about 5 GB of memory on a Titan X GPU.

OrangeSodahub commented 2 years ago

Hello, I have run into this problem now. How can I find out which version of spconv I'm using? Also, my card has only 3.8 GB of memory; can PV-RCNN work on it? If not, which model is suitable? Is PointRCNN OK?
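
On the spconv version question, one possible way is to ask Python's package metadata, as sketched below. This assumes Python 3.8+ (for `importlib.metadata`) and that spconv was installed as a pip-visible package named `spconv`; newer prebuilt releases may be published under CUDA-specific names, so the candidate list is an assumption to adjust. `installed_version` is a hypothetical helper.

```python
from importlib import metadata


def installed_version(candidates=("spconv",)):
    """Return (name, version) for the first installed distribution
    among `candidates`, or None if none of them is installed."""
    for name in candidates:
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


print(installed_version())
```

Running `pip show spconv` inside the same conda environment is an alternative on older Python versions.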