stark-t / PAI

Pollination_Artificial_Intelligence

YOLOv7, CUDA out of memory when training #34

Closed · valentinitnelav closed this issue 2 years ago

valentinitnelav commented 2 years ago

Hi @stark-t, I am not sure how to solve this issue without further reducing the batch size, which is not ideal as it is already only 8.

I get this error:

RuntimeError: CUDA out of memory. 
Tried to allocate 50.00 MiB (GPU 4; 10.76 GiB total capacity; 9.45 GiB already allocated; 30.56 MiB free; 9.56 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
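
For reference, the max_split_size_mb hint from the error message is set through the PYTORCH_CUDA_ALLOC_CONF environment variable before the first CUDA allocation. A minimal sketch of how that could look (the 128 MB value is an arbitrary example, not something tested in this issue):

import os
# Must be set before the first CUDA tensor is allocated, e.g. at the very top of
# train.py or exported in the job script; 128 is a placeholder value, not a tested setting.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after setting the variable so the caching allocator picks it up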

I do not have the --cache-images option in the job script. FYI: this doesn't happen with the nano or small weights for YOLOv5, where a batch size of 8 per GPU was also used. The yolov7-w6.pt weights seem to be the smallest YOLOv7 weights we can use for an image size of 1280 x 1280, so that the YOLOv7 setup stays comparable with the YOLOv5 setup.

This is part of the train job:

python -m torch.distributed.launch --nproc_per_node 8 train.py \
--weights ~/PAI/detectors/yolov7/weights_v0_1/yolov7-w6.pt \
--data ~/PAI/scripts/config_yolov5.yaml \
--hyp ~/PAI/detectors/yolov7/data/hyp.scratch.p5.yaml \
--epochs 300 \
--batch-size 64 \
--img-size 1280 1280 \
--workers 6 \
--name yolov7_n6_b8_e300_hyp_p5

--batch-size 64 is the total batch size; it gets distributed across the 8 GPUs, resulting in a batch size of 8 per GPU (I double-checked this in the .err file). The job only started error-free when I set --batch-size 32, that is, a batch size of only 4 per GPU.
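
For clarity, this is the arithmetic behind the per-GPU value (a sketch of the reasoning, not the actual train.py code):

total_batch_size = 64    # --batch-size passed to train.py
world_size = 8           # --nproc_per_node, one DDP process per GPU
per_gpu_batch = total_batch_size // world_size  # = 8, which runs out of memory here
# with --batch-size 32 the same division gives 4 per GPU, which fits on the ~11 GB cards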

Initially, I thought it might be because I also had --sync-bn in there, since it is recommended here when the batch size on each GPU is small (<= 8):

"SyncBatchNorm could increase accuracy for multiple gpu training, however, it will slow down training by a significant factor. It is only available for Multiple GPU DistributedDataParallel training. It is best used when the batch-size on each GPU is small (<= 8)"

But with or without --sync-bn, I get the same error.
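
(For context, --sync-bn in the YOLOv5 / YOLOv7 training scripts converts the model's BatchNorm layers to SyncBatchNorm for DDP, roughly as in the sketch below; the exact wiring in train.py may differ.)

import torch

def maybe_convert_sync_bn(model, use_sync_bn, device):
    # Replace every BatchNorm layer with SyncBatchNorm so batch statistics are
    # synchronised across the DDP processes; only meaningful for multi-GPU training.
    if use_sync_bn:
        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
    return model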

Any ideas on how to solve this and keep a batch size of 8 per GPU for the YOLOv7 setup?

stark-t commented 2 years ago

@valentinitnelav can you check if the error also occurs when training with only one GPU?

valentinitnelav commented 2 years ago

I also tried using a single GPU, but I hit the same out-of-memory issue. I also tried the suggestion of calling torch.cuda.empty_cache() in the train.py script, placing it in different locations (just experimenting):

At the beginning of the train.py script, immediately after imports:

torch.cuda.empty_cache()

logger = logging.getLogger(__name__)

Just before the training:

# Start training
torch.cuda.empty_cache()

t0 = time.time()

At the start of each epoch, inside the epoch loop:

for epoch in range(start_epoch, epochs):  # epoch ------------------------------------------------------------------
    torch.cuda.empty_cache()  # release cached, unused memory at the start of each epoch
    model.train()
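
As an aside, torch.cuda.empty_cache() only returns cached, currently unused blocks to the driver; it cannot free memory held by live tensors, so it is not expected to fix a genuine out-of-memory situation. A small sketch for logging the two quantities (the helper name log_gpu_memory is just an example, not something in train.py):

import torch

def log_gpu_memory(tag=""):
    # memory_allocated: bytes held by live tensors; memory_reserved: bytes held by the caching allocator
    alloc_gib = torch.cuda.memory_allocated() / 1024**3
    reserved_gib = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated={alloc_gib:.2f} GiB, reserved={reserved_gib:.2f} GiB")

Calling this before and after the forward pass would show how much memory is held by live tensors versus the allocator cache, which matches the breakdown reported in the error message above.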

To avoid reducing the batch size from 8 to 4, we will use the GPUs with 32 GB and deal with the longer waiting time :) I will close this issue here. From the error message, it is clear that practically all of the GPU memory is used and almost nothing remains free.

valentinitnelav commented 2 years ago

Note for archive:

We also tried these hyperparameter settings in the scripts/yolo_custom_hyp.yaml file (reducing data augmentation):

fliplr: 0.0  # image flip left-right (probability)
mosaic: 0.0  # image mosaic (probability)
mixup: 0.0  # image mixup (probability)
copy_paste: 0.0  # image copy paste (probability)
paste_in: 0.0  # image copy paste (probability)

We still got back:

RuntimeError: CUDA out of memory. 
Tried to allocate 50.00 MiB (GPU 0; 10.76 GiB total capacity; 9.50 GiB already allocated; 4.56 MiB free; 9.60 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF