stark-t / PAI

Pollination_Artificial_Intelligence

ScaledYOLOv4 - yolov4-p6.pt pretrained weights at 1280 x 1280 overload the GPU RAM #39

Closed valentinitnelav closed 2 years ago

valentinitnelav commented 2 years ago

Hi @stark-t ,

FYI: with the help of the cluster support team, I managed to get ScaledYOLOv4 running by setting up the proper environment. However, I got OOM (out-of-memory) errors with both a batch size of 8 and a batch size of 4. I think we will have to run these on the bigger GPU models as well.

stark-t commented 2 years ago

Do you know if there is a tiny yolov4 version available?

valentinitnelav commented 2 years ago

I am not sure if there is one for the 1280 x 1280 resolution. I see this https://github.com/WongKinYiu/ScaledYOLOv4/tree/yolov4-tiny , but I am not sure what resolution it uses.

The available weights for ScaledYOLOv4 are: "{YOLOv4-P5, YOLOv4-P6, YOLOv4-P7} use input resolution {896, 1280, 1536} for training respectively" as per https://github.com/WongKinYiu/ScaledYOLOv4#training

What about the idea of just using all models at 896 x 896 resolution? I can look at the histograms of the widths and heights of our images to see their distributions and averages. But can we train YOLOv5 on 896, or are the images adjusted to fit 640 or 1280?
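
For what it's worth, as far as I can tell YOLOv5 resizes the training images to whatever --img-size is given, and that value only has to be a multiple of the model's maximum stride (32; it gets rounded up otherwise), so 896 should be accepted directly. A minimal sketch, reusing our data and hyp configs from the other commands here (the yolov5s.pt checkpoint and the run name are just placeholders):

python -m torch.distributed.launch --nproc_per_node 8 train.py \
--sync-bn \
--weights yolov5s.pt \
--data ~/PAI/scripts/config_yolov5.yaml \
--hyp ~/PAI/scripts/yolo_custom_hyp.yaml \
--epochs 300 \
--batch-size 64 \
--img-size 896 896 \
--name yolov5s_b8_e300_img896_hyp_custom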

valentinitnelav commented 2 years ago

Here are the distributions of image width and height (as read with PIL).

mean(dt$width_pil, na.rm = TRUE)
# [1] 1212.105

mean(dt$height_pil, na.rm = TRUE)
# [1] 1020.497

[histogram of image widths]

[histogram of image heights]

valentinitnelav commented 2 years ago

I just ran a training job on the V100 GPUs (32 GB RAM each) and still got a CUDA out-of-memory error:

RuntimeError: CUDA out of memory. 
Tried to allocate 100.00 MiB (GPU 0; 31.75 GiB total capacity; 30.09 GiB already allocated; 41.50 MiB free; 30.54 GiB reserved in total by PyTorch)
The training call was:

python -m torch.distributed.launch --nproc_per_node 4 train.py \
--sync-bn \
--weights ~/PAI/detectors/ScaledYOLOv4/weights/yolov4-p6.pt \
--data ~/PAI/scripts/config_yolov5.yaml \
--hyp ~/PAI/scripts/yolo_custom_hyp.yaml \
--epochs 300 \
--batch-size 64 \
--img-size 1280 1280 \
--nosave \
--name yolov4_scaled_p6_b8_e300_hyp_custom

I will check whether the ScaledYOLOv4 yolov4-tiny weights work at the 1280 resolution by simply plugging them into the script above. This is my last idea for fixing this.
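
For reference, per-GPU memory grows roughly with the per-GPU batch size times the square of the input resolution, so the usual knobs before moving to bigger hardware are a smaller --batch-size or a smaller --img-size. A sketch of the same call with both reduced (not a run from this thread; 16 = 4 GPUs * 4 images per GPU, and 896 instead of 1280 roughly halves the activation memory):

python -m torch.distributed.launch --nproc_per_node 4 train.py \
--sync-bn \
--weights ~/PAI/detectors/ScaledYOLOv4/weights/yolov4-p6.pt \
--data ~/PAI/scripts/config_yolov5.yaml \
--hyp ~/PAI/scripts/yolo_custom_hyp.yaml \
--epochs 300 \
--batch-size 16 \
--img-size 896 896 \
--nosave \
--name yolov4_scaled_p6_b4_e300_img896_hyp_custom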

valentinitnelav commented 2 years ago

Strange error message when trying to use the tiny weights version:

python -m torch.distributed.launch --nproc_per_node 8 train.py \
--sync-bn \
--weights ~/PAI/detectors/ScaledYOLOv4/weights/yolov4-tiny.weights \
--data ~/PAI/scripts/config_yolov5.yaml \
--hyp ~/PAI/scripts/yolo_custom_hyp.yaml \
--epochs 300 \
--batch-size 64 \
--img-size 1280 1280 \
--name yolov4_scaled_tiny_b8_e300_hyp_custom
Traceback (most recent call last):
  File "train.py", line 416, in <module>
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/ScaledYOLOv4/utils/torch_utils.py", line 37, in select_device
    assert batch_size % ng == 0, 'batch-size %g not multiple of GPU count %g' % (batch_size, ng)
AssertionError: batch-size 64 not multiple of GPU count 7

I requested a node with 8 GPUs as usual, and the total batch size is 8 GPUs * 8 images per GPU = 64. I am not sure what I am doing wrong. This is the sbatch request (a guard sketch follows it):

#!/bin/bash
#SBATCH --job-name=train_yolov4_scaled # name for the job;
#SBATCH --partition=clara-job # Request for the Clara cluster;
#SBATCH --nodes=1 # Number of nodes;
#SBATCH --cpus-per-task=32 # Number of CPUs;
#SBATCH --gres=gpu:rtx2080ti:8 # Type and number of GPUs;
#SBATCH --mem-per-gpu=11G # RAM per GPU;
#SBATCH --time=5-00:00:00 # requested time in d-hh:mm:ss
#SBATCH --output=/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/logs_train_jobs/%j.log # path for job-id.log file;
#SBATCH --error=/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/logs_train_jobs/%j.err # path for job-id.err file;
#SBATCH --mail-type=BEGIN,TIME_LIMIT,END # email options;
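
The assertion above comes from select_device() in utils/torch_utils.py, where the GPU count appears to be whatever torch.cuda.device_count() returns; if one of the eight requested GPUs is unhealthy (which turned out to be the case, see below), 64 % 7 fails. A hedged sketch of a guard that could go in the job script before the training call, sizing the launch from the GPUs PyTorch can actually see instead of hard-coding 8 (the variable names are mine, and it assumes the Python environment with torch is already loaded at that point):

# Sketch: derive the launch size at runtime from the visible GPUs.
NGPU=$(python -c "import torch; print(torch.cuda.device_count())")
PER_GPU_BATCH=8
BATCH=$(( NGPU * PER_GPU_BATCH ))   # keeps batch-size divisible by the GPU count

python -m torch.distributed.launch --nproc_per_node "$NGPU" train.py \
--sync-bn \
--weights ~/PAI/detectors/ScaledYOLOv4/weights/yolov4-tiny.weights \
--data ~/PAI/scripts/config_yolov5.yaml \
--hyp ~/PAI/scripts/yolo_custom_hyp.yaml \
--epochs 300 \
--batch-size "$BATCH" \
--img-size 1280 1280 \
--name yolov4_scaled_tiny_b8_e300_hyp_custom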

EDIT:

I just noticed that there is no tiny yaml configuration file in ScaledYOLOv4/models/. There is, however, a configuration file in the branch for yolov4-tiny, yolov4-tiny.cfg, but it is a Darknet .cfg file and not a .yaml.

valentinitnelav commented 2 years ago

I will proceed with setting up the environment for this YOLOv4 repo: https://github.com/WongKinYiu/PyTorch_YOLOv4 and then run all the trainings with the 640-pixel weights for yolov5 & v7 as well.

valentinitnelav commented 2 years ago

It looks like there might be an issue with the cluster. I just checked the log files and saw that no GPU device was detected (as in the past).

nvidia-smi
# Unable to determine the device handle for GPU 0000:06:00.0: Unknown Error
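
A related sketch: checking the GPUs at the very top of the sbatch script would make the job fail immediately (so it can be resubmitted) instead of dying later in select_device(). EXPECTED_GPUS is an assumption matching the --gres request, and the torch call again assumes the Python environment is already loaded:

# Sketch: abort early if the node's GPUs are not all usable.
EXPECTED_GPUS=8
if ! nvidia-smi > /dev/null 2>&1; then
    echo "nvidia-smi failed - GPU driver problem on this node" >&2
    exit 1
fi
VISIBLE_GPUS=$(python -c "import torch; print(torch.cuda.device_count())")
if [ "$VISIBLE_GPUS" -ne "$EXPECTED_GPUS" ]; then
    echo "only $VISIBLE_GPUS of $EXPECTED_GPUS GPUs are visible to PyTorch - aborting" >&2
    exit 1
fi
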
valentinitnelav commented 2 years ago

So, for a yolov4-tiny version of ScaledYOLOv4, there is no yaml configuration file in ScaledYOLOv4/models/. I tried to see if by any miracle it might read the yolov4-tiny.cfg, but it didn't work.

python -m torch.distributed.launch --nproc_per_node 8 train.py \
--sync-bn \
--cfg ~/PAI/detectors/ScaledYOLOv4/models/yolov4-tiny.cfg \ # it expects a yaml file, which doesn't exist in the ScaledYOLOv4 repository
--weights ~/PAI/detectors/ScaledYOLOv4/weights/yolov4-tiny.weights \
--data ~/PAI/scripts/config_yolov5.yaml \
--hyp ~/PAI/scripts/yolo_custom_hyp.yaml \
--epochs 300 \
--batch-size 64 \
--img-size 1280 1280 \
--name yolov4_scaled_tiny_b8_e300_img1280_hyp_custom

It gave this error (the same traceback is printed by each of the distributed worker processes):

    self.current_event = self.state()
  File "/software/all/PyYAML/5.1.2-GCCcore-8.3.0/lib/python3.7/site-packages/yaml/parser.py", line 174, in parse_document_start
    self.peek_token().start_mark)
yaml.parser.ParserError: expected '<document start>', but found '<scalar>'
  in "/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/ScaledYOLOv4/models/yolov4-tiny.cfg", line 6, column 1

And if I don't pass anything to --cfg, I get this error:

 File "/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/ScaledYOLOv4/models/yolo.py", line 65, in __init__
        with open(cfg) as f:with open(cfg) as f:

FileNotFoundErrorFileNotFoundError: : [Errno 2] No such file or directory: ''[Errno 2] No such file or directory: ''
valentinitnelav commented 2 years ago

So, I guess there is no tiny version for the ScaledYOLOv4 repository. Maybe it might work if we write ourselves a yaml file based on the cfg file from the yolov4-tiny branch, but that is not something I can do with the knowledge I have at the moment :D

https://github.com/WongKinYiu/ScaledYOLOv4/issues/52

valentinitnelav commented 2 years ago

Fixed by reducing the image size to 640: we drop ScaledYOLOv4 and use PyTorch_YOLOv4 instead. See commit 2b9cf6aff2299b5fa177150f37104120c1f400fd
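
For the record, a heavily hedged sketch of what the 640-pixel PyTorch_YOLOv4 run could look like; the argument names are assumed to mirror the ultralytics-style interface used above, and the cfg/weights paths are placeholders, so everything should be checked against PyTorch_YOLOv4's own train.py:

python -m torch.distributed.launch --nproc_per_node 8 train.py \
--sync-bn \
--cfg ~/PAI/detectors/PyTorch_YOLOv4/cfg/yolov4.cfg \
--weights ~/PAI/detectors/PyTorch_YOLOv4/weights/yolov4.weights \
--data ~/PAI/scripts/config_yolov5.yaml \
--hyp ~/PAI/scripts/yolo_custom_hyp.yaml \
--epochs 300 \
--batch-size 64 \
--img-size 640 640 \
--name yolov4_pytorch_b8_e300_img640_hyp_custom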