Let me mention that training on a dataset with 6K inputs and 1 class works great. However, training with 300 classes on the same 6000-image dataset causes the following error:
------------CPU Mode for This Batch-------------
2022-05-25 13:52:53 | INFO | yolox.models.yolo_head:335 - OOM RuntimeError is raised due to the huge memory cost during label assignment.
CPU mode is applied in this batch. If you want to avoid this issue, try to reduce the batch size or image size.
OOM RuntimeError is raised due to the huge memory cost during label assignment.
CPU mode is applied in this batch. If you want to avoid this issue, try to reduce the batch size or image size.
Training continues for some time, then quits completely with "CUDA out of memory":
RuntimeError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0; 7.79 GiB total capacity; 6.20 GiB already allocated; 21.44 MiB free; 6.26 GiB reserved in total by PyTorch)
I'm trying to fix it. Any help is welcome, so please drop your solutions/thoughts here. As soon as I find a solution, I'll share it here too. Thank you.
I'm running on 2 GPUs: a GTX 1070 and an RTX 3070. That should be plenty; the platform needs more optimization.
P.S. Lowering the batch size doesn't help. I set the batch to "-b 2" and devices to "-d 2", i.e. 1 image per GPU. Can't go lower than that.
Image size is set to:
self.input_size = (256, 512)
self.test_size = (256, 512)
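For context, these sizes are set in the experiment file. A minimal sketch of the relevant `Exp` subclass is below; the class count and depth/width values are from my setup, and the exact base-class import path is the standard YOLOX one (adjust if your version differs):

```python
# Trimmed sketch of a YOLOX experiment file (config fragment, not full training code).
from yolox.exp import Exp as MyExp


class Exp(MyExp):
    def __init__(self):
        super().__init__()
        # 300 classes is the configuration that triggers the OOM behavior.
        self.num_classes = 300
        # Low-resolution (height, width) input to keep memory usage down.
        self.input_size = (256, 512)
        self.test_size = (256, 512)
```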
That's also a very low resolution.
I just tried cutting the inputs down to 10 images while keeping the 300 classes. The issue still reproduces with the same low resolution and the same 1 image per GPU. So the issue is definitely caused by the high number of classes.