Closed alexk-ede closed 2 years ago
👋 Hello @alexk-ede, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.
If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.
For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.
Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:
git clone https://github.com/ultralytics/yolov5 # clone
cd yolov5
pip install -r requirements.txt # install
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.
Hi and happy monday to you.
I use autobatch quite frequently and also just updated my local yolov5 today so I took a look.
GTX 1080 (8GB), PyTorch 1.12, Nvidia driver 515, Cuda 11.7, Fedora 36
When running a training similar to yours, i.e. with input size 416 on Coco128 (I assume that is what you mean by a slice of COCO), I get the warning too.
Transferred 481/481 items from yolov5m.pt
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 416
AutoBatch: CUDA:0 (NVIDIA GeForce GTX 1080) 7.92G total, 2.53G reserved, 0.16G allocated, 5.24G free
Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
21190557 20.8 2.816 34.12 119.1 (1, 3, 416, 416) list
21190557 41.61 2.852 31.86 149.8 (2, 3, 416, 416) list
21190557 83.21 2.842 30.95 173.2 (4, 3, 416, 416) list
21190557 166.4 2.772 46.22 201.2 (8, 3, 416, 416) list
21190557 332.9 2.741 82.16 318.4 (16, 3, 416, 416) list
AutoBatch: WARNING: ⚠️ CUDA anomaly detected, recommend restart environment and retry command.
AutoBatch: Using batch-size 16 for CUDA:0 2.74G/7.92G (35%) ✅
optimizer: SGD(lr=0.01) with parameter groups 79 weight(decay=0.0), 82 weight(decay=0.0005), 82 bias
train: Scanning '/home/xxx/ai_dev/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
val: Scanning '/home/xxx/ai_dev/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
AutoAnchor: 4.18 anchors/target, 0.977 Best Possible Recall (BPR). Anchors are a poor fit to dataset ⚠️, attempting to improve...
AutoAnchor: WARNING: Extremely small objects found: 16 of 929 labels are < 3 pixels in size
AutoAnchor: Running kmeans for 9 anchors on 927 points...
AutoAnchor: Evolving anchors with Genetic Algorithm: fitness = 0.6699: 100%|██████████| 1000/1000 [00:00<00:00, 1421.07it/s]
AutoAnchor: thr=0.25: 0.9935 best possible recall, 3.75 anchors past thr
AutoAnchor: n=9, img_size=416, metric_all=0.263/0.670-mean/best, past_thr=0.477-mean: 6,9, 16,14, 21,35, 55,47, 70,94, 80,188, 190,139, 216,249, 388,283
AutoAnchor: Done ✅ (optional: update model *.yaml to use these anchors in the future)
Plotting labels to runs/train/exp7/labels.jpg...
Image sizes 416 train, 416 val
Using 8 dataloader workers
Logging results to runs/train/exp7
Starting training for 300 epochs...
When I do not pass an input size and the default of 640 is used, I do not get the CUDA environment warning.
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce GTX 1080) 7.92G total, 2.53G reserved, 0.16G allocated, 5.24G free
Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
21190557 49.24 2.802 46.01 252.1 (1, 3, 640, 640) list
21190557 98.48 2.751 36.03 286 (2, 3, 640, 640) list
21190557 197 2.724 50.43 334.9 (4, 3, 640, 640) list
21190557 393.9 2.810 94.75 416.9 (8, 3, 640, 640) list
21190557 787.8 5.492 180.2 560.2 (16, 3, 640, 640) list
AutoBatch: Using batch-size 11 for CUDA:0 4.18G/7.92G (53%) ✅
optimizer: SGD(lr=0.01) with parameter groups 79 weight(decay=0.0), 82 weight(decay=0.000515625), 82 bias
train: Scanning '/home/xxx/ai_dev/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
val: Scanning '/home/xxx/ai_dev/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]
AutoAnchor: 4.27 anchors/target, 0.994 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to runs/train/exp6/labels.jpg...
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp6
Starting training for 300 epochs...
I too have this random chunk of 2.53G in my GPU memory. I do not know what it is either, but it does not match my usage in nvitop before training starts (around 500-600 MB, with GNOME desktop and Xorg running). Checking back on autobatch trainings from the initial release of v6.2, I do see:
AutoBatch: Computing optimal batch size for --imgsz 1280
AutoBatch: CUDA:0 (NVIDIA A100-SXM-80GB) 79.35G total, 0.11G reserved, 0.10G allocated, 79.14G free
1660813541500 os-nvme-hbwt3yi6-2a100-44v-fin1 info Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
12403204 65.97 0.872 29.57 20.63 (1, 3, 1280, 1280) list
12403204 131.9 1.629 26.51 22.46 (2, 3, 1280, 1280) list
12403204 263.9 3.181 31.44 28.91 (4, 3, 1280, 1280) list
1660813542584 os-nvme-hbwt3yi6-2a100-44v-fin1 info 12403204 527.8 5.981 33.11 43.87 (8, 3, 1280, 1280) list
12403204 1056 12.092 54.42 79.66 (16, 3, 1280, 1280) list
1660813543232 os-nvme-hbwt3yi6-2a100-44v-fin1 error AutoBatch: Using batch-size 95 for CUDA:0 70.93G/79.35G (89%) ✅
optimizer: SGD(lr=0.01) with parameter groups 75 weight(decay=0.0), 79 weight(decay=0.0007421875), 79 bias
Hi @Denizzje and a happy start of the week to you, too. The monday wasn't so bad actually ;)
I'll add another data point today. This time yolov5m
Transferred 475/481 items from yolov5m.pt
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 2.52G reserved, 0.16G allocated, 5.11G free
Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
20879400 48.25 2.798 33.85 180.3 (1, 3, 640, 640) list
20879400 96.49 2.749 14.16 186.3 (2, 3, 640, 640) list
20879400 193 2.718 20.15 218.2 (4, 3, 640, 640) list
20879400 386 2.720 37 282.9 (8, 3, 640, 640) list
20879400 771.9 5.207 71.34 334.4 (16, 3, 640, 640) list
AutoBatch: Using batch-size 11 for CUDA:0 4.01G/7.79G (51%) ✅
But manually a batch of 16 is fine.
So yeah, looks like the 2.52G reserved do interfere, distort the testing for the size and then make the interpolation invalid. Maybe it'd be better to go by the 0.16G allocated instead, because that is also what nvtop is basically showing.
Instead of just saying "anomaly detected", it would also be useful to add a hint below it that the initial reserved VRAM is suspiciously high.
Update: I forgot to mention that for your img 640 test, despite no warning, the autobatch result is not optimal either, as the first 4 results are all around 2.8GB:
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce GTX 1080) 7.92G total, 2.53G reserved, 0.16G allocated, 5.24G free
Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
21190557 49.24 2.802 46.01 252.1 (1, 3, 640, 640) list
21190557 98.48 2.751 36.03 286 (2, 3, 640, 640) list
21190557 197 2.724 50.43 334.9 (4, 3, 640, 640) list
21190557 393.9 2.810 94.75 416.9 (8, 3, 640, 640) list
21190557 787.8 5.492 180.2 560.2 (16, 3, 640, 640) list
AutoBatch: Using batch-size 11 for CUDA:0 4.18G/7.92G (53%) ✅
@alexk-ede AutoBatch may produce inaccurate results under certain circumstances, i.e. when previous trainings are in progress or have terminated early or not all CUDA memory has been released. If you find ways to improve please let us know, the relevant code is here: https://github.com/ultralytics/yolov5/blob/master/utils/autobatch.py
Yes, I saw that file after I was investigating where the warning message came from. That's where I learned about the interpolation, too.
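For readers following along, here is a minimal sketch of that interpolation. It assumes the first-degree `np.polyfit` fit and the `int((f * fraction - p[1]) / p[0])` step quoted from `utils/autobatch.py` in the traceback further down this thread, with `f` taken to be the "free" figure from the log. The function name `estimate_batch_size` is made up for illustration and is not part of YOLOv5.

```python
import numpy as np

def estimate_batch_size(batch_sizes, mem_gb, free_gb, fraction=0.8):
    """Fit mem = slope * batch + intercept, then solve for the batch size
    whose predicted memory equals fraction * free_gb."""
    slope, intercept = np.polyfit(batch_sizes, mem_gb, deg=1)
    batch = int((free_gb * fraction - intercept) / slope)
    return batch, float(slope), float(intercept)

batch_sizes = [1, 2, 4, 8, 16]
# GPU_mem (GB) column from the RTX 3070 run above (2.52G reserved, 5.11G free)
mem_gb = [2.798, 2.749, 2.718, 2.720, 5.207]

b, slope, intercept = estimate_batch_size(batch_sizes, mem_gb, free_gb=5.11)
print(b, round(slope, 3), round(intercept, 3))
```

With these numbers the fit gives a batch size of 11, the same value AutoBatch logged. The first four readings are nearly flat, so the fitted intercept ends up above 2 GB: the line treats the reserved chunk as a fixed cost and leaves little room for the batch itself.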
i.e. when previous trainings are in progress or have terminated early or not all CUDA memory has been released.
I could understand that, if there was something using the GPU before, then that may be plausible.
But as I said, this is a completely fresh boot and fresh start of the environment. Nothing was run before, and obviously no trainings in progress, as it says 0.16G allocated. I still have absolutely no clue, where this 2.53G reserved are coming from (and what it means). Sure, there is some documentation, but not quite helpful. https://pytorch.org/docs/stable/generated/torch.cuda.memory_reserved.html
This looks like it makes more sense https://pytorch.org/docs/stable/generated/torch.cuda.memory_allocated.html
I'd obviously prefer to have a command that gives me the same output as nvtop.
I'll try later to use it here with memory_allocated instead https://github.com/ultralytics/yolov5/blob/1aea74cddbc78e7f79dac07090cb157dfc24dbcc/utils/torch_utils.py#L189 as it's called by autobatch here https://github.com/ultralytics/yolov5/blob/1aea74cddbc78e7f79dac07090cb157dfc24dbcc/utils/autobatch.py#L51
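As a rough sketch of how the three numbers relate: `torch.cuda.memory_allocated` counts tensors owned by the current process, `torch.cuda.memory_reserved` counts what PyTorch's caching allocator holds for that process, and neither sees other processes. Newer PyTorch releases also expose `torch.cuda.mem_get_info`, which reports device-wide free and total bytes; whether it is available in the exact version used in this thread is an assumption. The helper names below are hypothetical.

```python
GIB = 1024 ** 3

def summarize(free_bytes, total_bytes, allocated_bytes, reserved_bytes):
    """Combine device-wide and per-process numbers into one report (GiB)."""
    used = total_bytes - free_bytes
    return {
        "device_used": round(used / GIB, 2),           # roughly what nvtop/nvidia-smi show
        "torch_allocated": round(allocated_bytes / GIB, 2),
        "torch_reserved": round(reserved_bytes / GIB, 2),
        # other processes plus CUDA context overhead of this process
        "other": round((used - reserved_bytes) / GIB, 2),
    }

def cuda_memory_report(device=0):
    """Query a live GPU; torch is imported lazily so the helper above
    stays usable on machines without PyTorch."""
    import torch
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return summarize(
        free_bytes,
        total_bytes,
        torch.cuda.memory_allocated(device),
        torch.cuda.memory_reserved(device),
    )

# Usage on a CUDA machine (not executed here):
# print(cuda_memory_report(0))
```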
I decided I'll test what will happen, when I'll run this during a training session that already uses most of the gpu memory.
def autobatch(model, imgsz=640, fraction=0.8, batch_size=16):
    # Automatically estimate best batch size to use `fraction` of available CUDA memory
    # Usage:
    #     import torch
    #     from utils.autobatch import autobatch
    #     model = torch.hub.load('ultralytics/yolov5', 'yolov5s', autoshape=False)
    #     print(autobatch(model))
Turns out it fails on the first try and the results list doesn't get initialized at all.
In [4]: print(autobatch(model))
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 0.04G reserved, 0.03G allocated, 7.73G free
Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 7.79 GiB total capacity; 37.38 MiB already allocated; 6.06 MiB free; 56.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 7.79 GiB total capacity; 46.76 MiB already allocated; 6.06 MiB free; 56.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
AutoBatch: CUDA out of memory. Tried to allocate 38.00 MiB (GPU 0; 7.79 GiB total capacity; 28.01 MiB already allocated; 26.06 MiB free; 36.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
---------------------------------------------------------------------------
UnboundLocalError Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 print(autobatch(model))
File ~/repo/yolov5/utils/autobatch.py:56, in autobatch(model, imgsz, fraction, batch_size)
53 LOGGER.warning(f'{prefix}{e}')
55 # Fit a solution
---> 56 y = [x[2] for x in results if x] # memory [2]
57 p = np.polyfit(batch_sizes[:len(y)], y, deg=1) # first degree polynomial fit
58 b = int((f * fraction - p[1]) / p[0]) # y intercept (optimal batch size)
UnboundLocalError: local variable 'results' referenced before assignment
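The traceback suggests that `results` is only assigned inside a `try` block, so when profiling raises, the later list comprehension touches an unbound name. Below is a minimal, self-contained sketch of that pattern and one possible guard. `profile_or_fail`, `unsafe`, `safe`, and the fallback value are hypothetical stand-ins, not the actual YOLOv5 code.

```python
def profile_or_fail(fail):
    """Hypothetical stand-in for the profiling call; raises like the OOM above."""
    if fail:
        raise RuntimeError("CUDA out of memory (simulated)")
    return [(0, 0, 0.5), (0, 0, 0.9)]  # rows where index 2 is memory in GB

def unsafe(fail):
    try:
        results = profile_or_fail(fail)
    except Exception as e:
        print(f"warning: {e}")
    # results was never bound if the call above raised
    return [x[2] for x in results if x]

def safe(fail, fallback=16):
    results = []  # bind the name before the try block
    try:
        results = profile_or_fail(fail)
    except Exception as e:
        print(f"warning: {e}")
    mem = [x[2] for x in results if x]
    return mem if mem else fallback  # fall back when nothing was measured

print(safe(True))
print(safe(False))
```

Returning the caller's default batch size when no measurement succeeded is only one option; raising a clearer error would be another.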
But in this case 0.03G allocated is also completely wrong, because the real usage is
GPU[||||||||||||||||||||||||||||||||92%] MEM[||||||||||||||||||||7.975Gi/8.000Gi]
So as planned, I'll try memory_allocated later instead, but it also yields weird results. Now I see where the main problem is ...
This is quite weird, I just quickly tested this demo code while a training is running and using 6.6GB VRAM.
import torch
print(torch.__version__)
my_tensor = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32, device="cpu")
print(my_tensor)
torch.cuda.is_available()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('Using device:', device)
print()
print("torch.cuda.is_available()")
print(torch.cuda.is_available())
# Additional info when using CUDA
if device.type == 'cuda':
    print("torch.cuda.current_device()")
    print(torch.cuda.current_device())
    print("torch.cuda.device(0)")
    print(torch.cuda.device(0))
    print("torch.cuda.get_device_name(0)")
    print(torch.cuda.get_device_name(0))
    print()
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')
Instead of the 6.6GB that is actually in use, I only get this:
<torch.cuda.device object at 0x7f24a6bb09d0>
torch.cuda.get_device_name(0)
NVIDIA GeForce RTX 3070
Memory Usage:
Allocated: 0.0 GB
Cached: 0.0 GB
So I don't know how to fix that with only using the interface that torch provides.
The command nvidia-smi --query-gpu=memory.used --format=csv,nounits,noheader
outputs the exact current usage that is also seen in nvtop.
So sure, one could add some functions like these to extract the GPU memory usage, but it's also not a very clean solution https://www.programcreek.com/python/?CodeExample=get+gpu+memory
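A small sketch of what such a helper could look like, wrapping the exact `nvidia-smi` query quoted above. It assumes `nvidia-smi` is on the PATH and prints one MiB value per GPU; the function names are made up for illustration.

```python
import subprocess

def parse_memory_used(output):
    """Parse nvidia-smi CSV output (one MiB value per GPU line) into ints."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]

def query_memory_used_mib():
    """Run the nvidia-smi query from this thread; needs nvidia-smi on PATH."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,nounits,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_memory_used(out)

# Usage on a machine with an NVIDIA driver (not executed here):
# print(query_memory_used_mib())
```

Parsing is kept separate from the subprocess call so it can be checked without a GPU. Shelling out adds a dependency on the driver tools, which is part of why this is not a very clean solution.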
There does seem to be something very wrong with the auto batch size at the moment. I believe it started after the "CUDA anomaly detected" check was implemented, though I have not done many trainings since a big batch right after the release of 6.2.
This time, I tried with the latest YoloV5 from master, PyTorch 1.12, Ubuntu 20.04, Python 3.8 with Nvidia drivers 515 and CUDA 11.7 on an A100 80GB SXM GPU. The dataset is my regular dataset of ~40k training pictures this time, so not Coco128. It spits out the CUDA anomaly warning and then proceeds with a batch size of 16...
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA A100-SXM4-80GB) 79.21G total, 4.85G reserved, 0.16G allocated, 74.20G free
2022-09-16 15:14:19 Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
2022-09-16 15:14:20 20964261 48.52 5.283 101 133.7 (1, 3, 640, 640) list
2022-09-16 15:14:22 20964261 97.03 5.232 22.21 245.8 (2, 3, 640, 640) list
2022-09-16 15:14:23 20964261 194.1 5.203 18.63 158.6 (4, 3, 640, 640) list
2022-09-16 15:14:25 20964261 388.1 5.205 17.41 258.5 (8, 3, 640, 640) list
2022-09-16 15:14:27 20964261 776.3 5.203 23.9 318.2 (16, 3, 640, 640) list
2022-09-16 15:14:27
AutoBatch: WARNING: ⚠️ CUDA anomaly detected, recommend restart environment and retry command.
AutoBatch: Using batch-size 16 for CUDA:0 5.19G/79.21G (7%) ✅
optimizer: SGD(lr=0.01) with parameter groups 79 weight(decay=0.0), 82 weight(decay=0.0005), 82 bias
For reference, this was an earlier training with the same machine but with PyTorch 1.10, Cuda 11.3 on YoloV5 release 6.2 (not master from that time), with an earlier version of the same dataset (size is roughly the same though).
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA A100-SXM-80GB) 79.35G total, 0.29G reserved, 0.27G allocated, 78.78G free
2022-08-18 08:26:33
Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
35397156 49.54 0.698 40.71 24.46 (1, 3, 640, 640) list
35397156 99.07 1.028 34.32 23.26 (2, 3, 640, 640) list
35397156 198.1 1.707 39.83 25.03 (4, 3, 640, 640) list
2022-08-18 08:26:34
35397156 396.3 3.108 36.23 27.72 (8, 3, 640, 640) list
35397156 792.6 5.895 46.38 39.48 (16, 3, 640, 640) list
2022-08-18 08:26:34
AutoBatch: Using batch-size 203 for CUDA:0 70.82G/79.35G (89%) ✅
optimizer: SGD(lr=0.01) with parameter groups 103 weight(decay=0.0), 107 weight(decay=0.0015859375), 107 bias
After my fixed size training run (batch size 128) is finished I will try to redo the autobatch on Yolov5 release 6.2. If this 80GB card is convinced there as well that it can only fit a batch size of 16 in its memory, then the cause is somewhere else. I am curious to see what happens if I retry with PyTorch 1.10 with latest master code.
@Denizzje yes I'm able to reproduce in Colab. Something is not correct. I'll add a TODO to investigate.
@Denizzje good news 😃! Your original issue may now be fixed ✅ in PR #9448. This avoids setting cudnn.benchmark=True on init_seeds(), and also adds a check to AutoBatch that this setting is not in place. After these changes AutoBatch now works correctly.
To receive this update:
Git: git pull from within your yolov5/ directory, or git clone https://github.com/ultralytics/yolov5 again
PyTorch Hub: model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
Docker: sudo docker pull ultralytics/yolov5:latest to update your image
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
Awesome @glenn-jocher, did not expect this on a Friday evening hehe. "Unfortunately" the A100 is still training and my GTX 1080 really can't handle my dataset properly anymore, so I will wait until it's finished, then give it another try after pulling and report back ASAP whether it can find its memory this time ;).
Top of the morning, @glenn-jocher ,
Happy to confirm that the A100 is now convinced it actually has 80GB of VRAM, and autobatch now gives me a batch size of 192. The "CUDA anomaly detected" warning is gone too. This was even a "dirty" start: I didn't open a new terminal or reboot the system after my previous training.
Transferred 475/481 items from yolov5m.pt
2022-09-17 10:27:12
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA A100-SXM4-80GB) 79.21G total, 0.25G reserved, 0.16G allocated, 78.80G free
2022-09-17 10:27:12
Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
20964261 48.52 0.543 58.52 23.66 (1, 3, 640, 640) list
20964261 97.03 0.858 34.5 21.23 (2, 3, 640, 640) list
20964261 194.1 1.571 33.96 22.95 (4, 3, 640, 640) list
2022-09-17 10:27:13
20964261 388.1 2.917 35.03 25.31 (8, 3, 640, 640) list
2022-09-17 10:27:14
20964261 776.3 5.415 35.45 35.17 (16, 3, 640, 640) list
2022-09-17 10:27:14
AutoBatch: Using batch-size 192 for CUDA:0 62.74G/79.21G (79%) ✅
optimizer: SGD(lr=0.01) with parameter groups 79 weight(decay=0.0), 82 weight(decay=0.0015), 82 bias
Glad to see this very useful function back in action, and thanks again for your quick work last night 😄. It's not my issue, so I can't close it, but @alexk-ede is hopefully fine too after pulling the latest code from master.
@Denizzje great!
BTW we used to target 90% memory utilization but had some issues with smaller cards going over during training, which is why we dropped back to an 80% target. You can modify this fraction variable here:
https://github.com/ultralytics/yolov5/blob/5e1a9553fbed73995c9b81e63ba41cc70fdf89de/utils/autobatch.py#L21-L28
Hi everyone, looks like it's going to be a good Monday today ;)
And indeed, it seems to work fine right now.
Transferred 475/481 items from yolov5m.pt
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 416
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 0.24G reserved, 0.16G allocated, 7.39G free
Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
20883441 20.39 0.371 40.03 17.61 (1, 3, 416, 416) list
20883441 40.78 0.482 20.08 13.69 (2, 3, 416, 416) list
20883441 81.56 0.778 22.58 14.58 (4, 3, 416, 416) list
20883441 163.1 1.277 21.24 19 (8, 3, 416, 416) list
20883441 326.2 2.374 31.53 34.19 (16, 3, 416, 416) list
AutoBatch: Using batch-size 42 for CUDA:0 5.84G/7.79G (75%) ✅
MEM[||||||||||||||||||||7.711Gi/8.000Gi]
I'm just not sure where the (75%) ✅ are coming from, if fraction=0.8 ... I'll check if there are some remains from my tests or not, but shouldn't be as I just checked out the latest master.
I'll have a few train runs to do soon, so I'll report back.
And yes, having it <= 80% makes sense, bc I also noticed, despite showing GPU_mem 6.42G in the epoch, the actual used gpu mem is what nvtop reports 7.711G . I guess this is some constant overhead whatever, so I expect it to be less noticeable on bigger systems.
@Denizzje what does your nvtop report when you have 62.74G/79.21G (79%) ✅?
@alexk-ede 80% is the requested utilization, 75% is the predicted utilization (actual utilization will vary and is sometimes substantially different).
It's possible some of the difference is coming from running AutoBatch only on the free memory vs total memory displayed later.
@alexk-ede maybe I should re-add allocated and reserved amounts to the predicted amount for the final utilisation. This should be closer to 80%.
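To make the 75% vs. 80% gap concrete, here is the arithmetic with the numbers from the RTX 3070 log above (5.84G predicted, 7.79G total, 0.24G reserved, 0.16G allocated). This is only an illustration of the idea; the helper name is hypothetical and the way the eventual change computes the figure is an assumption.

```python
def utilization_pct(predicted_gb, total_gb, reserved_gb=0.0, allocated_gb=0.0):
    """Predicted utilization as a percentage of total device memory."""
    return 100 * (predicted_gb + reserved_gb + allocated_gb) / total_gb

# Fit-only prediction, as displayed in the log
fit_only = round(utilization_pct(5.84, 7.79))
# Same prediction with reserved and allocated memory added back in
with_overhead = round(utilization_pct(5.84, 7.79, reserved_gb=0.24, allocated_gb=0.16))
print(fit_only, with_overhead)
```

The requested fraction is applied to the 7.39G reported as free, while the displayed percentage is relative to the 7.79G total, which is where most of the difference comes from.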
@alexk-ede good news 😃! Your original issue may now be fixed ✅ in PR #9491. This PR adds reserved and allocated memory to the final estimated utilization rate displayed, which should result in a value closer to the default requested 80%.
To receive this update:
Git: git pull from within your yolov5/ directory, or git clone https://github.com/ultralytics/yolov5 again
PyTorch Hub: model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
Docker: sudo docker pull ultralytics/yolov5:latest to update your image
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
Hi, ok, well looks like the utilization is now too close ;) So here is an overview:
This one worked (but it was with PR https://github.com/ultralytics/yolov5/pull/9448 and before PR https://github.com/ultralytics/yolov5/pull/9491, and only at 416 resolution):
Transferred 343/349 items from yolov5n.pt
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 416
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 0.04G reserved, 0.01G allocated, 7.74G free
Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
1769329 1.79 0.069 24.43 10.13 (1, 3, 416, 416) list
1769329 3.58 0.109 11.23 9.78 (2, 3, 416, 416) list
1769329 7.161 0.187 12.8 9.658 (4, 3, 416, 416) list
1769329 14.32 0.327 12.59 11.1 (8, 3, 416, 416) list
1769329 28.64 0.707 13.2 13.58 (16, 3, 416, 416) list
AutoBatch: Using batch-size 146 for CUDA:0 6.24G/7.79G (80%) ✅
nvtop:
MEM[||||||||||||||||||||7.400Gi/8.000Gi]
GPU_mem
5.95G
but these ones failed: now with PR https://github.com/ultralytics/yolov5/pull/9491
Transferred 343/349 items from yolov5n.pt
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 0.04G reserved, 0.01G allocated, 7.74G free
Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
1769329 4.237 0.115 33.26 13.79 (1, 3, 640, 640) list
1769329 8.474 0.218 12.77 10.72 (2, 3, 640, 640) list
1769329 16.95 0.419 13.02 11.39 (4, 3, 640, 640) list
1769329 33.9 0.875 13.27 14.14 (8, 3, 640, 640) list
1769329 67.79 1.705 16.68 21.99 (16, 3, 640, 640) list
AutoBatch: Using batch-size 58 for CUDA:0 6.23G/7.79G (80%) ✅
nvtop:
GPU[||||||||||||||||||||||||||||||||90%] MEM[||||||||||||||||||||7.742Gi/8.000Gi]
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
0/399 6.26G 0.106 0.03965 0.0384 340 640: 17%|█▋ | 76/439 [00:20<01:20, 4.50it/s]
Then I tried setting to 75% instead
Transferred 343/349 items from yolov5n.pt
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 0.04G reserved, 0.01G allocated, 7.74G free
Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
1769329 4.237 0.115 30.87 13.88 (1, 3, 640, 640) list
1769329 8.474 0.218 13.05 10.97 (2, 3, 640, 640) list
1769329 16.95 0.419 13.39 11.45 (4, 3, 640, 640) list
1769329 33.9 0.875 17.22 14.14 (8, 3, 640, 640) list
1769329 67.79 1.705 16.76 22.1 (16, 3, 640, 640) list
AutoBatch: Using batch-size 54 for CUDA:0 5.81G/7.79G (75%) ✅
nvtop:
GPU[||||||||||||||||||||||||||||||||90%] MEM[||||||||||||||||||||7.681Gi/8.000Gi]
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
0/399 6.19G 0.08888 0.04061 0.0315 267 640: 37%|███▋ | 176/472 [00:41<01:22, 3.59it/s]
It ran a bit longer, but then failed. Still rather odd that it ran for over 10 sec. Usually it fails instantly when it runs out of vram.
And last try with 70%
Transferred 343/349 items from yolov5n.pt
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 0.04G reserved, 0.01G allocated, 7.74G free
Params GFLOPs GPU_mem (GB) forward (ms) backward (ms) input output
1769329 4.237 0.115 32.82 14.39 (1, 3, 640, 640) list
1769329 8.474 0.218 12.9 10.98 (2, 3, 640, 640) list
1769329 16.95 0.419 13.29 11.3 (4, 3, 640, 640) list
1769329 33.9 0.875 13.56 14.32 (8, 3, 640, 640) list
1769329 67.79 1.705 16.73 21.93 (16, 3, 640, 640) list
AutoBatch: Using batch-size 50 for CUDA:0 5.38G/7.79G (69%) ✅
nvtop:
GPU[||||||||||||||||||||||||||||||| 89%] MEM[||||||||||||||||||||7.441Gi/8.000Gi]
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
0/399 5.94G 0.08148 0.04051 0.0283 311 640: 55%|█████▍ | 278/510 [00:57<00:44, 5.21it/s]
nvtop:
MEM[||||||||||||||||||||6.873Gi/8.000Gi]
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
1/399 5.32G 0.05974 0.03991 0.01754 242 640: 54%|█████▎ | 274/510 [00:52<00:46, 5.11it/s]
Seems to run. Initial vram usage is suspiciously high and causes this problem but then goes down in the next iterations. (but I observed the peak vram usage before, too, I just don't remember it causing so many problems to autobatch.)
Anyway, these 8GB cards will just have to work with a lower fraction, no other way around that. Now you need to add auto-fraction (depending on available vram), too 🤣
Update on the last runs: those results may be invalid. I just checked htop and dmesg, and it looks like that was actually a system OOM, so I may have run out of normal RAM (and swap). That should not happen with the dataset I'm currently testing with, but I'll investigate. The 32GB of RAM in that machine were usually enough for that dataset, and I am not sure why it started to use additional swap.
Dataset itself is around 24gb cached in ram. All in all the ram usage was 32gb, now it is using 8-16GB of swap, too. fraction 0.8 still fails, but fraction 0.75 works with more swap.
Were there any other changes that could affect that ? (Also the swap is slowly rising/filling during the first 1-2 epochs).
@alexk-ede dataset caching is independent of CUDA usage, it either uses RAM or disk space.
Yes, I know, I'm using the --cache option to use RAM. Otherwise the CPU load is just insane and the CPU can't keep up with the GPU. Anyway, I need to investigate what changed bc that additional RAM/swap usage didn't happen before.
Hello @alexk-ede ,
I cannot check at the moment because I am doing a training on release 6.1 at the moment (no clearML and got deallocated overnight so I miss the original logs at the beginning).
Have you tried something other than yolov5n (yolov5m for example) on that slice of COCO, though? Is it actually representative of your dataset / use case? I do remember that when mucking about with Coco128 I actually crashed my training with autobatch and a yolov5n, for instance.
@Denizzje yeah I tried various yolov5 sizes, mostly n, s, m, (sometimes l just for testing). But I never wrote down how much they consumed during training so I'm going to make some tests and a table after the current training round is finished.
Search before asking
Question
So I'm testing the autobatch feature which is pretty cool. It seemed to work fine last week, but this week for whatever reason (maybe bc it's Monday, who knows ...) I'm having issues with it.
I'm running the yolov5s (latest git checkout ofc) and getting this (when using --batch -1) Dataset is a slice from COCO
Meanwhile, the nvtop output is this before running the train.py So there isn't really anything in the GPU memory.
I am unsure about this from AutoBatch
The 2.20G reserved is weird, because I stopped everything (including gdm3), so nothing is running on the GPU. (besides the training process later).
And I can easily set batch to 80 and it works fine:
I obviously did the recommended restart environment and even restarted the machine. Autobatch still complained about around 2.20G reserved
Any ideas how I can investigate this ?
My guess is, the 2.2GB do mess up the interpolation for autobatch because the GPU_mem (GB) column doesn't make much sense.
Additional
Maybe the issue title should be changed to AutoBatch: CUDA anomaly detected
some additional system info
So not sure where the rest went (aka the difference to the 7.2GB in nvtop) ...