ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

AutoBatch: CUDA anomaly detected #9287

Closed alexk-ede closed 2 years ago

alexk-ede commented 2 years ago

Search before asking

Question

So I'm testing the autobatch feature, which is pretty cool. It seemed to work fine last week, but this week, for whatever reason (maybe because it's Monday, who knows...), I'm having issues with it.

I'm running yolov5s (latest git checkout, of course) with --batch -1 and getting the output below. The dataset is a slice of COCO.

AutoBatch: Computing optimal batch size for --imgsz 416
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 2.20G reserved, 0.05G allocated, 5.54G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
     7027720       6.744         2.414         27.87         35.49        (1, 3, 416, 416)                    list
     7027720       13.49         1.378         23.52         50.14        (2, 3, 416, 416)                    list
     7027720       26.98         1.380          23.8         56.75        (4, 3, 416, 416)                    list
     7027720       53.95         0.648         22.86         71.21        (8, 3, 416, 416)                    list
     7027720       107.9         1.330         26.38         91.88       (16, 3, 416, 416)                    list
AutoBatch: WARNING: ⚠️ CUDA anomaly detected, recommend restart environment and retry command.
AutoBatch: Using batch-size 16 for CUDA:0 0.96G/7.79G (12%) ✅

Meanwhile, this is the nvtop output before running train.py, so there isn't really anything in GPU memory:

Device 0 [NVIDIA GeForce RTX 3070] PCIe GEN 1@16x RX: 0.000 KiB/s TX: 0.000 KiB/s
 GPU 210MHz  MEM 405MHz  TEMP  53°C FAN  38% POW  19 / 220 W
 GPU[                                 0%] MEM[|                   0.208Gi/8.000Gi]

I am unsure about this line from AutoBatch:

7.79G total, 2.20G reserved, 0.05G allocated, 5.54G free

The 2.20G reserved is weird, because I stopped everything (including gdm3), so nothing is running on the GPU (besides the training process later).

And I can easily set batch to 80 and it works fine:

 Device 0 [NVIDIA GeForce RTX 3070] PCIe GEN 3@16x RX: 30.27 MiB/s TX: 8.789 MiB/s
 GPU 1905MHz MEM 6800MHz TEMP  68°C FAN  63% POW 199 / 220 W
 GPU[||||||||||||||||||||||||||||||||90%] MEM[||||||||||||||||||||7.319Gi/8.000Gi]
    PID USER DEV    TYPE  GPU        GPU MEM    CPU  HOST MEM Command
   6404 user   0 Compute  91%   7237MiB  88%   105%  14616MiB python train.py --img 416 --batch 80 --epochs 400  --cache --weights yolov5s.pt --data ...

I obviously did the recommended environment restart and even rebooted the machine. AutoBatch still reported around 2.20G reserved.

Any ideas how I can investigate this?

My guess is that the 2.2 GB reserved messes up the interpolation for AutoBatch, because the GPU_mem (GB) column doesn't make much sense (a small numeric example follows the table):

  GPU_mem (GB)       input 
   2.414       (1, 3, 416, 416)
   1.378       (2, 3, 416, 416)
   1.380       (4, 3, 416, 416)
   0.648       (8, 3, 416, 416) 
   1.330       (16, 3, 416, 416)
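To make the guess concrete, here is a rough sketch (not the actual AutoBatch code, and the "clean" numbers are made up for contrast) of how a first-degree fit over these readings goes wrong when a large constant reserved chunk dominates the measurements:

import numpy as np

batch_sizes = [1, 2, 4, 8, 16]

# Hypothetical "clean" measurements: memory grows roughly linearly with batch size.
clean = [0.35, 0.48, 0.78, 1.28, 2.37]
# My measurements above: dominated by a ~2 GB floor, so the trend is flat/noisy.
noisy = [2.414, 1.378, 1.380, 0.648, 1.330]

free_gb, fraction = 5.54, 0.8  # "free" value from the AutoBatch banner above, default fraction

for name, y in (("clean", clean), ("noisy", noisy)):
    slope, intercept = np.polyfit(batch_sizes, y, deg=1)   # first-degree polynomial fit
    b = int((free_gb * fraction - intercept) / slope)      # solve the fit for the target memory
    print(f"{name}: slope {slope:.3f} GB/image -> suggested batch size {b}")

# With the noisy data the slope comes out near zero or even negative, so the solved
# batch size is nonsense, which is presumably what trips the anomaly check.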

Additional

nvidia-smi
Mon Sep  5 16:22:01 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
github-actions[bot] commented 2 years ago

👋 Hello @alexk-ede, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Denizzje commented 2 years ago

Hi and happy Monday to you.

I use AutoBatch quite frequently and also just updated my local yolov5 today, so I took a look.

GTX 1080 (8GB), PyTorch 1.12, Nvidia driver 515, Cuda 11.7, Fedora 36

When running a training similar to yours, with input size 416 on COCO128 (I assume that's what you mean by a slice of COCO), I too get the warning.

Transferred 481/481 items from yolov5m.pt
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 416
AutoBatch: CUDA:0 (NVIDIA GeForce GTX 1080) 7.92G total, 2.53G reserved, 0.16G allocated, 5.24G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
    21190557        20.8         2.816         34.12         119.1        (1, 3, 416, 416)                    list
    21190557       41.61         2.852         31.86         149.8        (2, 3, 416, 416)                    list
    21190557       83.21         2.842         30.95         173.2        (4, 3, 416, 416)                    list
    21190557       166.4         2.772         46.22         201.2        (8, 3, 416, 416)                    list
    21190557       332.9         2.741         82.16         318.4       (16, 3, 416, 416)                    list
AutoBatch: WARNING: ⚠️ CUDA anomaly detected, recommend restart environment and retry command.
AutoBatch: Using batch-size 16 for CUDA:0 2.74G/7.92G (35%) ✅
optimizer: SGD(lr=0.01) with parameter groups 79 weight(decay=0.0), 82 weight(decay=0.0005), 82 bias
train: Scanning '/home/xxx/ai_dev/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]                                                      
val: Scanning '/home/xxx/ai_dev/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]                                                        

AutoAnchor: 4.18 anchors/target, 0.977 Best Possible Recall (BPR). Anchors are a poor fit to dataset ⚠️, attempting to improve...
AutoAnchor: WARNING: Extremely small objects found: 16 of 929 labels are < 3 pixels in size
AutoAnchor: Running kmeans for 9 anchors on 927 points...
AutoAnchor: Evolving anchors with Genetic Algorithm: fitness = 0.6699: 100%|██████████| 1000/1000 [00:00<00:00, 1421.07it/s]                                                                                                                 
AutoAnchor: thr=0.25: 0.9935 best possible recall, 3.75 anchors past thr
AutoAnchor: n=9, img_size=416, metric_all=0.263/0.670-mean/best, past_thr=0.477-mean: 6,9, 16,14, 21,35, 55,47, 70,94, 80,188, 190,139, 216,249, 388,283
AutoAnchor: Done ✅ (optional: update model *.yaml to use these anchors in the future)
Plotting labels to runs/train/exp7/labels.jpg... 
Image sizes 416 train, 416 val
Using 8 dataloader workers
Logging results to runs/train/exp7
Starting training for 300 epochs...

When I don't specify an input size and the default 640 is used, I do not get the CUDA anomaly warning.

AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce GTX 1080) 7.92G total, 2.53G reserved, 0.16G allocated, 5.24G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
    21190557       49.24         2.802         46.01         252.1        (1, 3, 640, 640)                    list
    21190557       98.48         2.751         36.03           286        (2, 3, 640, 640)                    list
    21190557         197         2.724         50.43         334.9        (4, 3, 640, 640)                    list
    21190557       393.9         2.810         94.75         416.9        (8, 3, 640, 640)                    list
    21190557       787.8         5.492         180.2         560.2       (16, 3, 640, 640)                    list
AutoBatch: Using batch-size 11 for CUDA:0 4.18G/7.92G (53%) ✅
optimizer: SGD(lr=0.01) with parameter groups 79 weight(decay=0.0), 82 weight(decay=0.000515625), 82 bias
train: Scanning '/home/xxx/ai_dev/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]                                                                                                                                      
val: Scanning '/home/xxx/ai_dev/datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupt: 100%|██████████| 128/128 [00:00<?, ?it/s]                                                                                                                                        

AutoAnchor: 4.27 anchors/target, 0.994 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to runs/train/exp6/labels.jpg... 
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp6
Starting training for 300 epochs...

I too have this random chunk of 2.53G in my GPU memory. I don't know what it is either, but it does not match my usage in nvitop before training starts (around 500-600 MB, with the GNOME desktop and Xorg running). Checking back on AutoBatch trainings from the initial release of v6.2, I do see:

AutoBatch: Computing optimal batch size for --imgsz 1280
AutoBatch: CUDA:0 (NVIDIA A100-SXM-80GB) 79.35G total, 0.11G reserved, 0.10G allocated, 79.14G free
1660813541500 os-nvme-hbwt3yi6-2a100-44v-fin1 info       Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
    12403204       65.97         0.872         29.57         20.63      (1, 3, 1280, 1280)                    list
    12403204       131.9         1.629         26.51         22.46      (2, 3, 1280, 1280)                    list
    12403204       263.9         3.181         31.44         28.91      (4, 3, 1280, 1280)                    list
1660813542584 os-nvme-hbwt3yi6-2a100-44v-fin1 info     12403204       527.8         5.981         33.11         43.87      (8, 3, 1280, 1280)                    list
    12403204        1056        12.092         54.42         79.66     (16, 3, 1280, 1280)                    list
1660813543232 os-nvme-hbwt3yi6-2a100-44v-fin1 error AutoBatch: Using batch-size 95 for CUDA:0 70.93G/79.35G (89%) ✅
optimizer: SGD(lr=0.01) with parameter groups 75 weight(decay=0.0), 79 weight(decay=0.0007421875), 79 bias
alexk-ede commented 2 years ago

Hi @Denizzje, and a happy start of the week to you, too. The Monday wasn't so bad actually ;)

I'll add another data point today, this time with yolov5m:

Transferred 475/481 items from yolov5m.pt
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 2.52G reserved, 0.16G allocated, 5.11G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
    20879400       48.25         2.798         33.85         180.3        (1, 3, 640, 640)                    list
    20879400       96.49         2.749         14.16         186.3        (2, 3, 640, 640)                    list
    20879400         193         2.718         20.15         218.2        (4, 3, 640, 640)                    list
    20879400         386         2.720            37         282.9        (8, 3, 640, 640)                    list
    20879400       771.9         5.207         71.34         334.4       (16, 3, 640, 640)                    list
AutoBatch: Using batch-size 11 for CUDA:0 4.01G/7.79G (51%) ✅

But manually setting a batch size of 16 works fine.

So yeah, it looks like the 2.52G reserved does interfere: it distorts the per-batch memory measurements and then makes the interpolation invalid. Maybe it would be better to go by the 0.16G allocated instead, because that is also basically what nvtop is showing.

Instead of just saying "anomaly detected", it would also be useful to hint that the initial reserved VRAM usage is suspiciously high.
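For illustration, a check along these lines is what I have in mind (the helper name and thresholds are mine, not AutoBatch's):

import torch

def warn_if_reserved_suspicious(device=0, ratio=4.0, floor_gb=1.0):
    # Hypothetical pre-check: flag a reserved pool that dwarfs the allocated memory
    # before profiling even starts, since it skews the baseline of the measurements.
    if not torch.cuda.is_available():
        return
    gb = 1 << 30
    reserved = torch.cuda.memory_reserved(device) / gb
    allocated = torch.cuda.memory_allocated(device) / gb
    if reserved > floor_gb and reserved > ratio * max(allocated, 0.01):
        print(f'WARNING: {reserved:.2f}G reserved vs {allocated:.2f}G allocated '
              f'before profiling, initial reserved VRAM looks suspiciously high')

warn_if_reserved_suspicious()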

Update: I forgot to mention that for your 640 test, despite there being no warning, the AutoBatch result is still not optimal either, as the first four measurements are all around 2.8 GB:

AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce GTX 1080) 7.92G total, 2.53G reserved, 0.16G allocated, 5.24G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
    21190557       49.24         2.802         46.01         252.1        (1, 3, 640, 640)                    list
    21190557       98.48         2.751         36.03           286        (2, 3, 640, 640)                    list
    21190557         197         2.724         50.43         334.9        (4, 3, 640, 640)                    list
    21190557       393.9         2.810         94.75         416.9        (8, 3, 640, 640)                    list
    21190557       787.8         5.492         180.2         560.2       (16, 3, 640, 640)                    list
AutoBatch: Using batch-size 11 for CUDA:0 4.18G/7.92G (53%) ✅
glenn-jocher commented 2 years ago

@alexk-ede AutoBatch may produce inaccurate results under certain circumstances, i.e. when previous trainings are in progress or have terminated early or not all CUDA memory has been released. If you find ways to improve it, please let us know; the relevant code is here: https://github.com/ultralytics/yolov5/blob/master/utils/autobatch.py

alexk-ede commented 2 years ago

Yes, I found that file while investigating where the warning message comes from. That's where I learned about the interpolation, too.

i.e. when previous trainings are in progress or have terminated early or not all CUDA memory has been released.

I could understand that; if something had been using the GPU before, that would be plausible.

But as I said, this is a completely fresh boot and a fresh start of the environment. Nothing was run before, and there are obviously no trainings in progress, as it says 0.16G allocated. I still have absolutely no clue where this 2.53G reserved comes from (or what it means). Sure, there is some documentation, but it's not very helpful: https://pytorch.org/docs/stable/generated/torch.cuda.memory_reserved.html

This one looks like it makes more sense: https://pytorch.org/docs/stable/generated/torch.cuda.memory_allocated.html

I'd obviously prefer to have a command that gives me the same output as nvtop.

I'll try using memory_allocated here instead later: https://github.com/ultralytics/yolov5/blob/1aea74cddbc78e7f79dac07090cb157dfc24dbcc/utils/torch_utils.py#L189, as it's called by AutoBatch here: https://github.com/ultralytics/yolov5/blob/1aea74cddbc78e7f79dac07090cb157dfc24dbcc/utils/autobatch.py#L51
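For completeness, here is a small snippet of how I think one could get a device-wide number closer to nvtop from inside torch, assuming a recent PyTorch that exposes torch.cuda.mem_get_info (which wraps cudaMemGetInfo); the allocator counters only cover this process:

import torch

if torch.cuda.is_available():
    gb = 1 << 30
    # Allocator-level counters: only what this PyTorch process has asked for.
    allocated = torch.cuda.memory_allocated(0) / gb
    reserved = torch.cuda.memory_reserved(0) / gb
    # Driver-level view: free/total across all processes, closer to nvtop / nvidia-smi.
    free_b, total_b = torch.cuda.mem_get_info(0)
    used = (total_b - free_b) / gb
    print(f'allocated {allocated:.2f}G, reserved {reserved:.2f}G, '
          f'device-wide used {used:.2f}G of {total_b / gb:.2f}G')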

alexk-ede commented 2 years ago

I decided to test what happens when I run this during a training session that already uses most of the GPU memory.

def autobatch(model, imgsz=640, fraction=0.8, batch_size=16):
    # Automatically estimate best batch size to use `fraction` of available CUDA memory
    # Usage:
    #     import torch
    #     from utils.autobatch import autobatch
    #     model = torch.hub.load('ultralytics/yolov5', 'yolov5s', autoshape=False)
    #     print(autobatch(model))

It turns out it fails on the first try, and the results list never gets initialized at all.


In [4]: print(autobatch(model))
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 0.04G reserved, 0.03G allocated, 7.73G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 7.79 GiB total capacity; 37.38 MiB already allocated; 6.06 MiB free; 56.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 7.79 GiB total capacity; 46.76 MiB already allocated; 6.06 MiB free; 56.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
AutoBatch: CUDA out of memory. Tried to allocate 38.00 MiB (GPU 0; 7.79 GiB total capacity; 28.01 MiB already allocated; 26.06 MiB free; 36.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 print(autobatch(model))

File ~/repo/yolov5/utils/autobatch.py:56, in autobatch(model, imgsz, fraction, batch_size)
     53     LOGGER.warning(f'{prefix}{e}')
     55 # Fit a solution
---> 56 y = [x[2] for x in results if x]  # memory [2]
     57 p = np.polyfit(batch_sizes[:len(y)], y, deg=1)  # first degree polynomial fit
     58 b = int((f * fraction - p[1]) / p[0])  # y intercept (optimal batch size)

UnboundLocalError: local variable 'results' referenced before assignment

But in this case the 0.03G allocated is also completely wrong, because the real usage is GPU[||||||||||||||||||||||||||||||||92%] MEM[||||||||||||||||||||7.975Gi/8.000Gi].

So, as planned, I'll try memory_allocated instead later, but it also yields weird results. Now I see where the main problem is...
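For what it's worth, here is a minimal re-sketch of the fitting step with a guard (my names, not a patch of the actual file), so a failed first measurement falls back instead of hitting the UnboundLocalError:

import numpy as np

def fit_batch_size(batch_sizes, results, free_gb, fraction=0.8, fallback=16):
    # 'results' holds one (params, gflops, mem_gb, fwd_ms, bwd_ms) tuple per tested
    # batch size, or None / missing entries when a measurement raised (e.g. CUDA OOM).
    results = list(results or [])             # never reference an unset variable
    y = [r[2] for r in results if r]          # memory column, index 2 as in autobatch.py
    if len(y) < 2:                            # not enough points for a first-degree fit
        return fallback
    slope, intercept = np.polyfit(batch_sizes[:len(y)], y, deg=1)
    return max(1, int((free_gb * fraction - intercept) / slope))

# Profiling failed entirely on a nearly full GPU -> fall back to the default:
print(fit_batch_size([1, 2, 4, 8, 16], [], free_gb=7.73))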

alexk-ede commented 2 years ago

This is quite weird. I just quickly tested this demo code while a training was running and using 6.6 GB of VRAM.

import torch

print(torch.__version__)
my_tensor = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32, device="cpu")
print(my_tensor)
torch.cuda.is_available()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print('Using device:', device)
print()

print("torch.cuda.is_available()")
print(torch.cuda.is_available())

# Additional info when using CUDA
if device.type == 'cuda':
    print("torch.cuda.current_device()")
    print(torch.cuda.current_device())

    print("torch.cuda.device(0)")
    print(torch.cuda.device(0))

    print("torch.cuda.get_device_name(0)")
    print(torch.cuda.get_device_name(0))

    print()
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')

Instead, I now only get this:

<torch.cuda.device object at 0x7f24a6bb09d0>
torch.cuda.get_device_name(0)
NVIDIA GeForce RTX 3070

Memory Usage:
Allocated: 0.0 GB
Cached:    0.0 GB

So I don't know how to fix that using only the interface that torch provides.

The command nvidia-smi --query-gpu=memory.used --format=csv,nounits,noheader outputs the exact current usage that nvtop also shows. So sure, one could add some functions like these to extract the GPU memory usage, but that's not a very clean solution either: https://www.programcreek.com/python/?CodeExample=get+gpu+memory
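Something like this, for example (just a sketch that shells out to the nvidia-smi command above, so it needs the driver tools on PATH):

import subprocess

def gpu_mem_used_mib(index=0):
    # Device-wide used memory in MiB, i.e. the same number nvtop shows,
    # one line of output per GPU.
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,nounits,noheader'],
        encoding='utf-8')
    return int(out.splitlines()[index])

print(f'{gpu_mem_used_mib(0)} MiB currently in use on GPU 0')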

Denizzje commented 2 years ago

There does seem to be something very wrong with the automatic batch size at the moment. I believe it started after this "CUDA anomaly detected" check was implemented, though I have not done many trainings since a big batch right after the release of 6.2.

This time I tried the latest YOLOv5 from master, PyTorch 1.12, Ubuntu 20.04, Python 3.8 with NVIDIA driver 515 and CUDA 11.7 on an A100 80GB SXM GPU. The dataset is my regular dataset of ~40k training pictures this time, so not COCO128. It spits out the CUDA anomaly warning and then proceeds with a batch size of 16...

AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA A100-SXM4-80GB) 79.21G total, 4.85G reserved, 0.16G allocated, 74.20G free

    2022-09-16 15:14:19       Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
    2022-09-16 15:14:20     20964261       48.52         5.283           101         133.7        (1, 3, 640, 640)                    list
    2022-09-16 15:14:22     20964261       97.03         5.232         22.21         245.8        (2, 3, 640, 640)                    list
    2022-09-16 15:14:23     20964261       194.1         5.203         18.63         158.6        (4, 3, 640, 640)                    list
    2022-09-16 15:14:25     20964261       388.1         5.205         17.41         258.5        (8, 3, 640, 640)                    list
    2022-09-16 15:14:27     20964261       776.3         5.203          23.9         318.2       (16, 3, 640, 640)                    list
AutoBatch: WARNING: ⚠️ CUDA anomaly detected, recommend restart environment and retry command.
AutoBatch: Using batch-size 16 for CUDA:0 5.19G/79.21G (7%) ✅

optimizer: SGD(lr=0.01) with parameter groups 79 weight(decay=0.0), 82 weight(decay=0.0005), 82 bias

For reference, this was an earlier training on the same machine but with PyTorch 1.10 and CUDA 11.3 on YOLOv5 release 6.2 (not master from that time), with an earlier version of the same dataset (the size is roughly the same, though).

AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA A100-SXM-80GB) 79.35G total, 0.29G reserved, 0.27G allocated, 78.78G free
2022-08-18 08:26:33
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
    35397156       49.54         0.698         40.71         24.46        (1, 3, 640, 640)                    list
    35397156       99.07         1.028         34.32         23.26        (2, 3, 640, 640)                    list
    35397156       198.1         1.707         39.83         25.03        (4, 3, 640, 640)                    list
2022-08-18 08:26:34
    35397156       396.3         3.108         36.23         27.72        (8, 3, 640, 640)                    list
    35397156       792.6         5.895         46.38         39.48       (16, 3, 640, 640)                    list
2022-08-18 08:26:34
AutoBatch: Using batch-size 203 for CUDA:0 70.82G/79.35G (89%) ✅
optimizer: SGD(lr=0.01) with parameter groups 103 weight(decay=0.0), 107 weight(decay=0.0015859375), 107 bias

After my fixed-size training run (batch size 128) is finished, I will try to redo the AutoBatch run on YOLOv5 release 6.2. If this 80 GB card is convinced there as well that it can only fit a batch size of 16 in its memory, then the cause is somewhere else. I am also curious to see what happens if I retry PyTorch 1.10 with the latest master code.

glenn-jocher commented 2 years ago

@Denizzje yes I'm able to reproduce in Colab. Something is not correct. I'll add a TODO to investigate.

glenn-jocher commented 2 years ago

@Denizzje traced to https://github.com/ultralytics/yolov5/commit/5d4787baabea694369ad95c7d762139eb9f04e56

glenn-jocher commented 2 years ago

@Denizzje good news 😃! Your original issue may now be fixed ✅ in PR #9448. This avoids setting cudnn.benchmark=True on init_seeds(), and also adds a check to AutoBatch that this setting is not in place. After these changes AutoBatch now works correctly:

Screenshot 2022-09-16 at 21 33 47
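For reference, the guard described above might look roughly like this (a paraphrase of the PR description, not its actual code):

import torch

def cudnn_benchmark_guard():
    # cudnn.benchmark=True makes cuDNN try several algorithms per input shape and keep
    # extra workspace cached, which skews the per-batch-size memory profile that
    # AutoBatch fits its line through, so bail out to the default batch size instead.
    if torch.backends.cudnn.benchmark:
        print('AutoBatch: WARNING: requires torch.backends.cudnn.benchmark=False, '
              'using default batch size')
        return False   # caller should fall back to the default batch size
    return True        # safe to profile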

To receive this update:

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

Denizzje commented 2 years ago

Awesome @glenn-jocher, did not expect this on a Friday evening hehe. "Unfortunately" the A100 is still training and my GTX 1080 really can't handle my dataset properly anymore, so I will wait until it's finished, then give it another try after pulling, and report back ASAP if it can find its memory this time ;).

Denizzje commented 2 years ago

Top of the morning, @glenn-jocher ,

Happy to confirm that the A100 is now convinced it actually has 80 GB of VRAM, and AutoBatch now gives me a batch size of 192. The "CUDA anomaly detected" warning is also gone. This was even a "dirty" start: I didn't open a new terminal or reboot the system after my previous training.

Transferred 475/481 items from yolov5m.pt
2022-09-17 10:27:12
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA A100-SXM4-80GB) 79.21G total, 0.25G reserved, 0.16G allocated, 78.80G free
2022-09-17 10:27:12
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
    20964261       48.52         0.543         58.52         23.66        (1, 3, 640, 640)                    list
    20964261       97.03         0.858          34.5         21.23        (2, 3, 640, 640)                    list
    20964261       194.1         1.571         33.96         22.95        (4, 3, 640, 640)                    list
2022-09-17 10:27:13
    20964261       388.1         2.917         35.03         25.31        (8, 3, 640, 640)                    list
2022-09-17 10:27:14
    20964261       776.3         5.415         35.45         35.17       (16, 3, 640, 640)                    list
2022-09-17 10:27:14
AutoBatch: Using batch-size 192 for CUDA:0 62.74G/79.21G (79%) ✅
optimizer: SGD(lr=0.01) with parameter groups 79 weight(decay=0.0), 82 weight(decay=0.0015), 82 bias

Glad to see this very useful function back in action, and thanks again for your quick work last night 😄. It's not my issue so I can't close it, but @alexk-ede is hopefully fine too after pulling the latest code from master.

glenn-jocher commented 2 years ago

@Denizzje great!

BTW we used to target 90% memory utilization but had some issues with smaller cards going over during training, which is why we dropped back to an 80% target. You can modify this fraction variable here: https://github.com/ultralytics/yolov5/blob/5e1a9553fbed73995c9b81e63ba41cc70fdf89de/utils/autobatch.py#L21-L28
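If you call autobatch directly (as in its docstring usage quoted earlier in this thread), you can also pass the fraction as an argument instead of editing the file; a quick sketch, run from inside a yolov5 checkout:

import torch
from utils.autobatch import autobatch  # requires the yolov5 repo on the path

model = torch.hub.load('ultralytics/yolov5', 'yolov5s', autoshape=False)
print(autobatch(model, imgsz=640, fraction=0.7))  # target ~70% of CUDA memory instead of the default 0.8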

alexk-ede commented 2 years ago

Hi everyone, looks like it's going to be a good Monday today ;)

And indeed, it seems to work fine right now.

Transferred 475/481 items from yolov5m.pt
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 416
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 0.24G reserved, 0.16G allocated, 7.39G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
    20883441       20.39         0.371         40.03         17.61        (1, 3, 416, 416)                    list
    20883441       40.78         0.482         20.08         13.69        (2, 3, 416, 416)                    list
    20883441       81.56         0.778         22.58         14.58        (4, 3, 416, 416)                    list
    20883441       163.1         1.277         21.24            19        (8, 3, 416, 416)                    list
    20883441       326.2         2.374         31.53         34.19       (16, 3, 416, 416)                    list
AutoBatch: Using batch-size 42 for CUDA:0 5.84G/7.79G (75%) ✅

MEM[||||||||||||||||||||7.711Gi/8.000Gi]

I'm just not sure where the (75%) ✅ comes from if fraction=0.8... I'll check whether there are some leftovers from my tests, but there shouldn't be, as I just checked out the latest master.

I'll have a few train runs to do soon, so I'll report back.

And yes, having it <= 80% makes sense, because I also noticed that despite GPU_mem showing 6.42G during the epoch, the actual used GPU memory is what nvtop reports: 7.711G. I guess this is some constant overhead, so I expect it to be less noticeable on bigger systems.

@Denizzje what does your nvtop report when you have 62.74G/79.21G (79%) ✅ ?

glenn-jocher commented 2 years ago

@alexk-ede 80% is the requested utilization, 75% is the predicted utilization (actual utilization will vary and is sometimes substantially different).

It's possible some of the difference comes from AutoBatch running against the free memory only, vs. the total memory displayed later.

glenn-jocher commented 2 years ago

@alexk-ede maybe I should re-add the allocated and reserved amounts to the predicted amount for the final utilization. This should be closer to 80%.

glenn-jocher commented 2 years ago

@alexk-ede good news 😃! Your original issue may now be fixed ✅ in PR #9491. This PR adds reserved and allocated memory to the final estimated utilization rate displayed, which should result in a value closer to the default requested 80%.
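As a back-of-the-envelope check using the RTX 3070 numbers from earlier in this thread (my arithmetic, not the PR's exact code), adding reserved and allocated back in moves the displayed figure from 75% to 80%:

total, reserved, allocated = 7.79, 0.24, 0.16   # GB, from the AutoBatch banner above
predicted_batch_mem = 5.84                      # GB predicted by the fit for batch size 42

old_estimate = predicted_batch_mem / total                             # ignores existing usage
new_estimate = (predicted_batch_mem + reserved + allocated) / total    # PR #9491 behaviour
print(f'displayed utilization: {old_estimate:.0%} -> {new_estimate:.0%} (requested fraction 0.8)')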

To receive this update:

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

alexk-ede commented 2 years ago

Hi, OK, well, it looks like the utilization is now too close ;) So here is an overview:

This one worked (but that was with PR https://github.com/ultralytics/yolov5/pull/9448 and before PR https://github.com/ultralytics/yolov5/pull/9491), and it was only at 416 resolution:

Transferred 343/349 items from yolov5n.pt
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 416
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 0.04G reserved, 0.01G allocated, 7.74G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
     1769329        1.79         0.069         24.43         10.13        (1, 3, 416, 416)                    list
     1769329        3.58         0.109         11.23          9.78        (2, 3, 416, 416)                    list
     1769329       7.161         0.187          12.8         9.658        (4, 3, 416, 416)                    list
     1769329       14.32         0.327         12.59          11.1        (8, 3, 416, 416)                    list
     1769329       28.64         0.707          13.2         13.58       (16, 3, 416, 416)                    list
AutoBatch: Using batch-size 146 for CUDA:0 6.24G/7.79G (80%) ✅

nvtop:
MEM[||||||||||||||||||||7.400Gi/8.000Gi]

GPU_mem   
5.95G   

But these ones failed, now with PR https://github.com/ultralytics/yolov5/pull/9491:

Transferred 343/349 items from yolov5n.pt
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 0.04G reserved, 0.01G allocated, 7.74G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
     1769329       4.237         0.115         33.26         13.79        (1, 3, 640, 640)                    list
     1769329       8.474         0.218         12.77         10.72        (2, 3, 640, 640)                    list
     1769329       16.95         0.419         13.02         11.39        (4, 3, 640, 640)                    list
     1769329        33.9         0.875         13.27         14.14        (8, 3, 640, 640)                    list
     1769329       67.79         1.705         16.68         21.99       (16, 3, 640, 640)                    list
AutoBatch: Using batch-size 58 for CUDA:0 6.23G/7.79G (80%) ✅

nvtop:
GPU[||||||||||||||||||||||||||||||||90%] MEM[||||||||||||||||||||7.742Gi/8.000Gi]

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      0/399      6.26G      0.106    0.03965     0.0384        340        640:  17%|█▋        | 76/439 [00:20<01:20,  4.50it/s] 

Then I tried setting the fraction to 0.75 instead:


Transferred 343/349 items from yolov5n.pt
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 0.04G reserved, 0.01G allocated, 7.74G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
     1769329       4.237         0.115         30.87         13.88        (1, 3, 640, 640)                    list
     1769329       8.474         0.218         13.05         10.97        (2, 3, 640, 640)                    list
     1769329       16.95         0.419         13.39         11.45        (4, 3, 640, 640)                    list
     1769329        33.9         0.875         17.22         14.14        (8, 3, 640, 640)                    list
     1769329       67.79         1.705         16.76          22.1       (16, 3, 640, 640)                    list
AutoBatch: Using batch-size 54 for CUDA:0 5.81G/7.79G (75%) ✅

nvtop:
 GPU[||||||||||||||||||||||||||||||||90%] MEM[||||||||||||||||||||7.681Gi/8.000Gi]

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      0/399      6.19G    0.08888    0.04061     0.0315        267        640:  37%|███▋      | 176/472 [00:41<01:22,  3.59it/s] 

It ran a bit longer but then failed. Still rather odd that it ran for over 10 seconds; usually it fails instantly when it runs out of VRAM.

And a last try with a fraction of 0.70:


Transferred 343/349 items from yolov5n.pt
AMP: checks passed ✅
AutoBatch: Computing optimal batch size for --imgsz 640
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3070) 7.79G total, 0.04G reserved, 0.01G allocated, 7.74G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
     1769329       4.237         0.115         32.82         14.39        (1, 3, 640, 640)                    list
     1769329       8.474         0.218          12.9         10.98        (2, 3, 640, 640)                    list
     1769329       16.95         0.419         13.29          11.3        (4, 3, 640, 640)                    list
     1769329        33.9         0.875         13.56         14.32        (8, 3, 640, 640)                    list
     1769329       67.79         1.705         16.73         21.93       (16, 3, 640, 640)                    list
AutoBatch: Using batch-size 50 for CUDA:0 5.38G/7.79G (69%) ✅

nvtop:
GPU[||||||||||||||||||||||||||||||| 89%] MEM[||||||||||||||||||||7.441Gi/8.000Gi]

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      0/399      5.94G    0.08148    0.04051     0.0283        311        640:  55%|█████▍    | 278/510 [00:57<00:44,  5.21it/s]

nvtop:
MEM[||||||||||||||||||||6.873Gi/8.000Gi]

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      1/399      5.32G    0.05974    0.03991    0.01754        242        640:  54%|█████▎    | 274/510 [00:52<00:46,  5.11it/s]

This seems to run. The initial VRAM usage is suspiciously high and causes this problem, but it then goes down in the following iterations. (I have observed this peak VRAM usage before, too; I just don't remember it causing so many problems for AutoBatch.)

Anyway, these 8 GB cards will just have to work with a lower fraction; there's no way around that. Now you need to add an auto-fraction (depending on available VRAM), too 🤣

alexk-ede commented 2 years ago

Update to the last runs: those results may be invalid. I just checked htop and dmesg, and it looks like that was actually a system OOM; I may have run out of normal RAM (and swap). That should not happen with the dataset I'm currently testing with, but I'll investigate. The 32 GB of RAM on that machine were usually enough for this dataset; I'm not sure why it started to use additional swap.

The dataset itself is around 24 GB cached in RAM. All in all, RAM usage was 32 GB, and now it is using 8-16 GB of swap, too. fraction=0.8 still fails, but fraction=0.75 works with more swap.

Were there any other changes that could affect that? (Also, the swap is slowly filling during the first 1-2 epochs.)

glenn-jocher commented 2 years ago

@alexk-ede dataset caching is independent of CUDA usage; it uses either RAM or disk space.

alexk-ede commented 2 years ago

Yes, I know; I'm using the --cache option to cache in RAM. Otherwise the CPU load is just insane and the CPU can't keep up with the GPU. Anyway, I need to investigate what changed, because that additional RAM/swap usage didn't happen before.

Denizzje commented 2 years ago

Hello @alexk-ede ,

I cannot check at the moment because I am running a training on release 6.1 right now (no ClearML, and the machine got deallocated overnight, so I am missing the original logs from the beginning).

Have you tried something other than yolov5n (yolov5m, for example) on that slice of COCO? Is it actually representative of your dataset / use case? I do remember that when mucking about with COCO128 I actually crashed a training with AutoBatch and yolov5n, for instance.

alexk-ede commented 2 years ago

@Denizzje yeah, I tried various YOLOv5 sizes, mostly n, s, and m (sometimes l, just for testing). But I never wrote down how much they consumed during training, so I'm going to run some tests and make a table after the current training round is finished.