ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Precision and Recall is zero during training #4242

Closed ghost closed 2 years ago

ghost commented 3 years ago

❔ Question

I have 25 images; I used 15 for training and 10 for validation, but my validation precision and recall are 0. My images are satellite images like the following:

[image: April 2020]

With label image:

[image: rd]

Additional context

 Epoch   gpu_mem       box       obj       cls    labels  img_size
    12/149     7.59G   0.05084    0.1359         0      4486      2016: 100% 8/8 [00:09<00:00,  1.13s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100% 3/3 [00:01<00:00,  2.69it/s]
                 all         10          0          0          0          0          0

     Epoch   gpu_mem       box       obj       cls    labels  img_size
    13/149     7.59G   0.05089    0.1082         0      2162      2016: 100% 8/8 [00:09<00:00,  1.13s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100% 3/3 [00:01<00:00,  2.66it/s]
                 all         10          0          0          0          0          0
glenn-jocher commented 3 years ago

@hammadyounas2008 👋 Hello! Thanks for asking about improving YOLOv5 🚀 training results.

Most of the time good results can be obtained with no changes to the models or training settings, provided your dataset is sufficiently large and well labelled. If at first you don't get good results, there are steps you might be able to take to improve, but we always recommend users first train with all default settings before considering any changes. This helps establish a performance baseline and spot areas for improvement.

If you have questions about your training results we recommend you provide the maximum amount of information possible if you expect a helpful response, including results plots (train losses, val losses, P, R, mAP), PR curve, confusion matrix, training mosaics, test results and dataset statistics images such as labels.png. All of these are located in your project/name directory, typically yolov5/runs/train/exp.

We've put together a full guide for users looking to get the best results on their YOLOv5 trainings below.

Dataset

[image: COCO Analysis]

Model Selection

Larger models like YOLOv5x and YOLOv5x6 will produce better results in nearly all cases, but have more parameters, require more CUDA memory to train, and are slower to run. For mobile deployments we recommend YOLOv5s/m, for cloud deployments we recommend YOLOv5l/x. See our README table for a full comparison of all models.

[image: YOLOv5 Models]

Training Settings

Before modifying anything, first train with default settings to establish a performance baseline. A full list of train.py settings can be found in the train.py argparser.
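For example, a baseline run with all defaults looks something like this (coco128.yaml is the small sample dataset; substitute your own data YAML):

python train.py --data coco128.yaml --weights yolov5s.pt --img 640  # epochs, batch size and hyps left at defaults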

Further Reading

If you'd like to know more, a good place to start is Karpathy's 'Recipe for Training Neural Networks', which has great ideas for training that apply broadly across all ML domains: http://karpathy.github.io/2019/04/25/recipe/

ghost commented 3 years ago

Does that mean I have to label more data?

iceisfun commented 3 years ago

You probably should have at least 1,000 images.

ghost commented 3 years ago

I only have 50 images because each image contains a large number of objects. What should I do now?

sunmengnan commented 3 years ago

You should collect more data.

ghost commented 3 years ago

What about the detection of small objects?

WhXl commented 3 years ago

I'm getting the same issue as the original post! Did you fix it?

github-actions[bot] commented 2 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

RobinGRAPIN commented 2 years ago

I still have the same issue. I would like to at least overfit a small dataset before starting a training run on a large one, as in the YOLO example where the first 128 images of the COCO dataset are overfit. How can I make sure that the problem is the small dataset without spending hours of training?

glenn-jocher commented 2 years ago

@RobinGRAPIN it appears you may have environment problems. Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.9 environment, clone the latest repo (code changes daily), and run pip install -r requirements.txt again from scratch.

💡 ProTip! Try one of our verified environments below if you are having trouble with your local environment.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Models and datasets download automatically from the latest YOLOv5 release when first requested.
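A quick end-to-end smoke test of the environment is to run inference on the repo's bundled sample images; if this completes cleanly, dependency problems are unlikely:

python detect.py --weights yolov5s.pt --source data/images  # weights download automatically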

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

[badge: CI CPU testing]

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

aash1999 commented 11 months ago

Hi @glenn-jocher

I am facing the same issue, where my precision and recall are 0.

But in my case I changed the YOLOv5m architecture as shown below:

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  2    115712  models.common.C3                        [128, 128, 2]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  3    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]
  9                -1  1   1345042  models.common.CBAMBottleneck            [512, 512, 3]
 10                -1  1    656896  models.common.SPPF                      [512, 512, 5]
 11                -1  1    279616  models.common.Involution                [512, 512, 1, 1]
 12                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 13                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 14           [-1, 6]  1         0  models.common.Concat                    [1]
 15                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 16                -1  1     66048  models.common.Conv                      [256, 256, 1, 1]
 17                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 18           [-1, 4]  1         0  models.common.Concat                    [1]
 19                -1  1    329216  models.common.C3                        [384, 256, 1, False]
 20                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 21                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 22           [-1, 2]  1         0  models.common.Concat                    [1]
 23                -1  1     82688  models.common.C3                        [192, 128, 1, False]
 24                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 25          [-1, 19]  1         0  models.common.Concat                    [1]
 26                -1  1    329216  models.common.C3                        [384, 256, 1, False]
 27                -1  1    295168  models.common.Conv                      [256, 128, 3, 2]
 28          [-1, 15]  1         0  models.common.Concat                    [1]
 29                -1  1    107264  models.common.C3                        [384, 128, 1, False]
 30                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
 31          [-1, 11]  1         0  models.common.Concat                    [1]
 32                -1  1   1313792  models.common.C3                        [768, 512, 1, False]
 33  [23, 26, 29, 32]  1     46260  models.yolo.Detect                      [10, [[2.9434, 4.0435, 3.8626, 8.5592, 6.8534, 5.9391], [10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 128, 512]]
yolo5m-cbam-involution summary: 285 layers, 9335494 parameters, 9335494 gradients, 31.2 GFLOPs.

Accordingly, I changed common.py and yolo.py. Epochs run without any error, but P and R are zero. Am I doing something wrong?

Dataset I am using: VisDrone

Python packages:

absl-py==2.0.0
cachetools==5.3.1
certifi==2023.7.22
charset-normalizer==3.3.0
contourpy==1.1.1
cycler==0.12.1
filelock==3.12.4
fonttools==4.43.1
fsspec==2023.9.2
gitdb==4.0.10
GitPython==3.1.40
google-auth==2.23.3
google-auth-oauthlib==1.1.0
grpcio==1.59.0
idna==3.4
Jinja2==3.1.2
kiwisolver==1.4.5
Markdown==3.5
MarkupSafe==2.1.3
matplotlib==3.8.0
mpmath==1.3.0
networkx==3.2
numpy==1.26.1
oauthlib==3.2.2
opencv-python==4.8.1.78
packaging==23.2
pandas==2.1.1
Pillow==10.1.0
protobuf==4.23.4
psutil==5.9.6
py-cpuinfo==9.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pyparsing==3.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
scipy==1.11.3
seaborn==0.13.0
six==1.16.0
smmap==5.0.1
sympy==1.12
tensorboard==2.15.0
tensorboard-data-server==0.7.1
thop==0.1.1.post2209072238
torch==2.1.0
torchvision==0.16.0
tqdm==4.66.1
typing_extensions==4.8.0
tzdata==2023.3
ultralytics==8.0.200
urllib3==2.0.7
Werkzeug==3.0.0

Python version: 3.10.9. Thanks in advance.

glenn-jocher commented 10 months ago

@aash1999 it seems like you've made quite a few changes to the YOLOv5m architecture, which can have a significant impact on performance metrics, especially precision and recall. Please keep in mind this is an advanced customization and may require careful debugging.
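For context, a custom block wired into models/common.py typically looks something like the minimal sketch below; SimpleChannelAttention is a hypothetical stand-in for your CBAMBottleneck/Involution modules, and the class must also be handled in yolo.py's parse_model and referenced in your model YAML:

import torch
import torch.nn as nn

class SimpleChannelAttention(nn.Module):
    # Hypothetical squeeze-and-excitation style gate; c1 = input channels, r = reduction ratio
    def __init__(self, c1, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),    # global average pool -> (B, c1, 1, 1)
            nn.Conv2d(c1, c1 // r, 1),  # squeeze
            nn.SiLU(),
            nn.Conv2d(c1 // r, c1, 1),  # excite
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)           # channel-wise reweighting, shape preserved

# quick smoke test
x = torch.randn(1, 512, 20, 20)
assert SimpleChannelAttention(512)(x).shape == x.shape

Because a block like this preserves input shape, it can be dropped between existing layers without changing the Concat channel arithmetic elsewhere in the YAML.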

First, I'd recommend checking the VisDrone dataset annotations to ensure they're in the correct YOLO format. Also, verify that the class labels are consistent across your dataset and the model configuration.
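As a rough sketch (paths and class count are illustrative, adjust to your setup), label files can be spot-checked for structural validity like this:

from pathlib import Path

LABEL_DIR = Path("datasets/VisDrone/labels/val")  # hypothetical location of your val labels
NC = 10                                           # number of classes in your data YAML

for f in sorted(LABEL_DIR.glob("*.txt")):
    for i, line in enumerate(f.read_text().splitlines(), 1):
        parts = line.split()
        if not parts:                             # skip blank lines
            continue
        assert len(parts) == 5, f"{f.name}:{i} expected 'class cx cy w h', got {len(parts)} fields"
        cls, *box = parts
        assert cls.isdigit() and int(cls) < NC, f"{f.name}:{i} bad class id {cls}"
        assert all(0.0 <= float(v) <= 1.0 for v in box), f"{f.name}:{i} coords must be normalized to 0-1"
print("labels look structurally OK")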

Furthermore, please note your Python package versions are a bit outdated and a mix of various branches. Upgrading to the latest YOLOv5 package version or using the Docker image should help confirm if the issue is related to the architecture changes or the environment.

Besides, Python 3.10.9 has not been fully verified with YOLOv5 yet. Please consider downgrading to a well-tested version in the supported range (Python>=3.7.0, e.g., 3.8 or 3.9) before proceeding, to avoid potential compatibility issues.

After addressing these points, if the issue persists, you could try to train the network without any architecture changes to establish a baseline detection performance and then gradually introduce modifications to better understand the impact.

In any case, I'd recommend ensuring that the dataset is suitable and well-prepared, and then verifying the model's performance on the original YOLOv5 architecture before introducing customizations.

Let me know if the issue persists after making these adjustments.

fyang064 commented 8 months ago

Hi @aash1999, just wondering if you have figured out the issue. I'd like to learn more if you are willing to share your experience. The original YOLOv5 architecture worked pretty well on my custom dataset before any customizations; however, P and R have been zero ever since I introduced the new loss.

glenn-jocher commented 8 months ago

Hi @fyang064,

Zero precision and recall after modifying the YOLOv5 architecture can be due to several reasons. Here are a few steps you can take to debug the issue:

  1. Sanity Check: Ensure that your modified model is capable of overfitting a very small dataset (e.g., 1-2 images). If it cannot, there might be an issue with the architecture changes.

  2. Data Loader: Verify that the data loader is correctly loading and preprocessing the images and labels. Check if the annotations are correct and match the input data.

  3. Learning Rate: Sometimes, if the learning rate is too high, the model may not learn effectively. Try reducing the learning rate.

  4. Loss Function: Confirm that the loss function is being calculated correctly and that gradients are flowing through the network as expected.

  5. Model Outputs: Inspect the raw outputs of the model to ensure they are sensible (e.g., not all zeros or NaNs).

  6. Backbone Pretraining: If you've introduced new layers or blocks (like CBAM or Involution), it might be beneficial to pretrain the backbone on a related task or dataset before fine-tuning on your target dataset.

  7. Batch Size: A very small batch size can sometimes lead to unstable training, especially with batch normalization layers.

  8. Anchor Boxes: If you've changed the architecture significantly, you might need to re-calculate the anchor boxes to better fit your dataset.

  9. Environment: As mentioned before, ensure your environment matches the requirements for YOLOv5. Python 3.10 is not officially supported, so consider using Python 3.7 or 3.8.

  10. Debugging Tools: Utilize debugging tools like printing shapes of tensors at various points, using PyTorch's torch.autograd.set_detect_anomaly(True), and visualizing feature maps (see the sketch below).

Remember, when modifying architectures, it's crucial to make changes incrementally and test at each step to isolate where the issue might be occurring. If you're still facing issues, consider reverting to the last known good configuration and reintroducing changes one at a time.
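Expanding on points 5 and 10 above, here is a minimal sketch of how raw head outputs can be inspected (this loads the stock yolov5s via torch.hub purely for illustration; substitute your modified model or checkpoint):

import torch

torch.autograd.set_detect_anomaly(True)  # enable before training runs to surface NaN/Inf in backward

model = torch.hub.load("ultralytics/yolov5", "yolov5s", autoshape=False)  # raw DetectionModel
model.eval()

x = torch.zeros(1, 3, 640, 640)          # dummy input batch
with torch.no_grad():
    pred = model(x)[0]                   # raw predictions, shape (1, num_predictions, 5 + nc)

print("shape:", tuple(pred.shape))
print("any NaN:", pred.isnan().any().item(), "| max |value|:", pred.abs().max().item())

All-zero or NaN values here point at the architecture or loss; sensible values point the search toward the validation pipeline instead. For point 8, note that YOLOv5's built-in AutoAnchor check runs at the start of training by default (disable with --noautoanchor), and utils/autoanchor.py's kmean_anchors() can be called directly to evolve dataset-specific anchors.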

fyang064 commented 8 months ago

Hi @glenn-jocher, I appreciate your kind help and supportive advice. I checked the code following your instructions and found that the output after going through the process_batch function becomes insensible, e.g., all zeros, during the validation process. I guess the issue (all zeros for P, R, and AP during validation) arose when I introduced a new loss function; however, the training losses look normal to me. Looking forward to your help!

glenn-jocher commented 8 months ago

Hi @fyang064,

If the process_batch function is producing all zeros during validation, it suggests that the model's predictions are not matching any ground truth labels, which would indeed result in zero precision and recall. Here are a few additional steps to consider:

  1. Loss Function: Double-check the implementation of your new loss function. Ensure that it's properly computing gradients and that it's compatible with the rest of the model. It's possible that the loss function works well during training but fails to generalize to validation data.

  2. Output Activation: Verify that the activation functions at the output layer are appropriate for the task. For instance, object detection typically requires a sigmoid activation for the objectness score and class probabilities, and a linear activation for bounding box regression.

  3. Thresholds: Check the confidence and IoU thresholds used during validation. If they are set too high, it might result in all detections being filtered out (see the example below).

  4. Data Augmentation: If you're using aggressive data augmentation, it might be overfitting to the training data and not generalizing well to the validation set. Try reducing or disabling augmentation to see if it affects the validation metrics.

  5. Validation Data: Ensure that the validation data is correctly labeled and that the labels are in the correct format. Also, confirm that the validation dataset is representative of the training data.

  6. Model Checkpoints: If you're loading weights from a checkpoint, ensure that the weights are compatible with the modified architecture.

  7. Debugging: Use debugging statements to print out the predictions and targets just before they are passed to the loss function during validation. This can help you identify if the issue is with the model predictions or the processing of data.

  8. Revert to Baseline: Temporarily revert to the original loss function and see if the validation metrics return to normal. This can help confirm whether the issue is with the new loss function.

  9. Gradual Changes: Introduce the new loss function gradually, starting with a weighted combination of the old and new losses, and monitor the effect on validation metrics.

  10. Consult the Community: If you're still stuck, consider reaching out to the community with details of your implementation. Sometimes, a fresh set of eyes can spot issues that are not immediately obvious.

Remember to make one change at a time and test thoroughly after each modification. This approach will help you isolate the problem more effectively. Good luck!
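For point 3 concretely, validation can be re-run with a deliberately low confidence threshold so that nothing is filtered out prematurely (paths here are illustrative):

python val.py --weights runs/train/exp/weights/best.pt --data VisDrone.yaml --conf-thres 0.001 --iou-thres 0.6

If detections show up at very low confidence but vanish at your usual threshold, the model is predicting, just not confidently.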

rashmisangwan commented 4 months ago

I'm also facing the same issue while using the Ghost network and the Coordinate Attention mechanism with YOLOv5. Previously, when I only updated the loss function of YOLOv5, I didn't encounter this issue.

glenn-jocher commented 4 months ago

Hi there!

It sounds like you're diving into some advanced customizations with YOLOv5, and that's awesome! 🚀 When incorporating complex structures like the Ghost network and the Coordinate Attention mechanism, it's crucial to ensure all parts are seamlessly integrated. If you've previously updated the loss function without issues but are now encountering problems, consider closely examining the interfaces between these components and YOLOv5's architecture.

A quick tip: Pay special attention to the shape and format of inputs and outputs at each modification point. Also, debugging prints can be very helpful to verify that data flows as expected through your modified network.
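For example, one minimal (and hypothetical) way to print the shape flowing out of every top-level layer is to attach forward hooks; shown here on the stock yolov5s via torch.hub, but the same works on a modified model:

import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", autoshape=False)

def report(name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, (list, tuple)) else output  # Detect returns a tuple in eval
        print(f"{name:28s} -> {tuple(out.shape)}")
    return hook

for i, layer in enumerate(model.model):  # model.model is the nn.Sequential of YAML layers
    layer.register_forward_hook(report(f"{i:02d} {layer.__class__.__name__}"))

model.eval()
with torch.no_grad():
    model(torch.zeros(1, 3, 640, 640))   # shapes print layer by layer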

If your loss values and training metrics look correct but validation suffers, it might be valuable to revisit the validation data preparation and ensure it's aligned with your model's expected input format.

Keep experimenting, and feel free to share snippets of your integration code for more targeted advice. Happy coding! 🚀