ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Precision and Recall is zero during training #4242

Closed ghost closed 2 years ago

ghost commented 3 years ago

❔ Question

I have 25 images; I used 15 for training and 10 for validation, but my validation precision and recall are 0. My images are satellite images like the following:

[image: April 2020]

With label image:

[image: rd]

Additional context

 Epoch   gpu_mem       box       obj       cls    labels  img_size
    12/149     7.59G   0.05084    0.1359         0      4486      2016: 100% 8/8 [00:09<00:00,  1.13s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100% 3/3 [00:01<00:00,  2.69it/s]
                 all         10          0          0          0          0          0

     Epoch   gpu_mem       box       obj       cls    labels  img_size
    13/149     7.59G   0.05089    0.1082         0      2162      2016: 100% 8/8 [00:09<00:00,  1.13s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100% 3/3 [00:01<00:00,  2.66it/s]
                 all         10          0          0          0          0          0
glenn-jocher commented 3 years ago

@hammadyounas2008 👋 Hello! Thanks for asking about improving YOLOv5 🚀 training results.

Most of the time good results can be obtained with no changes to the models or training settings, provided your dataset is sufficiently large and well labelled. If at first you don't get good results, there are steps you might be able to take to improve, but we always recommend users first train with all default settings before considering any changes. This helps establish a performance baseline and spot areas for improvement.

If you have questions about your training results we recommend you provide the maximum amount of information possible if you expect a helpful response, including results plots (train losses, val losses, P, R, mAP), PR curve, confusion matrix, training mosaics, test results and dataset statistics images such as labels.png. All of these are located in your project/name directory, typically yolov5/runs/train/exp.

We've put together a full guide for users looking to get the best results on their YOLOv5 trainings below.

Dataset

[image: COCO Analysis]

Model Selection

Larger models like YOLOv5x and YOLOv5x6 will produce better results in nearly all cases, but have more parameters, require more CUDA memory to train, and are slower to run. For mobile deployments we recommend YOLOv5s/m, for cloud deployments we recommend YOLOv5l/x. See our README table for a full comparison of all models.

[image: YOLOv5 Models]

Training Settings

Before modifying anything, first train with default settings to establish a performance baseline. A full list of train.py settings can be found in the train.py argparser.
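For example, a baseline run with all defaults looks something like this (coco128.yaml is the small sample dataset; substitute your own data YAML):

python train.py --data coco128.yaml --weights yolov5s.pt --img 640  # epochs, batch size and hyps left at defaults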

Further Reading

If you'd like to know more, a good place to start is Karpathy's 'Recipe for Training Neural Networks', which has great ideas for training that apply broadly across all ML domains: http://karpathy.github.io/2019/04/25/recipe/

ghost commented 3 years ago

Does that mean I have to label more data?

iceisfun commented 3 years ago

You probably should have at least 1,000 images.

ghost commented 3 years ago

I only have 50 images because each image contains a large number of objects. What should I do now?

sunmengnan commented 3 years ago

You should collect more data.

ghost commented 3 years ago

What about the detection of small objects?

WhXl commented 3 years ago

I'm getting the same issue as the original post! Did you fix it?

github-actions[bot] commented 2 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

RobinGRAPIN commented 2 years ago

I still have the same issue. I would like to at least overfit a small dataset before starting a training run on a large one, as in the YOLO example where the first 128 images of the COCO dataset are overfit. How can I make sure that the problem is the small dataset without spending hours of training?

glenn-jocher commented 2 years ago

@RobinGRAPIN it appears you may have environment problems. Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.9 environment, clone the latest repo (code changes daily), and run pip install -r requirements.txt again from scratch.

💡 ProTip! Try one of our verified environments below if you are having trouble with your local environment.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Models and datasets download automatically from the latest YOLOv5 release when first requested.
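A quick end-to-end smoke test of the environment is to run inference on the repo's bundled sample images; if this completes cleanly, dependency problems are unlikely:

python detect.py --weights yolov5s.pt --source data/images  # weights download automatically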

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

[badge: CI CPU testing]

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

aash1999 commented 11 months ago

Hi @glenn-jocher

I am facing the same issue, where my precision and recall are 0.

But in my case I changed the YOLOv5m architecture as shown below:

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  2    115712  models.common.C3                        [128, 128, 2]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  3    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]
  9                -1  1   1345042  models.common.CBAMBottleneck            [512, 512, 3]
 10                -1  1    656896  models.common.SPPF                      [512, 512, 5]
 11                -1  1    279616  models.common.Involution                [512, 512, 1, 1]
 12                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 13                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 14           [-1, 6]  1         0  models.common.Concat                    [1]
 15                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 16                -1  1     66048  models.common.Conv                      [256, 256, 1, 1]
 17                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 18           [-1, 4]  1         0  models.common.Concat                    [1]
 19                -1  1    329216  models.common.C3                        [384, 256, 1, False]
 20                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 21                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 22           [-1, 2]  1         0  models.common.Concat                    [1]
 23                -1  1     82688  models.common.C3                        [192, 128, 1, False]
 24                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 25          [-1, 19]  1         0  models.common.Concat                    [1]
 26                -1  1    329216  models.common.C3                        [384, 256, 1, False]
 27                -1  1    295168  models.common.Conv                      [256, 128, 3, 2]
 28          [-1, 15]  1         0  models.common.Concat                    [1]
 29                -1  1    107264  models.common.C3                        [384, 128, 1, False]
 30                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
 31          [-1, 11]  1         0  models.common.Concat                    [1]
 32                -1  1   1313792  models.common.C3                        [768, 512, 1, False]
 33  [23, 26, 29, 32]  1     46260  models.yolo.Detect                      [10, [[2.9434, 4.0435, 3.8626, 8.5592, 6.8534, 5.9391], [10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 128, 512]]
yolo5m-cbam-involution summary: 285 layers, 9335494 parameters, 9335494 gradients, 31.2 GFLOPs.

Accordingly, I changed common.py and yolo.py. Epochs run without any error, but P and R are zero. Am I doing something wrong?

Dataset I am using: VisDrone

Python packages:

absl-py==2.0.0
cachetools==5.3.1
certifi==2023.7.22
charset-normalizer==3.3.0
contourpy==1.1.1
cycler==0.12.1
filelock==3.12.4
fonttools==4.43.1
fsspec==2023.9.2
gitdb==4.0.10
GitPython==3.1.40
google-auth==2.23.3
google-auth-oauthlib==1.1.0
grpcio==1.59.0
idna==3.4
Jinja2==3.1.2
kiwisolver==1.4.5
Markdown==3.5
MarkupSafe==2.1.3
matplotlib==3.8.0
mpmath==1.3.0
networkx==3.2
numpy==1.26.1
oauthlib==3.2.2
opencv-python==4.8.1.78
packaging==23.2
pandas==2.1.1
Pillow==10.1.0
protobuf==4.23.4
psutil==5.9.6
py-cpuinfo==9.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pyparsing==3.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
scipy==1.11.3
seaborn==0.13.0
six==1.16.0
smmap==5.0.1
sympy==1.12
tensorboard==2.15.0
tensorboard-data-server==0.7.1
thop==0.1.1.post2209072238
torch==2.1.0
torchvision==0.16.0
tqdm==4.66.1
typing_extensions==4.8.0
tzdata==2023.3
ultralytics==8.0.200
urllib3==2.0.7
Werkzeug==3.0.0

Python version: 3.10.9. Thanks in advance.

glenn-jocher commented 10 months ago

@aash1999 it seems like you've made quite a few changes to the YOLOv5m architecture, which can have a significant impact on performance metrics, especially precision and recall. Please keep in mind this is an advanced customization and may require careful debugging.
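For context, a custom block wired into models/common.py typically looks something like the minimal sketch below; SimpleChannelAttention is a hypothetical stand-in for your CBAMBottleneck/Involution modules, and the class must also be handled in yolo.py's parse_model and referenced in your model YAML:

import torch
import torch.nn as nn

class SimpleChannelAttention(nn.Module):
    # Hypothetical squeeze-and-excitation style gate; c1 = input channels, r = reduction ratio
    def __init__(self, c1, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),    # global average pool -> (B, c1, 1, 1)
            nn.Conv2d(c1, c1 // r, 1),  # squeeze
            nn.SiLU(),
            nn.Conv2d(c1 // r, c1, 1),  # excite
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)           # channel-wise reweighting, shape preserved

# quick smoke test
x = torch.randn(1, 512, 20, 20)
assert SimpleChannelAttention(512)(x).shape == x.shape

Because a block like this preserves input shape, it can be dropped between existing layers without changing the Concat channel arithmetic elsewhere in the YAML.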

First, I'd recommend checking the VisDrone dataset annotations to ensure they're in the correct YOLO format. Also, verify that the class labels are consistent across your dataset and the model configuration.
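As a rough sketch (paths and class count are illustrative, adjust to your setup), label files can be spot-checked for structural validity like this:

from pathlib import Path

LABEL_DIR = Path("datasets/VisDrone/labels/val")  # hypothetical location of your val labels
NC = 10                                           # number of classes in your data YAML

for f in sorted(LABEL_DIR.glob("*.txt")):
    for i, line in enumerate(f.read_text().splitlines(), 1):
        parts = line.split()
        if not parts:                             # skip blank lines
            continue
        assert len(parts) == 5, f"{f.name}:{i} expected 'class cx cy w h', got {len(parts)} fields"
        cls, *box = parts
        assert cls.isdigit() and int(cls) < NC, f"{f.name}:{i} bad class id {cls}"
        assert all(0.0 <= float(v) <= 1.0 for v in box), f"{f.name}:{i} coords must be normalized to 0-1"
print("labels look structurally OK")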

Furthermore, please note your Python package versions are a bit outdated and a mix of various branches. Upgrading to the latest YOLOv5 package version or using the Docker image should help confirm if the issue is related to the architecture changes or the environment.

Besides, Python 3.10.9 has not been fully verified with YOLOv5 yet. Please consider downgrading to a well-tested version in the supported range (Python>=3.7.0, e.g., 3.8 or 3.9) before proceeding, to avoid potential compatibility issues.

After addressing these points, if the issue persists, you could try to train the network without any architecture changes to establish a baseline detection performance and then gradually introduce modifications to better understand the impact.

In any case, I'd recommend ensuring that the dataset is suitable and well-prepared, and then verifying the model's performance on the original YOLOv5 architecture before introducing customizations.

Let me know if the issue persists after making these adjustments.

fyang064 commented 8 months ago

Hi @aash1999, just wondering if you have figured out the issue. I'd like to learn more if you are willing to share your experience. The original YOLOv5 architecture worked pretty well on my custom dataset before any customizations; however, P and R have been zero ever since I introduced the new loss.

glenn-jocher commented 8 months ago

Hi @fyang064,

Zero precision and recall after modifying the YOLOv5 architecture can be due to several reasons. Here are a few steps you can take to debug the issue:

  1. Sanity Check: Ensure that your modified model is capable of overfitting a very small dataset (e.g., 1-2 images). If it cannot, there might be an issue with the architecture changes.

  2. Data Loader: Verify that the data loader is correctly loading and preprocessing the images and labels. Check if the annotations are correct and match the input data.

  3. Learning Rate: Sometimes, if the learning rate is too high, the model may not learn effectively. Try reducing the learning rate.

  4. Loss Function: Confirm that the loss function is being calculated correctly and that gradients are flowing through the network as expected.

  5. Model Outputs: Inspect the raw outputs of the model to ensure they are sensible (e.g., not all zeros or NaNs).

  6. Backbone Pretraining: If you've introduced new layers or blocks (like CBAM or Involution), it might be beneficial to pretrain the backbone on a related task or dataset before fine-tuning on your target dataset.

  7. Batch Size: A very small batch size can sometimes lead to unstable training, especially with batch normalization layers.

  8. Anchor Boxes: If you've changed the architecture significantly, you might need to re-calculate the anchor boxes to better fit your dataset.

  9. Environment: As mentioned before, ensure your environment matches the requirements for YOLOv5. Python 3.10 is not officially supported, so consider using Python 3.7 or 3.8.

  10. Debugging Tools: Utilize debugging tools like printing shapes of tensors at various points, using PyTorch's torch.autograd.set_detect_anomaly(True), and visualizing feature maps (see the sketch below).

Remember, when modifying architectures, it's crucial to make changes incrementally and test at each step to isolate where the issue might be occurring. If you're still facing issues, consider reverting to the last known good configuration and reintroducing changes one at a time.
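Expanding on points 5 and 10 above, here is a minimal sketch of how raw head outputs can be inspected (this loads the stock yolov5s via torch.hub purely for illustration; substitute your modified model or checkpoint):

import torch

torch.autograd.set_detect_anomaly(True)  # enable before training runs to surface NaN/Inf in backward

model = torch.hub.load("ultralytics/yolov5", "yolov5s", autoshape=False)  # raw DetectionModel
model.eval()

x = torch.zeros(1, 3, 640, 640)          # dummy input batch
with torch.no_grad():
    pred = model(x)[0]                   # raw predictions, shape (1, num_predictions, 5 + nc)

print("shape:", tuple(pred.shape))
print("any NaN:", pred.isnan().any().item(), "| max |value|:", pred.abs().max().item())

All-zero or NaN values here point at the architecture or loss; sensible values point the search toward the validation pipeline instead. For point 8, note that YOLOv5's built-in AutoAnchor check runs at the start of training by default (disable with --noautoanchor), and utils/autoanchor.py's kmean_anchors() can be called directly to evolve dataset-specific anchors.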

fyang064 commented 8 months ago

Hi @glenn-jocher, I appreciate your kind help and supportive advice. I checked the code following your instructions and found that the output after going through the process_batch function becomes insensible, e.g., all zeros, during the validation process. I guess the issue (all zeros for P, R, and AP during validation) arose when I introduced a new loss function; however, the training losses look normal to me. Looking forward to your help!

glenn-jocher commented 8 months ago

Hi @fyang064,

If the process_batch function is producing all zeros during validation, it suggests that the model's predictions are not matching any ground truth labels, which would indeed result in zero precision and recall. Here are a few additional steps to consider:

  1. Loss Function: Double-check the implementation of your new loss function. Ensure that it's properly computing gradients and that it's compatible with the rest of the model. It's possible that the loss function works well during training but fails to generalize to validation data.

  2. Output Activation: Verify that the activation functions at the output layer are appropriate for the task. For instance, object detection typically requires a sigmoid activation for the objectness score and class probabilities, and a linear activation for bounding box regression.

  3. Thresholds: Check the confidence and IoU thresholds used during validation. If they are set too high, it might result in all detections being filtered out (see the example below).

  4. Data Augmentation: If you're using aggressive data augmentation, it might be overfitting to the training data and not generalizing well to the validation set. Try reducing or disabling augmentation to see if it affects the validation metrics.

  5. Validation Data: Ensure that the validation data is correctly labeled and that the labels are in the correct format. Also, confirm that the validation dataset is representative of the training data.

  6. Model Checkpoints: If you're loading weights from a checkpoint, ensure that the weights are compatible with the modified architecture.

  7. Debugging: Use debugging statements to print out the predictions and targets just before they are passed to the loss function during validation. This can help you identify if the issue is with the model predictions or the processing of data.

  8. Revert to Baseline: Temporarily revert to the original loss function and see if the validation metrics return to normal. This can help confirm whether the issue is with the new loss function.

  9. Gradual Changes: Introduce the new loss function gradually, starting with a weighted combination of the old and new losses, and monitor the effect on validation metrics.

  10. Consult the Community: If you're still stuck, consider reaching out to the community with details of your implementation. Sometimes, a fresh set of eyes can spot issues that are not immediately obvious.

Remember to make one change at a time and test thoroughly after each modification. This approach will help you isolate the problem more effectively. Good luck!
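For point 3 concretely, validation can be re-run with a deliberately low confidence threshold so that nothing is filtered out prematurely (paths here are illustrative):

python val.py --weights runs/train/exp/weights/best.pt --data VisDrone.yaml --conf-thres 0.001 --iou-thres 0.6

If detections show up at very low confidence but vanish at your usual threshold, the model is predicting, just not confidently.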

rashmisangwan commented 4 months ago

I'm also facing the same issue while using the Ghost network and the Coordinate Attention mechanism with YOLOv5. Previously, when I only updated the loss function of YOLOv5, I didn't encounter this issue.

glenn-jocher commented 4 months ago

Hi there!

It sounds like you're diving into some advanced customizations with YOLOv5, and that's awesome! 🚀 When incorporating complex structures like the Ghost network and the Coordinate Attention mechanism, it's crucial to ensure all parts are seamlessly integrated. If you've previously updated the loss function without issues but are now encountering problems, consider closely examining the interfaces between these components and YOLOv5's architecture.

A quick tip: Pay special attention to the shape and format of inputs and outputs at each modification point. Also, debugging prints can be very helpful to verify that data flows as expected through your modified network.
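For example, one minimal (and hypothetical) way to print the shape flowing out of every top-level layer is to attach forward hooks; shown here on the stock yolov5s via torch.hub, but the same works on a modified model:

import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", autoshape=False)

def report(name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, (list, tuple)) else output  # Detect returns a tuple in eval
        print(f"{name:28s} -> {tuple(out.shape)}")
    return hook

for i, layer in enumerate(model.model):  # model.model is the nn.Sequential of YAML layers
    layer.register_forward_hook(report(f"{i:02d} {layer.__class__.__name__}"))

model.eval()
with torch.no_grad():
    model(torch.zeros(1, 3, 640, 640))   # shapes print layer by layer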

If your loss values and training metrics look correct but validation suffers, it might be valuable to revisit the validation data preparation and ensure it's aligned with your model's expected input format.

Keep experimenting, and feel free to share snippets of your integration code for more targeted advice. Happy coding! 🚀