ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Performance degradation #3933

Closed jaideep11061982 closed 3 years ago

jaideep11061982 commented 3 years ago

Before submitting a bug report, please be aware that your issue must be reproducible with all of the following, otherwise it is non-actionable, and we can not help you:

If this is a custom dataset/training question you must include your train*.jpg, test*.jpg and results.png figures, or we can not help you. You can generate these with utils.plot_results().

🐛 Bug

A clear and concise description of what the bug is. I find that the model now takes almost double the time to finish 1 epoch: from 4 minutes for my dataset with 215 iterations to 9-10 minutes now. Were any recent changes made?

To Reproduce (REQUIRED)

Input:

import torch

a = torch.tensor([5])
c = a / 0

Output:

Traceback (most recent call last):
  File "/Users/glennjocher/opt/anaconda3/envs/env1/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-be04c762b799>", line 5, in <module>
    c = a / 0
RuntimeError: ZeroDivisionError

Expected behavior

A clear and concise description of what you expected to happen.

Environment

If applicable, add screenshots to help explain your problem.

Additional context

Add any other context about the problem here.

glenn-jocher commented 3 years ago

@jaideep11061982 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

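When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible that still produces the same problem
  • Complete – Provide all parts someone else needs to reproduce your problem in the question itself
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

  • Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits.
  • Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.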

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

jaideep11061982 commented 3 years ago

Until last night it was all fine. Today when I run it, it is too slow.

glenn-jocher commented 3 years ago

@jaideep11061982 for Albumentations integration see: https://github.com/ultralytics/yolov5/pull/3882

jaideep11061982 commented 3 years ago

Yes, I had seen that. My concern is that there is a slowdown in the running of epochs.

glenn-jocher commented 3 years ago

@jaideep11061982 we didn't see any notable slowdown when training COCO128 in the Colab notebook with Albumentations.

You can disable albumentations by uninstalling the package before training: pip uninstall albumentations
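Before uninstalling, it may help to confirm whether the package is importable in the training environment at all. The following is a minimal sketch using only the Python standard library; the helper name `is_installed` is made up for illustration and is not part of YOLOv5.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if `package` can be found by the import system."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    # A False here suggests the Albumentations integration is not in play.
    print("albumentations installed:", is_installed("albumentations"))
```

Running this in both the fast and the slow environment gives a quick first check on whether the two runs differ in this respect.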

jaideep11061982 commented 3 years ago

@glenn-jocher as you can see below, the training progress bar is also not appearing now. I also reverted the Kaggle Docker image to the old one. As I told you, until last night I didn't have any issue; the issue started this afternoon IST, 7-8 hours ago. Performance degradation and no progress bar:


github: up to date with https://github.com/ultralytics/yolov5 ✅
2021-07-08 16:21:48.381809: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-08 16:21:53.273274: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-08 16:21:53.275689: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
wandb: W&B syncing is set to `offline` in this directory.  Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Downloading https://github.com/ultralytics/yolov5/releases/download/v5.0/yolov5m.pt to yolov5m.pt...
100%|██████████████████████████████████████| 41.1M/41.1M [00:01<00:00, 24.9MB/s]

train: Scanning '/kaggle/working/train' images and labels...4843 found, 0 missin
val: Scanning '/kaggle/working/val' images and labels...1211 found, 0 missing, 3
Plotting labels... 

autoanchor: Analyzing anchors... anchors/target = 4.79, Best Possible Recall (BPR) = 1.0000
      0/44     12.8G   0.08314   0.03411         0    0.1173        25       704
               Class     Images     Labels          P          R     mAP@.5 mAP@
                 all       1211       1496      0.177      0.239        0.1      0.021
      1/44     13.5G   0.06637   0.03363         0       0.1        49       768 

glenn-jocher commented 3 years ago

@jaideep11061982 as I already mentioned before, we require a minimum reproducible example otherwise there is no action for us to take. An MRE would consist of you providing code that we can run that shows the slowdown, i.e.:

git clone https://github.com/ultralytics/yolov5
cd yolov5

git checkout some_commit
python train.py --epochs 3

git checkout some_other_commit
python train.py --epochs 3

If you can not provide this your issue is non-actionable on our part.
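The before/after comparison requested above can also be timed from a script so that both runs are measured the same way. This is only a sketch under stated assumptions: it must be run from inside a yolov5 clone, the commit names are supplied by the caller, and the helper names `time_command` and `compare_commits` are hypothetical.

```python
import subprocess
import sys
import time
from typing import Sequence


def time_command(cmd: Sequence[str]) -> float:
    """Run `cmd` to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(list(cmd), check=True)
    return time.perf_counter() - start


def compare_commits(commits: Sequence[str], epochs: int = 3) -> dict:
    """Check out each commit in turn and time the same short training run.

    Assumes the current directory is a yolov5 clone. The commit names are
    placeholders chosen by the caller, as in the commands above.
    """
    timings = {}
    for commit in commits:
        subprocess.run(["git", "checkout", commit], check=True)
        timings[commit] = time_command(
            [sys.executable, "train.py", "--epochs", str(epochs)]
        )
    return timings
```

Calling `compare_commits(["some_commit", "some_other_commit"])` would return one wall-clock figure per commit. The figure includes dataset scanning and validation, so it is a coarse measure rather than a pure per-iteration training time.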

@jaideep11061982 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible that still produces the same problem
  • Complete – Provide all parts someone else needs to reproduce your problem in the question itself
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

  • Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits.
  • Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

jaideep11061982 commented 3 years ago

@glenn-jocher you can check this notebook: https://www.kaggle.com/ammarnassanalhajali/covid-19-detection-yolov5-3classes-training/data

Check the training output: it is not printed, and there is a CondaEnvException.

Let me know if this is sufficient or if you need more.

glenn-jocher commented 3 years ago

@jaideep11061982 notebook looks very impressive! But I don't see any before and after example. What we need from you is a reproducible set of steps that demonstrates the issue you raised:

git clone https://github.com/ultralytics/yolov5
cd yolov5

git checkout SOME_COMMIT
python train.py --epochs 3  # presumably this commit works well

git checkout SOME_OTHER_COMMIT
python train.py --epochs 3  # presumably this commit demonstrates your observed 'performance degradation'

jaideep11061982 commented 3 years ago

OK. @glenn-jocher, the above is the "after" example. Shall I now give you the link to a notebook that has no issue, so that it serves as the "before" example? Will that work? In the meantime, are you able to reproduce the issue with the current repository? I request you to try running one epoch with the current repository. If you simply fork the above notebook and run it in Kaggle, you will clearly see the highlighted issue. With that you may be able to say whether the issue is caused by the code or by the Kaggle environment. Thanks for looking into this so far.

jaideep11061982 commented 3 years ago

BEFORE issue: this is the log of the version that has no issue. Epoch time with the same dataset is just 3 minutes.

!git clone https://github.com/ultralytics/yolov5.git  # ran on time stamp below  2021-07-07 17:22:35. GMT
!WANDB_MODE="dryrun" python train.py --img $dim --batch $batch_size\
--epochs $epochs --data /kaggle/working/siim-cov19.yaml\
--weights yolov5m.pt 
2021-07-07 17:22:35.581968: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-07-07 17:22:40.335730: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-07-07 17:22:40.338224: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
wandb: W&B syncing is set to `offline` in this directory.  Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
train: Scanning '/kaggle/working/train' images and labels...4893 found, 0 missing, 1449 empty, 0 corrupted: 100%|██████████| 4893/4893 [00:02<00:00, 2087.37it/s]
val: Scanning '/kaggle/working/val' images and labels...1224 found, 0 missing, 374 empty, 0 corrupted: 100%|██████████| 1224/1224 [00:01<00:00, 1194.82it/s]
Plotting labels... 

autoanchor: Analyzing anchors... anchors/target = 4.80, Best Possible Recall (BPR) = 1.0000
      0/36     5.58G   0.07859   0.02692         0    0.1055        39       512: 100%|██████████| 204/204 [02:20<00:00,  1.45it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 26/26 [00:14<00:00,  1.77it/s]
                 all       1224       1586      0.217       0.24      0.128     0.0262
      1/36     6.26G   0.06119   0.02398         0   0.08517        41       512: 100%|██████████| 204/204 [02:14<00:00,  1.52it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 26/26 [00:13<00:00,  1.94it/s]
                 all       1224       1586      0.304      0.289      0.202     0.0416
      2/36     6.26G   0.05679   0.02212         0   0.07891        57       512: 100%|██████████| 204/204 [02:12<00:00,  1.54it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 26/26 [00:16<00:00,  1.61it/s]

jaideep11061982 commented 3 years ago

AFTER issue: log of the version that shows the problem. No output is printed as above, and the epoch time increased from 3 minutes to 8 minutes. The timestamp of the git clone is shown below in GMT. Currently the only repo version that works fine for me is https://www.kaggle.com/awsaf49/yolov5-official-v31-dataset

!git clone https://github.com/ultralytics/yolov5.git
github: up to date with https://github.com/ultralytics/yolov5 ✅
2021-07-08 10:20:30.813331: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-08 10:20:35.250085: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
wandb: W&B syncing is set to `offline` in this directory.  Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.

CondaEnvException: Unable to determine environment

Please re-run this command with one of the following options:

* Provide an environment name via --name or -n
* Re-run this command inside an activated conda environment.

jaideep11061982 commented 3 years ago

I suspect this may have something to do with the libcudart version loaded by TensorFlow, which is 11.0 in AFTER and 10.2 in BEFORE, as well as in all earlier versions that ran fine.
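To compare the CUDA runtime versions across several saved logs without reading them by eye, the loader lines can be extracted with a regular expression. This is a rough sketch: the function name `cudart_versions` is hypothetical, and the two sample strings are shortened copies of the loader lines in the logs above. It only reports what the TensorFlow loader opened; PyTorch reports the CUDA version of its own build separately via `torch.version.cuda`.

```python
import re

# Matches the version suffix in lines such as
# "Successfully opened dynamic library libcudart.so.11.0".
CUDART_PATTERN = re.compile(r"libcudart\.so\.(\d+(?:\.\d+)*)")


def cudart_versions(log_text: str) -> set:
    """Return the distinct libcudart versions mentioned in a log."""
    return set(CUDART_PATTERN.findall(log_text))


before_log = "Successfully opened dynamic library libcudart.so.10.2"
after_log = "Successfully opened dynamic library libcudart.so.11.0"
print("before:", cudart_versions(before_log))
print("after:", cudart_versions(after_log))
```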

jaideep11061982 commented 3 years ago

AFTER

!git clone https://github.com/ultralytics/yolov5.git

! WANDB_MODE="dryrun"  python train.py --img $dim --batch $batch_size\
--epochs $epochs --data /kaggle/working/siim-cov19.yaml\
--weights "yolov5m.pt" --name exp_new   --workers 6
upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=-1
github: up to date with https://github.com/ultralytics/yolov5 ✅
2021-07-10 07:57:16.848634: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-10 07:57:21.110975: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
wandb: W&B syncing is set to `offline` in this directory.  Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Downloading https://github.com/ultralytics/yolov5/releases/download/v5.0/yolov5m.pt to yolov5m.pt...
100%|██████████| 41.1M/41.1M [00:00<00:00, 43.6MB/s]

train: Scanning '/kaggle/working/train' images and labels...5199 found, 0 missing, 1548 empty, 0 corrupted: 100%|██████████| 5199/5199 [00:02<00:00, 2222.34it/s]
val: Scanning '/kaggle/working/val' images and labels...918 found, 0 missing, 275 empty, 0 corrupted: 100%|██████████| 918/918 [00:00<00:00, 1225.09it/s]
Plotting labels... 

autoanchor: Analyzing anchors... anchors/target = 4.80, Best Possible Recall (BPR) = 1.0000
      0/36     5.75G   0.09266   0.02613         0    0.1188        55       512:  29%|██▉       | 63/217 [02:31<06:01,  2.35s/it]

jaideep11061982 commented 3 years ago

BEFORE

https://www.kaggle.com/awsaf49/yolov5-official-v31-dataset  # link to working repo
shutil.copytree('/kaggle/input/yolov5-official-v31-dataset/yolov5', '/kaggle/working/yolov5')

! WANDB_MODE="dryrun"  python train.py --img $dim --batch $batch_size\
--epochs $epochs --data /kaggle/working/siim-cov19.yaml\
--weights "yolov5m.pt" --name exp_new   --workers 6

2021-07-10 08:04:32.581187: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Downloading https://github.com/ultralytics/yolov5/releases/download/v3.1/yolov5m.pt to yolov5m.pt...
100%|██████████████████████████████████████| 41.9M/41.9M [00:00<00:00, 58.4MB/s]

2021-07-10 08:04:39.721137: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
wandb: W&B syncing is set to `offline` in this directory.  Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Scanning '/kaggle/working/siim-cov19/labels/train' for images and labels... 5199 found, 0 missing, 1548 empty, 0 corrupted: 100%|██████████| 5199/5199 [00:01<00:00, 2680.64it/s]
Scanning '/kaggle/working/siim-cov19/labels/train.cache' for images and labels... 5199 found, 0 missing, 1548 empty, 0 corrupted: 100%|██████████| 5199/5199 [00:00<?, ?it/s]
Scanning '/kaggle/working/siim-cov19/labels/val' for images and labels... 918 found, 0 missing, 275 empty, 0 corrupted: 100%|██████████| 918/918 [00:00<00:00, 1077.04it/s]
Scanning '/kaggle/working/siim-cov19/labels/val.cache' for images and labels... 918 found, 0 missing, 275 empty, 0 corrupted: 100%|██████████| 918/918 [00:00<?, ?it/s]
Plotting labels... 

Analyzing anchors... anchors/target = 4.78, Best Possible Recall (BPR) = 1.0000
      0/36     6.37G   0.08126   0.03886         0    0.1201        67       512:  84%|████████▍ | 182/217 [02:01<00:23,  1.50it/s]

jaideep11061982 commented 3 years ago

@glenn-jocher I have arranged the input in the format you were looking for. You can clearly see a big difference in performance between the previous repo linked above (2 minutes per epoch) and the latest one (8 minutes per epoch).

The output issue was caused by the wandb version, which I reverted to the old one, so you can ignore that.
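The two progress bars report their rates in different units (`1.50it/s` for the old repo, `2.35s/it` for the new one), which makes them awkward to compare directly. The sketch below converts both to minutes per training epoch for the 217 iterations shown in the 2021-07-10 logs; the helper names are made up for illustration, and validation time is not included.

```python
import re

RATE_PATTERN = re.compile(r"([\d.]+)\s*(it/s|s/it)")


def seconds_per_iteration(rate: str) -> float:
    """Convert a tqdm rate such as '1.50it/s' or '2.35s/it' to seconds per iteration."""
    match = RATE_PATTERN.search(rate)
    if match is None:
        raise ValueError(f"unrecognised rate: {rate!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value if unit == "s/it" else 1.0 / value


def epoch_minutes(rate: str, iterations: int) -> float:
    """Estimated minutes for one training epoch of `iterations` batches."""
    return seconds_per_iteration(rate) * iterations / 60.0


# Rates and iteration count taken from the two 2021-07-10 logs above.
print(f"old repo: {epoch_minutes('1.50it/s', 217):.1f} min/epoch")
print(f"new repo: {epoch_minutes('2.35s/it', 217):.1f} min/epoch")
```

The results come out at roughly 2.4 and 8.5 minutes per epoch, consistent with the 2 versus 8 minutes quoted above, a slowdown of about 3.5x per iteration.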

glenn-jocher commented 3 years ago

@jaideep11061982 if the issue is wandb related then @AyushExel may be able to help. @AyushExel this user is saying a wandb update is affecting training performance in Kaggle notebooks.

glenn-jocher commented 3 years ago

@jaideep11061982 I'm not sure what's in your notebook, but one thing I noticed is that it is pulling out-of-date YOLOv5 models from the v3.1 release. You should start from a fresh git clone of this repo for all future work.

AyushExel commented 3 years ago

@jaideep11061982 @glenn-jocher in both of the runs above wandb is set to dryrun mode, so I don't think the performance degradation has anything to do with that.

glenn-jocher commented 3 years ago

@jaideep11061982 yeah I think this is simply your environment.

If you truly want to create a reproducible example of performance between different versions of YOLOv5, or different versions of wandb, you need to use the exact format I showed you so that both before and after cases run in the exact same environment on the same hardware and on a common dataset like COCO128 so everyone else can reproduce.

For YOLOv5:

git clone https://github.com/ultralytics/yolov5
cd yolov5

git checkout COMMIT_THAT_WORKS_WELL
python train.py --epochs 3  # presumably this commit works well

git checkout COMMIT_THAT_PRODUCES_YOUR_ISSUE
python train.py --epochs 3  # presumably this commit demonstrates your observed 'performance degradation'

For wandb:

git clone https://github.com/ultralytics/yolov5
cd yolov5

pip install wandb==VERSION_THAT_WORKS_WELL
python train.py --epochs 3  # presumably this works well

pip install wandb==VERSION_THAT_PRODUCES_YOUR_ISSUE
python train.py --epochs 3  # presumably this demonstrates your observed 'performance degradation'

jaideep11061982 commented 3 years ago

@AyushExel @glenn-jocher just to clarify, wandb was a separate issue with no relation to performance. It only affected the display of the progress bar, which was resolved after I installed the desired version. Done. Please forget wandb.

Now the standing issue is that performance differs between the older version run and the newest version run. Just to let you know, Kaggle upgraded their Docker image 2 days ago. I don't know which part of the upgrade is working against YOLOv5. In Colab, when I train YOLOv5 using the latest repo, I don't see any issue with performance, so I am clueless about what is hurting performance in Kaggle after the Docker image upgrade there.

I am not sure whether this will eventually roll out to all environments in the future, so that everyone starts facing the issue.
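Since the suspicion here is a difference between environments (Kaggle after the image upgrade versus Colab), one way to narrow it down is to record the installed versions in each environment and compare the two outputs. This is a sketch only: it assumes Python 3.8 or newer for `importlib.metadata`, the function name `environment_snapshot` is hypothetical, and the package list is a guess at what might matter rather than a known cause.

```python
import platform
import sys
from importlib import metadata


def environment_snapshot(packages):
    """Map 'python', 'platform' and each package name to a version string.

    Packages that are not installed are recorded as None.
    """
    snapshot = {"python": sys.version.split()[0], "platform": platform.platform()}
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[name] = None
    return snapshot


if __name__ == "__main__":
    # The package list is an assumption; extend it with whatever the image ships.
    names = ["torch", "torchvision", "numpy", "opencv-python", "wandb"]
    for key, value in environment_snapshot(names).items():
        print(f"{key}: {value}")
```

Saving this output from the fast and the slow environment and diffing the two files would show which packages changed with the image upgrade.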

github-actions[bot] commented 3 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!