@jaideep11061982 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.
When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:
- ✅ Minimal – Use as little code as possible that still produces the same problem
- ✅ Complete – Provide all parts someone else needs to reproduce your problem in the question itself
- ✅ Reproducible – Test the code you're about to provide to make sure it reproduces the problem
In addition to the above requirements, for Ultralytics to provide assistance your code should be:
- ✅ Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits
- ✅ Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️
If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template, providing a minimum reproducible example to help us better understand and diagnose your problem.
Thank you! 😃
Until last night everything was fine. Today when I run training, it is much too slow.
@jaideep11061982 for Albumentations integration see: https://github.com/ultralytics/yolov5/pull/3882
Yes, I had seen that. My concern is that there is a slowdown in how fast the epochs run.
@jaideep11061982 we didn't see any notable slowdown when training COCO128 in the Colab notebook with Albumentations.
You can disable albumentations by uninstalling the package before training: pip uninstall albumentations
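For example, a quick way to A/B test this (a sketch only; coco128.yaml and yolov5s.pt are just convenient small defaults, substitute your own data and weights):
pip show albumentations  # check whether the package is installed at all
python train.py --data coco128.yaml --weights yolov5s.pt --epochs 1  # one epoch with Albumentations active
pip uninstall -y albumentations
python train.py --data coco128.yaml --weights yolov5s.pt --epochs 1  # one epoch without it, for comparison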
@glenn-jocher as you can see below, the training progress bar is also not appearing now. I also reverted the Kaggle Docker image to the old one. As I told you, until last night I didn't have any issue; the issue started this afternoon IST, 7-8 hours ago. Performance degradation and no progress bar:
github: up to date with https://github.com/ultralytics/yolov5 ✅
2021-07-08 16:21:48.381809: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-08 16:21:53.273274: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-08 16:21:53.275689: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Downloading https://github.com/ultralytics/yolov5/releases/download/v5.0/yolov5m.pt to yolov5m.pt...
100%|██████████████████████████████████████| 41.1M/41.1M [00:01<00:00, 24.9MB/s]
train: Scanning '/kaggle/working/train' images and labels...4843 found, 0 missin
val: Scanning '/kaggle/working/val' images and labels...1211 found, 0 missing, 3
Plotting labels...
autoanchor: Analyzing anchors... anchors/target = 4.79, Best Possible Recall (BPR) = 1.0000
0/44 12.8G 0.08314 0.03411 0 0.1173 25 704
Class Images Labels P R mAP@.5 mAP@
all 1211 1496 0.177 0.239 0.1 0.021
1/44 13.5G 0.06637 0.03363 0 0.1 49 768
@jaideep11061982 as I already mentioned before, we require a minimum reproducible example otherwise there is no action for us to take. An MRE would consist of you providing code that we can run that shows the slowdown, i.e.:
git clone https://github.com/ultralytics/yolov5
git checkout some_commit
python train.py --epochs 3
git checkout some_other_commit
python train.py --epochs 3
If you cannot provide this, your issue is non-actionable on our part.
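For example, wrapping each run with time gives a concrete epoch time per commit (a sketch only; some_commit/some_other_commit are placeholders for the two commits you want to compare, and coco128.yaml/yolov5s.pt are just small common defaults):
git checkout some_commit
time python train.py --data coco128.yaml --weights yolov5s.pt --epochs 3
git checkout some_other_commit
time python train.py --data coco128.yaml --weights yolov5s.pt --epochs 3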
@jaideep11061982 👋 hi, thanks for letting us know about this problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.
How to create a Minimal, Reproducible Example
When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:
- ✅ Minimal – Use as little code as possible that still produces the same problem
- ✅ Complete – Provide all parts someone else needs to reproduce your problem in the question itself
- ✅ Reproducible – Test the code you're about to provide to make sure it reproduces the problem
In addition to the above requirements, for Ultralytics to provide assistance your code should be:
- ✅ Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits
- ✅ Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️
If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.
Thank you! 😃
@glenn-jocher you can have a look at this notebook: https://www.kaggle.com/ammarnassanalhajali/covid-19-detection-yolov5-3classes-training/data
Check the training output: it is not printed, and there is a CondaEnvException.
Let me know if this is sufficient or if you need more.
@jaideep11061982 notebook looks very impressive! But I don't see any before and after example. What we need from you is a reproducible set of steps that demonstrates the issue you raised:
git clone https://github.com/ultralytics/yolov5
cd yolov5
git checkout SOME_COMMIT
python train.py --epochs 3 # presumably this commit works well
git checkout SOME_OTHER_COMMIT
python train.py --epochs 3 # presumably this commit demonstrates your observed 'performance degradation'
OK @glenn-jocher, the above is the 'after' example. Shall I now give you the link to a notebook that has no issue, so it would be the 'before' example? Will that work? In the meanwhile, are you able to reproduce the issue with the current repository? I request you to try running one epoch with the current repository; if you simply fork the above notebook and run it in Kaggle you will clearly see the highlighted issue. With that you may be able to say whether the issue is caused by the code or by the Kaggle environment. Thanks for looking into this so far.
BEFORE (no issue): this is the log of the version that has no issue; epoch time with the same dataset is about 3 minutes.
!git clone https://github.com/ultralytics/yolov5.git # cloned at the timestamp below, 2021-07-07 17:22:35 GMT
!WANDB_MODE="dryrun" python train.py --img $dim --batch $batch_size\
--epochs $epochs --data /kaggle/working/siim-cov19.yaml\
--weights yolov5m.pt
2021-07-07 17:22:35.581968: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-07-07 17:22:40.335730: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-07-07 17:22:40.338224: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
train: Scanning '/kaggle/working/train' images and labels...4893 found, 0 missing, 1449 empty, 0 corrupted: 100%|██████████| 4893/4893 [00:02<00:00, 2087.37it/s]
val: Scanning '/kaggle/working/val' images and labels...1224 found, 0 missing, 374 empty, 0 corrupted: 100%|██████████| 1224/1224 [00:01<00:00, 1194.82it/s]
Plotting labels...
autoanchor: Analyzing anchors... anchors/target = 4.80, Best Possible Recall (BPR) = 1.0000
0/36 5.58G 0.07859 0.02692 0 0.1055 39 512: 100%|██████████| 204/204 [02:20<00:00, 1.45it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 26/26 [00:14<00:00, 1.77it/s]
all 1224 1586 0.217 0.24 0.128 0.0262
1/36 6.26G 0.06119 0.02398 0 0.08517 41 512: 100%|██████████| 204/204 [02:14<00:00, 1.52it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 26/26 [00:13<00:00, 1.94it/s]
all 1224 1586 0.304 0.289 0.202 0.0416
2/36 6.26G 0.05679 0.02212 0 0.07891 57 512: 100%|██████████| 204/204 [02:12<00:00, 1.54it/s]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 26/26 [00:16<00:00, 1.61it/s]
AFTER (issue): log of the version that gave the issue. No output is printed as above, and epoch time also increased from 3 minutes to 8 minutes. The timestamp of the git clone is shown below in GMT. Currently the only repo version that works fine is https://www.kaggle.com/awsaf49/yolov5-official-v31-dataset
!git clone https://github.com/ultralytics/yolov5.git
github: up to date with https://github.com/ultralytics/yolov5 ✅
2021-07-08 10:20:30.813331: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-08 10:20:35.250085: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
CondaEnvException: Unable to determine environment
Please re-run this command with one of the following options:
* Provide an environment name via --name or -n
* Re-run this command inside an activated conda environment.
I suspect this may have something to do with the libcudart version loaded by the TensorFlow library, which is 11.0 in the AFTER run and 10.2 in the BEFORE run, as well as in all the earlier versions that ran fine.
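Since YOLOv5 trains with PyTorch rather than TensorFlow, one way to narrow this down (a sketch, not a fix) is to record the CUDA stack that the driver and PyTorch actually report in each environment, on a GPU runtime:
nvidia-smi  # GPU name and driver version exposed by the container
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version(), torch.cuda.get_device_name(0))"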
AFTER
!git clone https://github.com/ultralytics/yolov5.git
! WANDB_MODE="dryrun" python train.py --img $dim --batch $batch_size\
--epochs $epochs --data /kaggle/working/siim-cov19.yaml\
--weights "yolov5m.pt" --name exp_new --workers 6
upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=-1
github: up to date with https://github.com/ultralytics/yolov5 ✅
2021-07-10 07:57:16.848634: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-07-10 07:57:21.110975: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Downloading https://github.com/ultralytics/yolov5/releases/download/v5.0/yolov5m.pt to yolov5m.pt...
100%|██████████| 41.1M/41.1M [00:00<00:00, 43.6MB/s]
train: Scanning '/kaggle/working/train' images and labels...5199 found, 0 missing, 1548 empty, 0 corrupted: 100%|██████████| 5199/5199 [00:02<00:00, 2222.34it/s]
val: Scanning '/kaggle/working/val' images and labels...918 found, 0 missing, 275 empty, 0 corrupted: 100%|██████████| 918/918 [00:00<00:00, 1225.09it/s]
Plotting labels...
autoanchor: Analyzing anchors... anchors/target = 4.80, Best Possible Recall (BPR) = 1.0000
0/36 5.75G 0.09266 0.02613 0 0.1188 55 512: 29%|██▉ | 63/217 [02:31<06:01, 2.35s/it]
BEFORE
https://www.kaggle.com/awsaf49/yolov5-official-v31-dataset #link to working repo
shutil.copytree('/kaggle/input/yolov5-official-v31-dataset/yolov5', '/kaggle/working/yolov5')
! WANDB_MODE="dryrun" python train.py --img $dim --batch $batch_size\
--epochs $epochs --data /kaggle/working/siim-cov19.yaml\
--weights "yolov5m.pt" --name exp_new --workers 6
2021-07-10 08:04:32.581187: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Downloading https://github.com/ultralytics/yolov5/releases/download/v3.1/yolov5m.pt to yolov5m.pt...
100%|██████████████████████████████████████| 41.9M/41.9M [00:00<00:00, 58.4MB/s]
2021-07-10 08:04:39.721137: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Scanning '/kaggle/working/siim-cov19/labels/train' for images and labels... 5199 found, 0 missing, 1548 empty, 0 corrupted: 100%|██████████| 5199/5199 [00:01<00:00, 2680.64it/s]
Scanning '/kaggle/working/siim-cov19/labels/train.cache' for images and labels... 5199 found, 0 missing, 1548 empty, 0 corrupted: 100%|██████████| 5199/5199 [00:00<?, ?it/s]
Scanning '/kaggle/working/siim-cov19/labels/val' for images and labels... 918 found, 0 missing, 275 empty, 0 corrupted: 100%|██████████| 918/918 [00:00<00:00, 1077.04it/s]
Scanning '/kaggle/working/siim-cov19/labels/val.cache' for images and labels... 918 found, 0 missing, 275 empty, 0 corrupted: 100%|██████████| 918/918 [00:00<?, ?it/s]
Plotting labels...
Analyzing anchors... anchors/target = 4.78, Best Possible Recall (BPR) = 1.0000
0/36 6.37G 0.08126 0.03886 0 0.1201 67 512: 84%|████████▍ | 182/217 [02:01<00:23, 1.50it/s]
@glenn-jocher I have arranged the input in the format you were looking for. You can clearly see a big difference in performance between the previous repo linked above (about 2 min/epoch) and the latest one (about 8 min/epoch).
The missing-output issue was caused by the wandb version, which I reverted to an older one, so you can ignore that.
@jaideep11061982 if the issue is wandb related then @AyushExel may be able to help. @AyushExel this user is saying a wandb update is affecting training performance in Kaggle notebooks.
@jaideep11061982 I'm not sure what's in your notebook, but one thing I noticed is that it is pulling out of date YOLOv5 models from the v3.1 release. You should start from a fresh git clone of this repo for all future work.
@jaideep11061982 @glenn-jocher in both of the runs above wandb is set to dryrun mode, so I don't think the performance degradation has anything to do with that.
@jaideep11061982 yeah I think this is simply your environment.
If you truly want to create a reproducible example of a performance difference between different versions of YOLOv5, or different versions of wandb, you need to use the exact format I showed you, so that both the before and after cases run in the exact same environment, on the same hardware, and on a common dataset like COCO128 that everyone else can reproduce with.
For YOLOv5:
git clone https://github.com/ultralytics/yolov5
cd yolov5
git checkout COMMIT_THAT_WORKS_WELL
python train.py --epochs 3 # presumably this commit works well
git checkout COMMIT_THAT_PRODUCES_YOUR_ISSUE
python train.py --epochs 3 # presumably this commit demonstrates your observed 'performance degradation'
For wandb:
git clone https://github.com/ultralytics/yolov5
cd yolov5
pip install wandb==VERSION_THAT_WORKS_WELL
python train.py --epochs 3 # presumably this works well
pip install wandb==VERSION_THAT_PRODUCES_YOUR_ISSUE
python train.py --epochs 3 # presumably this demonstrates your observed 'performance degradation'
@AyushExel @glenn-jocher just to clarify: wandb was a separate issue with no relation to performance. It affected only the display of the progress bar, which was resolved after I installed the desired version. Done, please forget wandb.
The standing issue now is that if we compare the older version run and the newest version, performance is different. Just to let you know, Kaggle upgraded their Docker image 2 days back, and I don't know what in that upgrade is working against YOLO. In Colab, when I train YOLO using the latest repo I don't see any issue with performance, so I am clueless about what is acting against performance in Kaggle after the Docker image upgrade there.
I am not sure whether this will eventually roll out to all environments, in which case everyone will start facing the issue.
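One way to narrow this down (a sketch, not a definitive procedure) would be to capture the installed package versions in both the Kaggle and Colab environments and diff them after the Docker image upgrade; the file names below are just placeholders:
pip freeze > kaggle_env.txt   # run the same command in Colab and save it as colab_env.txt
diff kaggle_env.txt colab_env.txt  # look for differences in torch, torchvision, opencv, albumentations, wandb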
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
Before submitting a bug report, please be aware that your issue must be reproducible with all of the following, otherwise it is non-actionable, and we can not help you:
- git fetch && git status -uno to check and git pull to update the repo
If this is a custom dataset/training question you must include your train*.jpg, test*.jpg and results.png figures, or we can not help you. You can generate these with utils.plot_results().
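A rough sketch of generating these from a finished run (the exact plot_results() signature has changed between YOLOv5 releases, so check utils/plots.py in your checkout; runs/train/exp is only the default output directory of a first training run):
python -c "from utils.plots import plot_results; plot_results(save_dir='runs/train/exp')"  # assumes a save_dir keyword; adjust to your version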
🐛 Bug
I find that it now takes almost double the time for the model to finish 1 epoch: from about 4 minutes for my dataset with 215 iterations to 9-10 minutes now. Were any recent changes made?