Terminating after "Plotting labels..." when training

KristofferK commented 3 years ago

While I am able to use YOLOv5 for inference, the train.py does not seem to work for me anymore. It did work previously however.

I have tried to clone the latest repo as well. I have set up a fresh Conda environment with Python 3.8. Again, inference works, but not training my custom data.

It creates the "exp" directory (exp24 in this case), which contains an empty "weights" directory, hyp.yaml, opt.yaml, and a TensorBoard events.out.tfevents file. No .pt, no images, no results.csv.

I have tried both the training set that I previously was able to train with and a new one I just created.

I run it using python train.py --img 640 --batch 4 --epochs 200 --data C:/Users/kristofferk/Documents/GitHub/p9-api/experiment/kristoffer/step06-data.yaml --weights yolov5s.pt

But when it comes to "Plotting labels..." it will be stuck there for about 20 seconds and then terminate without any further warnings or errors.

The output of running train.py is:

PS C:\Users\kristofferk\Documents\GitHub\yolov5> python train.py --img 640 --batch 4 --epochs 200 --data C:/Users/kristofferk/Documents/GitHub/p9-api/experiment/kristoffer/step06-data.yaml --weights yolov5s.pt
wandb: Currently logged in as: kristofferk (use `wandb login --relogin` to force relogin)
train: weights=yolov5s.pt, cfg=, data=C:/Users/kristofferk/Documents/GitHub/p9-api/experiment/kristoffer/step06-data.yaml, hyp=data\hyps\hyp.scratch.yaml, epochs=200, batch_size=4, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs\train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=0, save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/KristofferK/yolov5 
YOLOv5  v6.0-38-gc0c15d8 torch 1.8.2 CUDA:0 (NVIDIA GeForce RTX 3060, 12288.0MB)

hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs\train', view at http://localhost:6006/
wandb: Tracking run with wandb version 0.12.6
wandb: Syncing run stilted-monkey-10
wandb:  View project at https://wandb.ai/kristofferk/train
wandb:  View run at https://wandb.ai/kristofferk/train/runs/o18wqty1
wandb: Run data is saved locally in C:\Users\kristofferk\Documents\GitHub\yolov5\wandb\run-20211029_130316-o18wqty1
wandb: Run `wandb offline` to turn off syncing.

Overriding model.yaml nc=80 with nc=7

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Focus                     [3, 32, 3]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  3    156928  models.common.C3                        [128, 128, 3]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  3    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]
  9                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 24      [17, 20, 23]  1     32364  models.yolo.Detect                      [7, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7079724 parameters, 7079724 gradients, 16.4 GFLOPs

Transferred 355/361 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 59 weight, 62 weight (no decay), 62 bias
train: Scanning 'C:\Users\kristofferk\Documents\GitHub\p9-api\experiment\kristoffer\datasets\malaria-yolov5\labels' images and labels...:   0%| | 0/50 [00:00<?, ?itwandb: Currently logged in as: kristofferk (use `wandb login --relogin` to force relogin)
train: Scanning 'C:\Users\kristofferk\Documents\GitHub\p9-api\experiment\kristoffer\datasets\malaria-yolov5\labels' images and labels...1 found, 0 missing, 0 empty,wandb: Currently logged in as: kristofferk (use `wandb login --relogin` to force relogin)
train: Scanning 'C:\Users\kristofferk\Documents\GitHub\p9-api\experiment\kristoffer\datasets\malaria-yolov5\labels' images and labels...50 found, 0 missing, 0 empty 
train: New cache created: C:\Users\kristofferk\Documents\GitHub\p9-api\experiment\kristoffer\datasets\malaria-yolov5\labels.cache
wandb: Currently logged in as: kristofferk (use `wandb login --relogin` to force relogin)
wandb: Currently logged in as: kristofferk (use `wandb login --relogin` to force relogin)
wandb: Currently logged in as: kristofferk (use `wandb login --relogin` to force relogin)
wandb: Currently logged in as: kristofferk (use `wandb login --relogin` to force relogin)
val: Scanning 'C:\Users\kristofferk\Documents\GitHub\p9-api\experiment\kristoffer\datasets\malaria-yolov5\labels.cache' images and labels... 50 found, 0 missing, 0  
wandb: Currently logged in as: kristofferk (use `wandb login --relogin` to force relogin)
wandb: Currently logged in as: kristofferk (use `wandb login --relogin` to force relogin)
wandb: Currently logged in as: kristofferk (use `wandb login --relogin` to force relogin)
wandb: Currently logged in as: kristofferk (use `wandb login --relogin` to force relogin)
wandb: Currently logged in as: kristofferk (use `wandb login --relogin` to force relogin)
wandb: Currently logged in as: kristofferk (use `wandb login --relogin` to force relogin)
wandb: Currently logged in as: kristofferk (use `wandb login --relogin` to force relogin)
wandb: Currently logged in as: kristofferk (use `wandb login --relogin` to force relogin)
Plotting labels...
PS C:\Users\kristofferk\Documents\GitHub\yolov5>

Any suggestions on how to proceed from here? Either to fix it or at least get a more detailed error message.
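
One way to get more detail from a process that exits without a message is Python's standard `faulthandler` module, which dumps the Python traceback when the interpreter receives a fatal signal such as a segmentation fault. The following is a minimal sketch, assuming the termination is a native-level crash (not confirmed at this point in the thread); `enable_crash_traces` and the log file name are hypothetical and not part of YOLOv5.

```python
import faulthandler


def enable_crash_traces(log_path):
    """Write a Python traceback to log_path if the interpreter crashes hard."""
    # The file must stay open for as long as faulthandler is enabled.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Hypothetical usage: call this near the top of the training script.
    handle = enable_crash_traces("crash_trace.log")
    print("faulthandler enabled:", faulthandler.is_enabled())
```

The same handler can be switched on without code changes by launching the interpreter with `python -X faulthandler train.py ...`, in which case the traceback goes to stderr.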

Thanks in advance.

github-actions[bot] commented 3 years ago

👋 Hello @KristofferK, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python>=3.6.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

$ git clone https://github.com/ultralytics/yolov5
$ cd yolov5
$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Google Colab and Kaggle notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

KristofferK commented 3 years ago

Public W&B Logging link: https://wandb.ai/kknuds19/train/runs/o18wqty1/overview?workspace=user-kknuds19

glenn-jocher commented 3 years ago

@KristofferK it appears you may have environment problems. Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.8 environment, clone the latest repo (code changes daily), and pip install -r requirements.txt again. We also highly recommend using one of our verified environments (see the Requirements and Environments list in the bot comment above).

KristofferK commented 3 years ago

@glenn-jocher It is a freshly set up Anaconda environment, with the latest repo and requirements.txt. PyTorch is 1.8.2 (LTS).

glenn-jocher commented 3 years ago

@KristofferK unfortunately we don't have resources to help debug individual environments. If I were you I would create a venv and pip install everything; we don't use conda in our verified environments.

glenn-jocher commented 3 years ago

@KristofferK also for us to begin investigating an issue we need a minimum reproducible example. If we can't reproduce your issue there's no action for us to take. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

✅ Minimal – Use as little code as possible that still produces the same problem
✅ Complete – Provide all parts someone else needs to reproduce your problem in the question itself
✅ Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

✅ Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits.
✅ Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

MrinalJain17 commented 3 years ago

@glenn-jocher I am facing the same issue when running on a Windows machine with a newly set up environment and all the dependencies installed correctly.

It seems like plot_labels for some reason kills the entire process. Commenting out the line of code below from train.py led to normal training, without any errors.

https://github.com/ultralytics/yolov5/blob/5d4258fac5e6ceaa9c897f841cb737c56717a996/train.py#L235

To confirm, I also executed plot_labels in isolation using a manually created data loader, and it ended up killing the process as well. Moreover, to make sure that it's not a memory issue, I was using just 5 images for testing.
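
The isolation test described above can be approximated by running a small plotting snippet in a child interpreter and inspecting its exit code, so that a hard crash does not take the parent process down with it. This is a sketch under assumptions: `run_isolated` and `MATPLOTLIB_SMOKE_TEST` are hypothetical names, and the snippet is a generic matplotlib text-rendering call rather than YOLOv5's `plot_labels`.

```python
import subprocess
import sys


def run_isolated(code, timeout=120):
    """Run a Python snippet in a child interpreter and return (exit code, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


# Assumed smoke test: draw one line and a title with the non-interactive Agg backend.
MATPLOTLIB_SMOKE_TEST = (
    "import matplotlib; matplotlib.use('Agg'); "
    "import matplotlib.pyplot as plt; "
    "plt.plot([0, 1], [0, 1]); plt.title('smoke test'); "
    "plt.gcf().canvas.draw()"
)

if __name__ == "__main__":
    code, err = run_isolated(MATPLOTLIB_SMOKE_TEST)
    print("exit code:", code)
    print(err)
```

A non-zero exit code with an empty stderr would be consistent with the silent termination reported in this thread, though it does not by itself identify the responsible library.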

EDIT: It seems like there was a bug recently introduced in a package called freetype. Found some mentions here:

It only affects Windows machines.

@KristofferK Downgrading freetype to 2.10.4 fixed the issue.

KristofferK commented 3 years ago

@MrinalJain17 Thank you so much. That did indeed fix the issue. I hope yolov5 will either wrap the plot_labels in a Try/Except or force the version of the freetype package. I downgraded from 2.11.0 to 2.10.4, and it works again.
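
To check which freetype build a conda environment has, the package list can be inspected. The sketch below parses the JSON printed by `conda list freetype --json`, assuming that output is a list of objects with `name` and `version` keys; the helper names and the sample string are hypothetical, and the only versions encoded are the two reported in this thread.

```python
import json

# Versions reported in this thread: 2.11.0 crashed, 2.10.4 worked.
KNOWN_BAD_FREETYPE = {"2.11.0"}


def freetype_version(conda_list_json):
    """Return the freetype version from `conda list freetype --json` output, or None."""
    for package in json.loads(conda_list_json):
        # Match the exact name; the conda query may also return similarly named packages.
        if package.get("name") == "freetype":
            return package.get("version")
    return None


def is_affected(version):
    """True if the version matches the one reported as crashing in this thread."""
    return version in KNOWN_BAD_FREETYPE


if __name__ == "__main__":
    # Illustrative sample, not captured from a real environment.
    sample = '[{"name": "freetype", "version": "2.11.0", "channel": "pkgs/main"}]'
    version = freetype_version(sample)
    print(version, "affected" if is_affected(version) else "not known to be affected")
```

If the affected version is installed, pinning it with something like `conda install freetype=2.10.4` corresponds to the downgrade described above.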

glenn-jocher commented 3 years ago

@MrinalJain17 thanks for looking into this! It seems like there is no action for us to take then based upon your conclusions?

We can try: except label plotting also, but I'm not sure it's best practices for downstream matplotlib users to all adjust their code for error handling here.

glenn-jocher commented 3 years ago

On MacOS I don't see any freetype package here either. This is what my environment looks like based upon pip install -r requirements.txt

(venv) (base) glennjocher@Glenns-iMac yolov5 % pip list
Package                 Version
----------------------- ---------------------
absl-py                 0.15.0
appnope                 0.1.2
backcall                0.2.0
cachetools              4.2.4
certifi                 2021.10.8
charset-normalizer      2.0.7
cycler                  0.10.0
decorator               5.1.0
google-auth             2.3.0
google-auth-oauthlib    0.4.6
grpcio                  1.41.0
idna                    3.3
ipython                 7.28.0
jedi                    0.18.0
kiwisolver              1.3.2
Markdown                3.3.4
matplotlib              3.4.3
matplotlib-inline       0.1.3
numpy                   1.21.3
oauthlib                3.1.1
opencv-python           4.5.4.58
pandas                  1.3.4
parso                   0.8.2
pexpect                 4.8.0
pickleshare             0.7.5
Pillow                  8.4.0
pip                     21.3.1
prompt-toolkit          3.0.21
protobuf                3.19.0
ptyprocess              0.7.0
pyasn1                  0.4.8
pyasn1-modules          0.2.8
Pygments                2.10.0
pyparsing               2.4.7
python-dateutil         2.8.2
pytz                    2021.3
PyYAML                  6.0
requests                2.26.0
requests-oauthlib       1.3.0
rsa                     4.7.2
scipy                   1.7.1
seaborn                 0.11.2
setuptools              57.0.0
six                     1.16.0
tensorboard             2.7.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.0
thop                    0.0.31.post2005241907
torch                   1.10.0
torchvision             0.11.1
tqdm                    4.62.3
traitlets               5.1.0
typing-extensions       3.10.0.2
urllib3                 1.26.7
wcwidth                 0.2.5
Werkzeug                2.0.2
wheel                   0.36.2

MrinalJain17 commented 3 years ago

> @MrinalJain17 thanks for looking into this! It seems like there is no action for us to take then based upon your conclusions?
>
> We can try: except label plotting also, but I'm not sure it's best practices for downstream matplotlib users to all adjust their code for error handling here.

@glenn-jocher That makes sense. It's Windows-specific, and hopefully a temporary issue.

However, I believe it would be helpful to have some sort of a "known issues" tracker for the YOLOv5 repository, which would describe any such errors along with some troubleshooting options. Even in the future, if some other third-party library breaks any part of the code, users can find that info (and relevant solutions) in the said tracker.

glenn-jocher commented 3 years ago

@MrinalJain17 yes a known issue tracker is certainly a good idea. We have a TODO list with about 20 items which somewhat handles this currently. We track these through issue tags: https://github.com/ultralytics/yolov5/issues?q=is%3Aissue+label%3ATODO+

glenn-jocher commented 3 years ago

@MrinalJain17 seems like another Windows user had the same problem in #5611. I just realized another option besides try except is to use our utils.general.timeout. Maybe something like this:

@Timeout(30)
def plot_labels(labels, names=(), save_dir=Path('')):
    # plot dataset labels
...
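
For readers unfamiliar with the pattern, a timeout decorator of this kind can be built on `signal.alarm`. The sketch below is an illustration only, not the YOLOv5 implementation: the class body and `slow_plot_stub` are assumptions, and it relies on `signal.SIGALRM`, which exists only on Unix-like systems and can only be installed from the main thread.

```python
import contextlib
import signal
import time


class Timeout(contextlib.ContextDecorator):
    """Raise TimeoutError if the wrapped code runs longer than `seconds` (Unix only)."""

    def __init__(self, seconds):
        self.seconds = int(seconds)

    def _on_alarm(self, signum, frame):
        raise TimeoutError(f"timed out after {self.seconds}s")

    def __enter__(self):
        # Install the handler and schedule the alarm; remember the old handler.
        self._previous = signal.signal(signal.SIGALRM, self._on_alarm)
        signal.alarm(self.seconds)
        return self

    def __exit__(self, exc_type, exc, tb):
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, self._previous)
        return False  # do not suppress exceptions


@Timeout(1)
def slow_plot_stub():
    """Hypothetical stand-in for a plotting call that hangs."""
    time.sleep(3)
    return "finished"
```

Calling `slow_plot_stub()` raises `TimeoutError` after about one second. Python runs signal handlers between bytecode instructions, so a hang inside a long-running native call may not be interrupted by this approach.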

glenn-jocher commented 3 years ago

@MrinalJain17 wait I just noticed a difference. In #5611 the process just hangs at plot_labels(), but you said in your case the process actually terminated by itself?

glenn-jocher commented 3 years ago

@MrinalJain17 @KristofferK good news 😃! Your original issue may now be fixed ✅ in PR #5616. This PR does not fix any underlying issues with matplotlib/freetype, but it does enclose plot_labels() in try: except and Timeout decorators to bypass it in case of issues. This means no label plots will be produced if errors/hangs are encountered, but training will proceed normally without issue.

https://github.com/ultralytics/yolov5/blob/def7a0fd19c1629903c3b073b4df265407719a07/utils/plots.py#L327-L331

To receive this update:

Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
PyTorch Hub – Force-reload model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
Notebooks – View updated notebooks
Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

MrinalJain17 commented 3 years ago

@glenn-jocher So, if you notice this section below of the output from #5611, this is actually what the issue was. Basically, anything remotely close to a matplotlib command ended up killing the entire process.

The Timeout() approach should be quite helpful in the future, if something unexpectedly breaks (but hopefully not).

Moreover, the issue was very specific: it affected Windows machines using Anaconda with the default channel. The good news is that they've officially yanked the affected freetype build: https://github.com/AnacondaRecipes/repodata-hotfixes/pull/150

WJos commented 2 years ago

Hi @KristofferK! I see you are working on malaria detection. What kind of images are you working with (thick or thin smear)? I have the same project and would appreciate your help if possible. Thanks!

KristofferK commented 2 years ago

> Hi @KristofferK! I see you are working on malaria detection. What kind of images are you working with (thick or thin smear)? I have the same project and I want you to help me if possible. Thanks!

Hello WJos. The malaria dataset is not actually what I am working on, rather it was to test out yolov5 before using it on my own dataset of drosophila. For malaria I used https://www.kaggle.com/kmader/malaria-bounding-boxes/ and converted it to yolov5 format. I might still have the code for the converter if you're interested.
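
For context, a YOLOv5 label file has one line per object: a class index followed by the box center and size, all normalized by the image dimensions. The sketch below shows that conversion, assuming the source annotations are pixel-space corner boxes; the actual schema of the Kaggle dataset is not described in this thread, and `to_yolo_line` is a hypothetical helper, not the converter mentioned above.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space corner box to a YOLO label line (normalized xywh)."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


if __name__ == "__main__":
    # A 200x200 px box in a 1000x800 px image, class 0.
    print(to_yolo_line(0, 100, 200, 300, 400, 1000, 800))
```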

yeshanliu commented 2 years ago

> @MrinalJain17 seems like another Windows user had the same problem in #5611. I just realized another option besides try except is to use our utils.general.timeout. Maybe something like this:
>
> @Timeout(30)
> def plot_labels(labels, names=(), save_dir=Path('')):
>     # plot dataset labels
> ...

It doesn't work well on Windows, because the 'Timeout' class uses 'signal.SIGALRM'. It throws an error like "module 'signal' has no attribute 'SIGALRM'", but it works well on Linux. How about removing Timeout(30) but still keeping try_except?

glenn-jocher commented 2 years ago

@yeshanliu it appears you may have environment problems. The above code does work well on Windows; Windows is part of our daily CI testing:

https://github.com/ultralytics/yolov5/runs/5562838761?check_suite_focus=true

Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.9 environment, clone the latest repo (code changes daily), and pip install -r requirements.txt again from scratch.

💡 ProTip! Try one of our verified environments below if you are having trouble with your local environment.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Models and datasets download automatically from the latest YOLOv5 release when first requested.

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Google Colab and Kaggle notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 2 years ago

@yeshanliu I investigated some more. It looks like the Windows CI tests are passing because the Try Except decorator is outside the Timeout decorator and is catching the SIGALRM error. So the good news is it works on Windows if you are using current code; the bad news is it works by skipping plotting labels. I think the solution is to put if else statements into Timeout and just put a note that it doesn't work on Windows. I'll create a PR.
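
The effect of the decorator order can be reproduced with stand-ins. In this sketch, `try_except`, `timeout_needing_sigalrm`, and `plot_labels_stub` are all hypothetical; the inner decorator simply raises the `AttributeError` reported earlier in the thread to simulate a platform without `signal.SIGALRM`.

```python
import functools


def try_except(func):
    """Outer decorator: swallow any exception and report it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            print(f"skipped {func.__name__}: {e}")
            return None
    return wrapper


def timeout_needing_sigalrm(func):
    """Inner decorator: stands in for a timeout that fails before running func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Simulated failure; the wrapped function is never reached.
        raise AttributeError("module 'signal' has no attribute 'SIGALRM'")
    return wrapper


calls = []


@try_except
@timeout_needing_sigalrm
def plot_labels_stub():
    calls.append("plotted")
    return "ok"


result = plot_labels_stub()
print(result, calls)
```

Because the outer wrapper catches the error raised by the inner one, the call returns normally while the function body never runs, which matches the "works by skipping plotting labels" behavior described above. Guarding the timeout with a platform check, as proposed, would let the body run on Windows without the alarm.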

yeshanliu commented 2 years ago

That will be so good! And thanks for applying.

On March 16, 2022, at 21:20, Glenn Jocher @.***> wrote:

> @yeshanliu I investigated some more, it looks like the Windows CI tests are passing because the Try Except decorator is outside the Timeout decorator and is catching the SIGALRM error. So the good news is it works on Windows if you are using current code, the bad news is it works by skipping plotting labels. I think the solution is to put if else statements into Timeout and just put a note that it doesn't work on Windows. I'll create a PR.


glenn-jocher commented 2 years ago

@KristofferK @MrinalJain17 @WJos @yeshanliu good news 😃! Your original issue may now be fixed ✅ in PR #7013. This PR disables Timeout using SIGALRM on Windows. To receive this update:

Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
PyTorch Hub – Force-reload model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
Notebooks – View updated notebooks
Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

serifdogruu commented 2 years ago

I removed --cache and the problem was solved.

qqqhhh-any commented 1 year ago

I ran into the same problem; it seems that TryExcept does not work. I am sure that all third-party packages are installed correctly, but it still terminates after "Plotting labels". My OS is Ubuntu 18.04.

yeshanliu commented 1 year ago

> I ran into the same problem; it seems that TryExcept does not work. I am sure that all third-party packages are installed correctly, but it still terminates after "Plotting labels". My OS is Ubuntu 18.04.

This issue is about "Plotting labels" terminating on the Windows platform. Releases 7.0 and 6.2 work well on my Ubuntu platform, so I suggest you check your release version and environment.

glenn-jocher commented 12 months ago

hi @yeshanliu I'd recommend checking if your release versions are updated and if your environment meets all the necessary requirements. Make sure to use the latest release of the YOLOv5 repository and a complete installation of all required packages. You can refer to the installation instructions in the Ultralytics YOLOv5 documentation for a complete guide on setting up YOLOv5 on your Ubuntu 18.04 system. If the issue still persists, feel free to provide more details about your setup, and we can further investigate the problem.
