ClearML logging does not work with multi-GPU training

nameCDI commented 1 year ago

Search before asking

[X] I have searched the YOLOv5 issues and discussions and found no similar questions.

Question

ClearML works well with python segment/python.py --weights ..., but it doesn’t show up at all with python -m torch.distributed.run --nproc_per_node 2 segment/train.py --weights --device 0, 1 ...

Is it normal for Comet or ClearML info not to be displayed with this script?

Additional

No response

github-actions[bot] commented 1 year ago

👋 Hello @nameCDI, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

glenn-jocher commented 1 year ago

Hi @NameCDI! It should still be possible to use Comet (or ClearML) with train.py when run with torch.distributed.run. Instead of calling comet_ml.Experiment(), you will need to start the experiment with comet_ml.init() first. Then, you can pass your Comet API key and the experiment name to train.py using the --name command line argument in the same way as --weights. For example:

python -m torch.distributed.run --nproc_per_node 2 segment/train.py --weights --name YOUR_EXPERIMENT_NAME --device 0,1 ...

Please let me know if this works for you.

nameCDI commented 1 year ago

@glenn-jocher

python train.py --weights yolov5n.pt --device 0 --project t2 --name t2 --epochs 3

python segment/train.py --weights yolov5n-seg.pt --project t1 --name t1 --device 0 --epochs 3

I created a new virtual environment, installed everything, and ran both commands without modifying any code. 'train.py' works well with Comet and ClearML, but with 'segment/train.py' neither logger shows up. It's really strange. I don't know why.

nameCDI commented 1 year ago

@glenn-jocher

[Screenshot: 2023-03-31 065222]

And it’s the same in my other environment: train.py works well and segment/train.py doesn’t. I also checked in my app.clear.ml.

glenn-jocher commented 1 year ago

@NameCDI Based on the error message you shared, it appears that the comet_ml package is missing or not installed in your environment. To resolve this error, you should try running pip install comet_ml in your terminal to install Comet. Alternatively, if you are using ClearML, try running pip install clearml to install the required package.

If you have already installed comet_ml or clearml, try running pip show comet_ml or pip show clearml to check the installation path and to make sure it has really been installed.
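The same check can be run from inside the Python interpreter that launches training, which rules out a mismatch between the shell's pip and the active environment. This is a minimal sketch, assuming Python 3.8 or newer (for importlib.metadata); the helper name installed_version is hypothetical and not part of YOLOv5, ClearML, or Comet.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names as used with pip in this thread.
    for name in ("clearml", "comet_ml"):
        version = installed_version(name)
        print(f"{name}: {version if version else 'not installed'}")
```

If a package prints as "not installed" here but pip show finds it, the two commands are probably using different environments.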

Let me know if this resolves your issue.

nameCDI commented 1 year ago

@glenn-jocher Actually, while debugging this strange behavior, I added the following to segment/train.py:

from clearml import Task
task = Task.init(project_name="my project", task_name="my task")

With that, a ClearML script was created and the project was created. But two experiments were created in my app.clear.ml, and one of them ended up in a zombie state.

In the screenshot from my environment with the black background (not the one with the white background), ClearML logging seems to work normally, so the problem seems to be in the segment/train.py file.

Of course, I will try what you suggested as soon as possible.

glenn-jocher commented 1 year ago

@NameCDI It's possible that the creation of two experiments in ClearML is due to the Task.init() method being called twice, which could cause multiple experiments to be created. Regarding the black background in your environment, that could be due to your terminal emulator or a configuration issue.
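In a multi-GPU launch, torch.distributed.run starts one process per GPU and sets the RANK environment variable in each, so any initialization placed at the top of the script runs once per process. One way to keep that to a single call is to guard it by rank. The sketch below is an illustration under assumptions, not the YOLOv5 implementation: is_main_process, init_once, and start_clearml_task are hypothetical helper names, and the project and task names are the placeholders used in this thread.

```python
import os
from typing import Callable, Mapping, Optional, TypeVar

T = TypeVar("T")


def is_main_process(env: Mapping[str, str] = os.environ) -> bool:
    """True for single-process runs (RANK unset) and for rank 0 of a distributed run."""
    # torch.distributed.run sets RANK for every worker; plain `python train.py` leaves it unset.
    return int(env.get("RANK", -1)) in (-1, 0)


def init_once(init_fn: Callable[[], T], env: Mapping[str, str] = os.environ) -> Optional[T]:
    """Call `init_fn` only on the main process; every other rank gets None."""
    return init_fn() if is_main_process(env) else None


def start_clearml_task():
    # Assumption: clearml is installed and configured (clearml-init already run).
    # The import is deferred so that non-main ranks never load the package.
    from clearml import Task

    return Task.init(project_name="my project", task_name="my task")
```

At the top of the training script this would be used as task = init_once(start_clearml_task); on ranks other than 0 the result is None, so any later use of task needs a None check.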

In any case, please try installing Comet or ClearML via pip as I suggested and make sure the package is installed correctly. Then, follow the steps I outlined previously in order to integrate the package into segment/train.py. If you continue to experience issues, please feel free to share more information or code snippets so I can assist you further.

nameCDI commented 1 year ago

@glenn-jocher

pip show clearml

Name: clearml
Version: 1.9.0
Summary: ClearML - Auto-Magical Experiment Manager, Version Control, and MLOps for AI
Home-page: https://github.com/allegroai/clearml
Author: ClearML
Author-email: support@clear.ml
License: Apache License 2.0
Location: /home/mgt/anaconda3/envs/cdi_v5/lib/python3.10/site-packages
Requires: attrs, furl, jsonschema, numpy, pathlib2, Pillow, psutil, pyjwt, pyparsing, python-dateutil, PyYAML, requests, six, urllib3
Required-by:

clearml package check Complete

clearml-init

ClearML SDK setup process
Configuration file already exists: /home/mgt/clearml.conf
Leaving setup, feel free to edit the configuration file.

clearml-init && api register check Complete

My YOLO project folder path: '/home/mgt/YOLOV5'

I have checked everything you mentioned. 'python train.py' works fine with ClearML, and the script appears. However, when I run 'python segment/train.py', neither ClearML's nor Comet's script is visible, and the script doesn't run either. Can you tell me where exactly ClearML's logger is called? I'm trying to add the code suggested by ClearML directly:

from clearml import Task
task = Task.init(project_name="my project", task_name="my task")

github-actions[bot] commented 1 year ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Docs: https://docs.ultralytics.com
HUB: https://hub.ultralytics.com
Community: https://community.ultralytics.com

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

glenn-jocher commented 10 months ago

@nameCDI The Task.init() call should be added at the beginning of your script, before any other operations, so that the ClearML task is initialized first and metrics are logged throughout the run. If segment/train.py still does not log to ClearML after adding these lines, something else in that file may be preventing ClearML logging from working. I would recommend reviewing the ClearML integration guidelines and checking that the integration is set up correctly within segment/train.py.

Let me know if adding the Task.init() call resolves the issue, or if you encounter any further difficulties.
