Unable to get loss value and metrics in results.csv file

ultralytics / yolov5

Unable to get loss value and metrics in results.csv file #11292

Closed: tasawor closed this issue 1 year ago

tasawor commented 1 year ago

Search before asking

[X] I have searched the YOLOv5 issues and discussions and found no similar questions.

Question

Hi, after I run the training command, the model starts training, but the results.csv file does not show values for loss and metrics. Could you please help? I have attached a screenshot below.

Additional

No response

github-actions[bot] commented 1 year ago

👋 Hello @tasawor, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

Introducing YOLOv8 🚀

We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!

Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.

Check out our YOLOv8 Docs for details and get started with:

pip install ultralytics

glenn-jocher commented 1 year ago

@tasawor hi, from the screenshot you attached, it looks like the results.csv file is not being generated during training. This could be due to an error in your YOLOv5 code or configuration.

You can try checking for any errors or warnings that arise during training by looking at the output in your command prompt or terminal. Additionally, you can add the --verbose flag to your training command to display more detailed information during training.

If you are still having trouble with generating the results.csv file, please provide more information about your YOLOv5 installation, relevant configuration files, and the command you used to train your model so we can better assist you.

tasawor commented 1 year ago

@glenn-jocher While installing the requirements.txt file, I get the warnings shown below. Could that be the reason, and how can I solve it?

glenn-jocher commented 1 year ago

@tasawor, the warnings you are seeing during the installation of the requirements.txt file are related to the installation of PyTorch, which is a dependency of YOLOv5.

These warnings are related to the version of CUDA on your system and are generally safe to ignore. However, if you want to make sure that PyTorch was installed correctly, you can try running a simple PyTorch script, such as:

import torch

print(torch.__version__)  # installed PyTorch version
print(torch.cuda.is_available())  # True if PyTorch can use a CUDA GPU

If this script runs without error and prints a PyTorch version and a True value for torch.cuda.is_available(), then PyTorch was installed correctly and you can proceed with installing YOLOv5.

However, if you encounter any issues building or running YOLOv5 after installing the dependencies, please make sure to check that your installation meets the minimum requirements for YOLOv5 and consult the YOLOv5 documentation for help with any issues.

tasawor commented 1 year ago

@glenn-jocher I am following the installation steps shown on the Ultralytics site. Please find the screenshots of the commands, YAML file and log file below. The log file shows "3.4GB RAM required, 3.3/15.8GB available, not caching images". Do you think that might be causing the issue? I also tried separating the train and val folders, but it still shows the same result.

glenn-jocher commented 1 year ago

@tasawor, based on the logs you provided it seems like your training is running without any errors, and that the issue might be related to the size of your training data and available memory.

The message "3.4GB RAM required, 3.3/15.8GB available, not caching images" suggests that your system does not have enough available memory to cache all of your training images, which can slow down your training and may have an impact on the accuracy of your model.
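
A rough sketch of the kind of check behind that log line, assuming the data loader simply compares the estimated cache size (plus a safety margin) with the free RAM. The function name and the 10% margin are assumptions for illustration, not the repository's exact code. A negative result only means images are read from disk on every epoch; on its own it should not turn loss values into nan.

```python
def can_cache_images(required_gb: float, available_gb: float, safety_margin: float = 0.1) -> bool:
    """Return True if the estimated image cache fits in free RAM with a safety margin.

    The 10% default margin is an assumption made for this sketch.
    """
    return required_gb * (1 + safety_margin) < available_gb


# Values taken from the log line quoted in this thread.
required_gb, available_gb, total_gb = 3.4, 3.3, 15.8
decision = can_cache_images(required_gb, available_gb)
print(
    f"{required_gb:.1f}GB RAM required, {available_gb:.1f}/{total_gb:.1f}GB available, "
    f"{'caching images' if decision else 'not caching images'}"
)
```

With the numbers from the log, the check fails because 3.4 GB plus the margin exceeds the 3.3 GB that is free, so training continues without the cache.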

One potential solution is to reduce the batch size for your training, which will reduce the amount of memory required to process each batch. You can try lowering the batch-size parameter in your train.yaml configuration file to see if that helps.

Alternatively, you can try running your training on a system with more available memory if possible.

If you continue to experience issues, please provide more information about your system and training data, and any errors or warnings you encounter during training so we can better assist you.

tasawor commented 1 year ago

@glenn-jocher I don't see any train.yaml configuration file. Can you help me locate it? Also, I tried reducing the batch size, but it still shows nan.

glenn-jocher commented 1 year ago

@tasawor, the train.yaml configuration file is mentioned in the screenshot you provided earlier. It can be found in the data folder of the YOLOv5 repository.

Regarding the nan loss value, this typically indicates numerical overflow or underflow during training, which can occur if the optimizer takes steps that are too large when updating the model weights.
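
To see at which epoch the values first become nan, a small sketch like the following can scan a results.csv. It assumes the file has an epoch column and loss columns whose names contain "loss" (for example train/box_loss), possibly padded with spaces; the sample text is made up for illustration.

```python
import csv
import io
import math


def first_nan_epoch(csv_text):
    """Return (epoch, column) for the first non-finite loss value, or None."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [name.strip() for name in next(reader)]  # header names may be space-padded
    for row in reader:
        values = dict(zip(header, (cell.strip() for cell in row)))
        for column, value in values.items():
            if "loss" in column and not math.isfinite(float(value)):
                return int(float(values["epoch"])), column
    return None


# Made-up sample in the assumed layout; replace with open("results.csv").read().
sample = """epoch, train/box_loss, train/obj_loss, train/cls_loss
0, 0.101, 0.032, 0.021
1, nan, nan, nan
"""
print(first_nan_epoch(sample))
```

If the very first epoch is already nan, the dataset or its labels are a more likely cause than a slowly diverging learning rate.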

One possible cause of this issue is a very high learning rate, which can cause the optimization algorithm to take unstable steps in updating the model weights. You can try reducing the learning rate by adjusting the lr parameter in your train.yaml configuration file, and see if that resolves the issue.
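
For reference, in a stock checkout of the repository the batch size and the hyperparameter file are normally passed to train.py as command-line arguments (--batch-size and --hyp), and the initial learning rate is the lr0 entry in the hyperparameter YAML. The sketch below only assembles such a command and does not run it. The file names data/custom.yaml and data/hyps/hyp.low-lr.yaml are hypothetical; the second stands for a copy of a bundled hyperparameter file with a smaller lr0.

```python
import sys


def build_train_command(data_yaml, weights="yolov5s.pt", batch_size=8, hyp_yaml=None, epochs=3):
    """Assemble an argument list for train.py; nothing is executed here."""
    cmd = [
        sys.executable, "train.py",
        "--data", data_yaml,
        "--weights", weights,
        "--batch-size", str(batch_size),
        "--epochs", str(epochs),
    ]
    if hyp_yaml is not None:
        cmd += ["--hyp", hyp_yaml]  # hyperparameter file that holds lr0
    return cmd


# Hypothetical paths, used only for illustration.
cmd = build_train_command("data/custom.yaml", batch_size=8, hyp_yaml="data/hyps/hyp.low-lr.yaml")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run from inside the yolov5 directory, or the printed line can be pasted into a terminal.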

It's also possible that the issue is related to other hyperparameters or the dataset itself, so it's important to check for any other issues or errors during training that could affect the accuracy of the model. You can try running your training with the --verbose flag or adjusting other hyperparameters to see if that helps identify the source of the issue.

tasawor commented 1 year ago

@glenn-jocher I found that if I train on the coco128 dataset, everything works fine and I get loss values and metrics in results.csv. But when I try it with my custom dataset, nothing shows up in results.csv. I am using thermal images as the dataset, and the dataset seems OK. Could you suggest anything?

glenn-jocher commented 1 year ago

@tasawor, when training a custom dataset, it's possible that there are issues with the annotations or data format that could cause YOLOv5 to behave differently compared to the COCO dataset.

One issue could be related to the number of classes in your dataset. By default, the YOLOv5 train.py script is configured to detect 80 classes for COCO, but if your dataset has a different number of classes, you will need to modify the nc parameter in your train.yaml configuration file accordingly.

Another issue could be related to the format of your annotations. YOLOv5 expects annotations in the YOLO format, which consists of one row per object, with each row containing the class index and the bounding box coordinates in normalized format.
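
A minimal sketch of a per-row check for detection labels, assuming each row is "class x_center y_center width height" with the four box values normalized to the range 0 to 1. The function name and messages are illustrative, and segmentation labels (which have more fields) are not covered.

```python
def check_label_line(line, num_classes):
    """Return a list of problems found in one YOLO-format detection label row."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        class_id = int(parts[0])
        x, y, w, h = (float(p) for p in parts[1:])
    except ValueError:
        return ["class index must be an integer and box values must be numbers"]
    problems = []
    if not 0 <= class_id < num_classes:
        problems.append(f"class index {class_id} outside 0..{num_classes - 1}")
    if not all(0.0 <= v <= 1.0 for v in (x, y, w, h)):  # also catches nan
        problems.append("box coordinates not normalized to [0, 1]")
    if w <= 0 or h <= 0:
        problems.append("box has zero or negative size")
    return problems


# Example rows for a 9-class dataset: valid, class index too large, pixel coordinates.
for row in ["0 0.5 0.5 0.2 0.3", "9 0.5 0.5 0.2 0.3", "1 320 240 50 80"]:
    print(row, "->", check_label_line(row, num_classes=9) or "ok")
```

Running a check like this over every .txt file in the labels folder is a quick way to rule out out-of-range class indices and unnormalized boxes.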

If checking these points does not help, you can try adding the --verbose flag to the training command to see if there are any errors or warnings that are specific to your dataset. Additionally, make sure that your training and validation datasets are correctly configured and that the paths to your images and annotations are correctly specified in your configuration file.

tasawor commented 1 year ago

@glenn-jocher Is it necessary to change the head of the YOLO model if the number of classes is less than 80 (let's say 9)? Maybe that is causing the problem in my case. If so, how do I make changes to the head?

My annotations are in YOLO format.

I am able to get losses and metrics when I train on the coco128 dataset, and everything works fine as long as I keep 80 classes in the coco128.yaml file. When I add one more class to it, it doesn't work. When I use my custom dataset, it shows nan for losses and metrics, as shown in the comments above.

I also want to ask whether data.yaml is the same as the model.yaml file. If not, is the model.yaml file yolov5s.yaml (if I use yolov5s)?

tasawor commented 1 year ago

@glenn-jocher could you please help?

github-actions[bot] commented 1 year ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Docs: https://docs.ultralytics.com
HUB: https://hub.ultralytics.com
Community: https://community.ultralytics.com

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

glenn-jocher commented 1 year ago

@tasawor, if you are working with a dataset that contains a smaller number of classes, you will need to make changes to the YOLO model's head to reflect the number of classes in your dataset.

To modify the head of the YOLO model, you would need to update the nc (number of classes) parameter in the yolov5s.yaml (or the relevant model config file depending on the model variant you are using) to match the number of classes in your dataset.

Additionally, you should ensure that your annotations are in the YOLO format, which consists of the class index and the bounding box coordinates in normalized format.

Regarding the relationship between data.yaml and model.yaml, these are separate configuration files. The data.yaml file contains the path to your training and validation datasets, along with other data-related settings, while the model.yaml file contains the model architecture and settings specific to the YOLOv5 model variant you are using.
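
As an illustration of the data side, here is a minimal sketch that renders a dataset config for a 9-class dataset. The key names (path, train, val, names) follow the layout of the bundled coco128.yaml as I understand it; older copies of the repository also use an explicit nc entry, so check which form your version expects. The dataset root and class names are placeholders. As far as I can tell, train.py takes the class count from this dataset file and overrides the model file's nc with it, but verify that in your copy of the code.

```python
def make_data_yaml(root, class_names, train="images/train", val="images/val"):
    """Render a minimal dataset config as YAML text (no YAML library needed)."""
    lines = [f"path: {root}", f"train: {train}", f"val: {val}", "names:"]
    # One entry per class; indices must match the class indices used in the label files.
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


# Placeholder root and class names for a hypothetical 9-class thermal dataset.
classes = [f"class_{i}" for i in range(9)]
print(make_data_yaml("../datasets/thermal", classes))
```

The number of entries under names should equal the number of distinct class indices in the labels, so a 9-class dataset uses indices 0 through 8.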

I hope this helps! Please let me know if you have any further questions or if there's anything else I can assist you with.