👋 Hello @cubrink, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.
If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.
For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.
Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:
$ pip install -r requirements.txt
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.
I did some further testing by checking out different commits. The first commit that produced the bug is this one.
@cubrink thanks for the bug report. Conda environments can sometimes cause problems; I would recommend you try one of our verified environments above, including the Docker image.
A few comments about your example:
@glenn-jocher, thank you for the advice, I'm aware of the auto-downloading features but the firewall on the machine in question blocks the download.
I've since run the script in the Docker container using DDP and have run into similar issues.
Edit: I can again confirm that the first commit to break the training script is this one. This was the same commit that broke train.py in my initial bug report.
It looks like the line that breaks train.py is:
if tb_writer:
    tb_writer.add_graph(torch.jit.trace(model, imgs, strict=False), [])  # add model graph
on line 333 of train.py. I was able to fix train.py by commenting out the previously mentioned lines.
Within Docker, using DDP: the training hangs indefinitely shortly after starting.
Within Docker, without DDP: the original error occurs (the tensor is printed, then the script exits).
Basic setup
# Volume added because autodownload is blocked by firewall settings.
# On a different machine the volume would not be needed.
# The data is from official sources, so it should work without issue.
sudo docker run --rm --ipc=host --gpus all -it -v <my_local_path>:/yolov5_resources ultralytics/yolov5:latest
# (Due to personal firewall settings, copy weights into ./weights and dataset into ../coco128)
(At the time of writing, the commit from ultralytics/yolov5:latest is 61ea23c.)
With DDP:
python -m torch.distributed.launch \
--nproc_per_node 4 \
train.py \
--weights weights/yolov5m.pt \
--cfg models/yolov5m.yaml \
--data data/coco128.yaml \
--device 0,1,2,3
Result: Training hangs after 2 batches. I've let this sit for about 10 minutes without progress.
(Note the warning that is raised before hanging: UserWarning: The input to trace is already a ScriptModule, tracing it is a no-op. Returning the object as is.)
Without DDP:
python train.py \
--weights weights/yolov5m.pt \
--cfg models/yolov5m.yaml \
--data data/coco128.yaml \
--device 0,1,2,3
Result: A tensor is printed by train.py, then the script exits. (This is the same bug that was originally reported.) The expected behavior was regular training on the coco128 dataset.
I have also encountered this error when training with my own dataset. Just to add some information about the problem, the stack trace printed before the full tensor is:
Traceback (most recent call last):
File "train.py", line 591, in <module>
train(hyp, opt, device, tb_writer)
File "train.py", line 374, in train
tb_writer.add_graph(torch.jit.trace(model, imgs, strict=False), []) # add model graph
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/jit/_trace.py", line 733, in trace
return trace_module(
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/jit/_trace.py", line 934, in trace_module
module._c._create_method_from_trace(
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/modules/module.py", line 725, in _call_impl
result = self._slow_forward(*input, **kwargs)
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/modules/module.py", line 709, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
return self.gather(outputs, self.output_device)
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient
Tensor:
[...]
Additionally, as @cubrink mentioned, with the line commented out the script runs correctly.
Thanks guys. This must be related to a recent PR https://github.com/ultralytics/yolov5/pull/3236 that re-added the TensorBoard graph for interactive model architecture viewing in TensorBoard. It seems to be failing in some environments as in your examples above though. I'll take a look at this today. Worst case scenario we can drop it into a try: except statement.
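For illustration, a minimal sketch of what such a try/except guard might look like (not an actual patch; tb_writer, model, and imgs follow the snippet quoted earlier in the thread and are assumed to come from train.py's surrounding scope):

```python
# Hedged sketch: wrap the TensorBoard graph export so a tracing failure
# cannot abort training. tb_writer, model and imgs are assumed to exist
# in train.py's surrounding scope, as in the snippet quoted above.
if tb_writer:
    try:
        tb_writer.add_graph(torch.jit.trace(model, imgs, strict=False), [])  # add model graph
    except Exception as e:
        print(f'TensorBoard graph visualization failure: {e}')  # warn and keep training
```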
@adrigrillo your error in particular seems to imply that a model with gradients might be causing the issue, or perhaps the nn/parallel lines imply that only DP or DDP models in particular are causing the error?
Now that you mention it, the problem could be caused by the use of DP mode, which I did not actually intend to use. I forgot to specify the GPU, as I wanted single-GPU training, and I guess that, by default, if two GPUs are available and no GPU flag is specified, training will use both in DP mode.
In any case, commenting out the line makes it work in DP mode, and with one GPU it also works. Therefore, the problem may be related to multi-GPU training rather than to the gradient message.
@adrigrillo ok thanks, that makes sense then. It's likely that the graph/script functions were just never intended for use with DP/DDP. I think we can use the same if is_parallel() statement used elsewhere in these cases:
https://github.com/ultralytics/yolov5/blob/407dc5008e47b1aad5ce69f0c91b4f1ec321dd7f/train.py#L393
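A rough sketch of that idea (illustrative only, not the exact patch that later landed; it assumes the is_parallel() helper lives in utils.torch_utils and that tb_writer, model, and imgs come from train.py's context):

```python
# Hedged sketch: unwrap a DataParallel/DistributedDataParallel model before
# tracing it for TensorBoard. is_parallel() is assumed to be the helper
# referenced in the linked train.py line.
from utils.torch_utils import is_parallel

if tb_writer:
    graph_model = model.module if is_parallel(model) else model  # de-parallelize if wrapped
    tb_writer.add_graph(torch.jit.trace(graph_model, imgs, strict=False), [])  # add model graph
```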
@adrigrillo @cubrink good news 😃! Your original issue may now be fixed ✅ in PR #3325. This PR de-parallelizes the model before passing it to the TensorBoard add_graph() function, which should resolve the original issue if it was only observed in multi-GPU trainings. There is currently a UserWarning when the graph is saved, but this is expected and should not cause any problems (UserWarning: The input to trace is already a ScriptModule, tracing it is a no-op. Returning the object as is.). To receive this update:

- Git: run git pull from within your yolov5/ directory, or git clone https://github.com/ultralytics/yolov5 again
- PyTorch Hub: model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
- Docker: sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
@glenn-jocher Thanks for your great work, but I found that the problem mentioned in https://github.com/ultralytics/yolov5/issues/3284#issuecomment-847270860 still exists. When I use Docker to run the code with DDP, the training hangs indefinitely.
@shufanwu hi thanks for the feedback. Can you confirm you are seeing this error in the latest Docker image? You can pull the latest image using the command below.
sudo docker pull ultralytics/yolov5:latest
to update your image.

> @glenn-jocher Thanks for your great work, but I found that the problem mentioned in #3284 (comment) still exists. When I use Docker to run the code with DDP, the training hangs indefinitely.

https://github.com/ultralytics/yolov5/pull/3325 solved for me
> @glenn-jocher Thanks for your great work, but I found that the problem mentioned in #3284 (comment) still exists. When I use Docker to run the code with DDP, the training hangs indefinitely.

#3325 solved for me. In addition, the develop (default) branch still has this issue, whereas the master branch was fixed.
@kanybekasanbekov @SkalskiP yes, as you noticed we have a new develop branch. The idea is to adopt a workflow closer to best practices, where we mostly update the develop branch and periodically merge to master on a new patch release:

master < develop < feature

Though bug fixes will take a different route:

master < fix
🐛 Bug
When using train.py, a tensor is printed to screen and then the script ends. No training occurs.

To Reproduce (REQUIRED)
Input:
Output: The training script starts normally and sits idle briefly. Then a tensor is printed to screen and the script ends.
After the script has stopped, runs/train/yolov5-bug is only partially filled. I'm unsure if this is relevant. runs/train/yolov5-bug contents: (.tfevents filename partially redacted for privacy reasons)

Expected behavior
Regular training on the coco128 test dataset.
Environment
Additional context
I've used older versions of YOLOv5 before without issue. I recently decided to update, and that is when I ran into this issue. I will not be able to access this machine again until Monday.