ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation when running on Docker #1552

Closed: NanoCode012 closed this issue 3 years ago

NanoCode012 commented 3 years ago

🐛 Bug

I got the error message below when trying to test the latest commit cff9263490fbf4b80dcc2d87914e087e6c07b6a0 on a new Docker image. I haven't pulled recently, so I'm not sure which commit introduced this error.

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

To Reproduce (REQUIRED)

  1. Pull the Docker image and run it
  2. Run python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --nosave --cache

Output:

 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1248768  models.common.BottleneckCSP             [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Traceback (most recent call last):
  File "train.py", line 492, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 83, in train
    model = Model(opt.cfg or ckpt['model'].yaml, ch=3, nc=nc).to(device)  # create
  File "/usr/src/app/models/yolo.py", line 95, in __init__
    self._initialize_biases()  # only run once
  File "/usr/src/app/models/yolo.py", line 150, in _initialize_biases
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

Expected behavior

Run normally

Environment

Additional context

It seems to run fine from an old conda Python 3.7 environment with torch 1.6. I cannot reproduce this error on Google Colab. Could there be something wrong with the Docker dependencies?
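
For reference, the failure reduces to a few lines outside of YOLOv5. This is only a toy sketch that mirrors the pattern in models/yolo.py (the Parameter shape and stride below are placeholders, not the real Detect values):

import math
import torch

bias = torch.nn.Parameter(torch.zeros(255))   # stand-in for a Detect conv bias (leaf, requires grad)
b = bias.view(3, -1)                          # view of the leaf Parameter, as in _initialize_biases()
try:
    b[:, 4] += math.log(8 / (640 / 32) ** 2)  # in-place op on a view of a leaf
except RuntimeError as e:
    print(e)  # on affected PyTorch builds: 'a view of a leaf Variable that requires grad ...'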

glenn-jocher commented 3 years ago

@NanoCode012 thanks for the bug report. I'll try to reproduce with yolov5:latest on a GCP instance.

I've seen this error in the past when running in-place ops (like L150 in your error message) with autograd enabled, but that line hasn't changed in a long time. PyTorch versions are changing though, so perhaps this is handled differently now.

glenn-jocher commented 3 years ago

Yeah, I get the same result. I think the issue is that NVIDIA seems to prefer PyTorch nightly for their FROM images rather than the latest stable release, so I can't tell whether this is a nightly instability or a 1.8 change that will start causing errors here in the future.

If I pull latest and then run this line, everything trains fine.

pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

I guess for now I'll simply reset the image FROM tag to 20.10, which I think was working well.

glenn-jocher commented 3 years ago

Or wait, I just had a great idea! If I start from a different base image, such as pytorch/pytorch:latest, that seems to point to the latest stable release, and it may also eliminate maintenance since the tag never changes. I will try an experiment and see if it works.

FROM nvcr.io/nvidia/pytorch:20.11-py3
FROM pytorch/pytorch:latest
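
Roughly, the experiment is just to rebuild and smoke-test the image locally with something like the following (the image tag is a placeholder, not an official one):

docker build -t yolov5:base-test .
docker run --gpus all --ipc=host yolov5:base-test \
    python train.py --img 640 --batch 16 --epochs 1 --data coco128.yaml --weights yolov5s.pt --nosave --cache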
glenn-jocher commented 3 years ago

I tried to create a pytorch:latest image with this Dockerfile, but the image lacks some dependencies such as cv2, which causes problems during pip install, so I gave up on it. The Dockerfile is below in case anyone can debug this. In the meantime I think a rollback to 20.10 will fix this; I'll get that done.

docker pull ultralytics/yolov5:pytorch_latest

FROM pytorch/pytorch:latest

# Install dependencies
RUN pip install --upgrade pip
# COPY requirements.txt .
# RUN pip install -r requirements.txt
RUN pip install gsutil

# Create working directory
RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app

# Copy contents
COPY . /usr/src/app
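
If anyone wants to keep debugging the pytorch/pytorch:latest route: the cv2 problems in slim images are usually missing system libraries rather than pip packages. A guess at the extra Dockerfile lines (assumes a Debian/Ubuntu base, untested here):

# Hypothetical addition: system libs that opencv-python typically needs in slim images
RUN apt-get update && apt-get install -y --no-install-recommends libgl1-mesa-glx libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*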
glenn-jocher commented 3 years ago

Verified that the new image works; the problem should be resolved now in PR #1553.

NanoCode012 commented 3 years ago

Thanks glenn! I will wait for the image to build on Docker Hub and then test it!

Regarding pytorch:latest, I think it could be dangerous to use in the Dockerfile, because if there is a breaking change you may not know until someone reports it.

Edit: This also means that this repo will not be able to use later versions of NVIDIA's image until this bug is fixed somehow.
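
A possible middle ground would be pinning an explicit stable tag rather than latest, so rebuilds stay reproducible and upgrades are deliberate. The tag below just illustrates the naming scheme; I have not tested it:

# Hypothetical pinned base instead of a floating tag
FROM pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel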

glenn-jocher commented 3 years ago

@NanoCode012 yes, that's true. The Docker images don't actually have any CI tests; they just build on every commit, under the assumption that the GitHub CI tests mostly apply to Docker as well, even though the two often use different PyTorch versions. GitHub also updates their dependencies on their own schedule, so when torch 1.6 came out, for example, the daily CI test started failing the next day.
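
A minimal per-build sanity check for the Docker image could look something like this (tag and steps are illustrative; this is not an existing workflow):

# Verify that torch and cv2 import and that CUDA is visible inside the freshly built image
docker run --gpus all --ipc=host ultralytics/yolov5:latest \
    python -c "import torch, cv2; print(torch.__version__, torch.cuda.is_available())"

Even a check this small would likely have caught the base-image regression above before users hit it.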

cesarandreslopez commented 3 years ago

@glenn-jocher correct me if I am wrong, but both nvcr.io/nvidia/pytorch:20.11-py3 and nvcr.io/nvidia/pytorch:20.10-py3 seem to use Python 3.6.

This project requires 3.8 or above.

Will this be a problem?

glenn-jocher commented 3 years ago

@cesarandreslopez yes, I noticed that as well. I'm not sure whether 3.6.0 is compatible with this repo; the last version I checked was 3.6.9. I do all development in 3.8.0, but in general backwards compatibility is something I don't have much time to maintain and verify, which is why I've simply listed 3.8 as the requirement.

But as you're seeing, 3.7 appears compatible, and possibly much of 3.6 as well.
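
For containers where the interpreter version is in doubt, a small guard at startup makes the requirement explicit instead of failing later with a cryptic error. This is a generic sketch, not code currently in the repo:

import sys

# Fail fast if Python is older than the documented requirement (3.8 here; relax at your own risk)
assert sys.version_info >= (3, 8), 'Python 3.8 or later required, found %s' % sys.version.split()[0]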

MingcongCao commented 3 years ago

Hi guys, @NanoCode012 @glenn-jocher. The following code works for me:

with torch.no_grad():
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
    b[:, 5:] += math.log(0.6 / (m.nc - 0.99)) if cf is None else torch.log(cf / cf.sum())  # cls

glenn-jocher commented 3 years ago

@MingcongCao ah, I've resolved the original issue by resetting the base image to Nvidia 20.10, so all docker operations should be operating correctly now.

hcodee commented 3 years ago

I have hit this issue with an RTX 3090 and CUDA 11.1.0. Is there any solution for this configuration?

python train.py --batch-size 64 --data ./data/coco128.yaml --cfg ./models/yolov5s.yaml --weights ''

Using torch 1.8.0.dev20201117 CUDA:0 (GeForce RTX 3090, 24265MB)

Traceback (most recent call last):
  File "train.py", line 492, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 91, in train
    model = Model(opt.cfg, ch=3, nc=nc).to(device)  # create
  File "/home/yons/work/yolov5/models/yolo.py", line 95, in __init__
    self._initialize_biases()  # only run once
  File "/home/yons/work/yolov5/models/yolo.py", line 150, in _initialize_biases
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

NanoCode012 commented 3 years ago

@hcodee , could you try stable torch 1.7?

hcodee commented 3 years ago

Torch 1.7 does not work with the RTX 3090. It took a long time to figure out that it needs to run on the nightly Torch 1.8 build.

batrlatom commented 3 years ago

Torch 1.7 does not work with the RTX 3090. It took a long time to figure out that it needs to run on the nightly Torch 1.8 build.

You need to compile PyTorch yourself with CUDA 11.1 installed. It is doable; I did it (surprisingly) without any hassle from master. Unfortunately I would need to do it again for 1.7.
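
Before compiling, it may be worth checking whether an installed build already supports the card. A rough sketch (torch.cuda.get_arch_list() may be missing from older builds, hence the guard):

import torch

print('torch', torch.__version__, 'built with CUDA', torch.version.cuda)
if torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)   # (8, 6) for an RTX 3090
    print('device capability:', cap)
    if hasattr(torch.cuda, 'get_arch_list'):    # not available in some older releases
        archs = torch.cuda.get_arch_list()      # e.g. ['sm_37', ..., 'sm_86']
        print('binary built for:', archs)
        print('sm_%d%d included:' % cap, 'sm_%d%d' % cap in archs)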

hcodee commented 3 years ago

@batrlatom Cool, thanks for the reminder. I will try it out.

dnth commented 3 years ago

I have hit this issue with an RTX 3090 and CUDA 11.1.0. Is there any solution for this configuration?

python train.py --batch-size 64 --data ./data/coco128.yaml --cfg ./models/yolov5s.yaml --weights ''

Using torch 1.8.0.dev20201117 CUDA:0 (GeForce RTX 3090, 24265MB)

RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

Same problem with the nightly PyTorch version here. Any luck with the self-compiled PyTorch 1.8?

DoctorKey commented 3 years ago

I met the same issue with pytorch 1.8, and the following code works for me:

b.data[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
b.data[:, 5:] += math.log(0.6 / (m.nc - 0.99)) if cf is None else torch.log(cf / cf.sum())  # cls
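
As a side-by-side toy sketch (stand-in tensor and class count, not the actual Detect module), both the .data write above and the torch.no_grad() wrapper suggested earlier in the thread avoid the error, because neither lets autograd record the in-place update:

import math
import torch

bias = torch.nn.Parameter(torch.zeros(255))   # stand-in for a Detect conv bias
b = bias.view(3, -1)                          # view of a leaf Parameter

# Option 1: write through .data, bypassing autograd (the fix shown above)
b.data[:, 4] += math.log(8 / (640 / 32) ** 2)

# Option 2: do the in-place update under no_grad, as suggested earlier in this thread
with torch.no_grad():
    b[:, 5:] += math.log(0.6 / (80 - 0.99))   # assumes 80 classes for illustration

print(bias.detach()[:6])  # updated bias values, no RuntimeError raised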
glenn-jocher commented 3 years ago

I just ran into this issue myself, so it's time for a fix :) I'll add a TODO and prioritize this ASAP.

glenn-jocher commented 3 years ago

@DoctorKey I can confirm your solution works correctly. I will submit a PR for this to master.

glenn-jocher commented 3 years ago

@NanoCode012 @DoctorKey @batrlatom @hcodee this problem should be resolved now by implementing @DoctorKey's fix in PR #1759. The Docker image for ultralytics/yolov5:latest should be updated with this fix in a few minutes.

Let me know if any other issues pop up, and thank you for your contributions!

Nytsirch commented 3 years ago

Hi, I am new to this and I just encountered a runtime problem:

Traceback (most recent call last):
  File "train.py", line 492, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 91, in train
    model = Model(opt.cfg, ch=3, nc=nc).to(device)  # create
  File "/content/yolov5/models/yolo.py", line 95, in __init__
    self._initialize_biases()  # only run once
  File "/content/yolov5/models/yolo.py", line 150, in _initialize_biases
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

I am using torch 1.8.0+cu101.

I really don't know what to do. Any help please?

glenn-jocher commented 3 years ago

@Nytsirch this error is likely generated by an unsupported 3rd-party notebook. Please see the official YOLOv5 Colab Notebook at https://colab.research.google.com/github/ultralytics/yolov5/blob/master/tutorial.ipynb, and visit the Train Custom Data tutorial to get started with YOLOv5.

Tutorials

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu.
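
Roughly, those CI steps can be reproduced locally with something like the following (flags and script paths are illustrative and may differ between repo versions):

python train.py --img 320 --batch 8 --epochs 1 --data coco128.yaml --weights yolov5s.pt --nosave --cache
python test.py --weights yolov5s.pt --data coco128.yaml --img 320
python detect.py --weights yolov5s.pt
python export.py --weights yolov5s.pt --img 320 --batch 1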

NingAnMe commented 3 years ago

Hi, I am new to this and I just encountered a runtime problem ending in: RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

I am using torch 1.8.0+cu101.

I really don't know what to do. Any help please?

Change these two old lines in models/yolo.py:

b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
b[:, 5:] += math.log(0.6 / (m.nc - 0.99)) if cf is None else torch.log(cf / cf.sum())  # cls

to the new ones:

b.data[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
b.data[:, 5:] += math.log(0.6 / (m.nc - 0.99)) if cf is None else torch.log(cf / cf.sum())  # cls