@NanoCode012 thanks for the bug report. I'll try to reproduce with yolov5:latest on a GCP instance.
I've seen this error in the past when running in-place ops like the one at line 150 in your error message with autograd enabled, but that line hasn't changed in a long time. PyTorch versions are changing though, so perhaps this is handled differently now.
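For context, a minimal sketch (not YOLOv5 code, just the same pattern as yolo.py line 150) of what trips this check: an in-place op on a view of a leaf tensor that requires grad.

import torch

bias = torch.zeros(3, 85, requires_grad=True)  # a leaf tensor, like an nn.Parameter
b = bias.view(3, -1)                           # a view of that leaf
b[:, 4] += 1.0  # raises the same RuntimeError on the newer torch builds discussed in this thread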
Yeah, I get the same result. I think the issue is that NVIDIA seems to prefer PyTorch nightly builds for their base images rather than the latest stable release, so I can't tell whether this is a nightly instability or a 1.8 change that will cause errors here in the future.
If I pull latest and then run this line, everything trains fine.
pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
I guess for now I'll simply reset the image FROM tag to 20.10, which I think was working well.
Or wait, I just had a great idea! If I start from a different base image, such as pytorch/pytorch:latest, it should point to the latest stable release, and it may also eliminate maintenance since the tag never changes. I will run an experiment and see if it works.
FROM nvcr.io/nvidia/pytorch:20.11-py3
FROM pytorch/pytorch:latest
I tried to create a pytorch:latest image here with this Dockerfile, but the image lacks some dependencies like cv2, which cause problems during pip install, so I gave up on it. The Dockerfile is here in case anyone can debug this. In the meantime I think a rollback to 20.10 will fix this; I'll get that done.
docker pull ultralytics/yolov5:pytorch_latest
FROM pytorch/pytorch:latest
# Install dependencies
RUN pip install --upgrade pip
# COPY requirements.txt .
# RUN pip install -r requirements.txt
RUN pip install gsutil
# Create working directory
RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app
# Copy contents
COPY . /usr/src/app
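In case anyone wants to dig into that pytorch_latest experiment, here is a hypothetical probe script (not part of the repo) that reports which of the repo's common imports resolve inside a candidate base image:

import importlib

# Hypothetical helper: check which common requirements are importable in this image.
for mod in ("cv2", "torch", "torchvision", "yaml", "scipy", "matplotlib"):
    try:
        importlib.import_module(mod)
        print(f"{mod}: ok")
    except ImportError as err:
        print(f"{mod}: missing ({err})")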
Verified new image works, problem should be resolved now in PR #1553
Thanks glenn! I will wait for the image to build on Docker Hub and then test it!
Regarding pytorch:latest, I think it could be dangerous to use it in the Dockerfile, because if there is a breaking change you may not know until someone reports it.
Edit: This would also mean that this repo will not be able to use later versions of NVIDIA's images until this bug is fixed somehow.
@NanoCode012 yes, that's true. The Docker images don't actually have any CI tests; they just build on every commit under the assumption that the GitHub CI tests mostly apply to Docker as well, but it's true that the two may often use different PyTorch versions. GitHub also updates their dependencies on their own schedule, so when 1.6 came out, for example, the daily CI test started failing the next day.
@glenn-jocher correct me if I am wrong, but both nvcr.io/nvidia/pytorch:20.11-py3 and nvcr.io/nvidia/pytorch:20.10-py3 seem to use Python 3.6.
This project requires 3.8 or above.
Will this be a problem?
@cesarandreslopez yes, I noticed that as well. I'm not sure if 3.6.0 is compatible with this repo; I think the last one I checked was 3.6.9. I do all development in 3.8.0, but in general backwards compatibility is something I don't have much time to maintain and verify, which is why I've simply put 3.8 down as the requirement.
But as you're seeing, 3.7 appears compatible, as well as possibly much of 3.6.
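If it helps, a minimal guard one could drop near the top of train.py (hypothetical, not currently in the repo) to fail fast on older interpreters instead of hitting obscure errors later:

import sys

# Hypothetical check: development targets Python 3.8, so stop early on older interpreters.
assert sys.version_info >= (3, 8), (
    f"YOLOv5 is developed on Python 3.8+, found {sys.version.split()[0]}"
)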
Hi, guys. @NanoCode012 @glenn-jocher The following code works for me:

with torch.no_grad():
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
    b[:, 5:] += math.log(0.6 / (m.nc - 0.99)) if cf is None else torch.log(cf / cf.sum())  # cls
@MingcongCao ah, I've resolved the original issue by resetting the base image to Nvidia 20.10, so all docker operations should be operating correctly now.
I have met this issue with RTX3090 & Cuda 11.1.0. Is there any solution for this configuration?
python train.py --batch-size 64 --data ./data/coco128.yaml --cfg ./models/yolov5s.yaml --weights ''
Using torch 1.8.0.dev20201117 CUDA:0 (GeForce RTX 3090, 24265MB)
Traceback (most recent call last):
File "train.py", line 492, in
@hcodee , could you try stable torch 1.7?
Torch 1.7 does not work with the RTX 3090. It took a long time to figure out that it needs to run on the nightly Torch 1.8 build.
You need to compile PyTorch yourself with CUDA 11.1 installed. It is doable; I did it from master without any hassle (surprisingly). Unfortunately I need to do it again for 1.7.
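For anyone else stuck on an RTX 3090, a quick sanity check (assuming a CUDA-enabled torch build) of whether the installed wheel actually ships kernels for the card's sm_86 architecture:

import torch

# RTX 3090 is compute capability 8.6 (sm_86); native sm_86 kernels need a CUDA 11.1+ build.
print(torch.__version__, torch.version.cuda)
print(torch.cuda.get_device_capability(0))    # expect (8, 6) on an RTX 3090
if hasattr(torch.cuda, "get_arch_list"):      # available on recent torch builds
    print(torch.cuda.get_arch_list())         # sm_86 should be listed for native kernels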
@batrlatom Cool, thanks for the reminder. I will try it out.
I have met this issue with RTX3090 & Cuda 11.1.0. Is there any solution for this configuration?
python train.py --batch-size 64 --data ./data/coco128.yaml --cfg ./models/yolov5s.yaml --weights ''
Using torch 1.8.0.dev20201117 CUDA:0 (GeForce RTX 3090, 24265MB)
Traceback (most recent call last):
  File "train.py", line 492, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 91, in train
    model = Model(opt.cfg, ch=3, nc=nc).to(device)  # create
  File "/home/yons/work/yolov5/models/yolo.py", line 95, in __init__
    self._initialize_biases()  # only run once
  File "/home/yons/work/yolov5/models/yolo.py", line 150, in _initialize_biases
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.
Same problem with the nightly pytorch version here. Any luck with the self-compiled pytorch 1.8?
I met the same issue with pytorch 1.8, and the following code works for me:
b.data[:, 4] += math.log(8 / (640 / s) ** 2) # obj (8 objects per 640 image)
b.data[:, 5:] += math.log(0.6 / (m.nc - 0.99)) if cf is None else torch.log(cf / cf.sum()) # cls
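For what it's worth, this and the torch.no_grad() version posted above sidestep the same check in different ways; a minimal sketch of the two patterns in isolation (not the actual yolo.py code):

import torch

bias = torch.zeros(3, 85, requires_grad=True)  # stands in for mi.bias
b = bias.view(3, -1)                           # the view that trips the check

# Option 1: disable grad tracking while editing the view in place
with torch.no_grad():
    b[:, 4] += 1.0

# Option 2: write through .data, which bypasses autograd entirely
b.data[:, 5:] += 1.0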
I just ran into this issue myself, so it's time for a fix :) Will add a TODO and prioritize this for a fix ASAP.
@DoctorKey I can confirm your solution works correctly. I will submit a PR for this to master.
@NanoCode012 @DoctorKey @batrlatom @hcodee this problem should be resolved now by implementing @DoctorKey's fix in PR #1759. The Docker image for ultralytics/yolov5:latest should be updated with this fix in a few minutes.
Let me know if any other issues pop up, and thank you for your contributions!
Hi, I am new to this and I just encountered a runtime problem:
Traceback (most recent call last):
File "train.py", line 492, in
I am using torch 1.8.0+cu101.
I really don't know what to do. Any help please?
@Nytsirch this error is likely generated by an unsupported 3rd party notebook. Please see the official YOLOv5 Colab Notebook below, and visit the Train Custom Data Tutorial to get started with YOLOv5. https://colab.research.google.com/github/ultralytics/yolov5/blob/master/tutorial.ipynb
Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:
$ pip install -r requirements.txt
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.
Hi, I am new to this and I just encountered a runtime problem:

Traceback (most recent call last):
  File "train.py", line 492, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 91, in train
    model = Model(opt.cfg, ch=3, nc=nc).to(device)  # create
  File "/content/yolov5/models/yolo.py", line 95, in __init__
    self._initialize_biases()  # only run once
  File "/content/yolov5/models/yolo.py", line 150, in _initialize_biases
    b[:, 4] += math.log(8 / (640 / s) ** 2)  # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

I am using torch 1.8.0+cu101.
I really don't know what to do. Any help please?
Change the two old lines in yolo.py:
b[:, 4] += math.log(8 / (640 / s) ** 2) # obj (8 objects per 640 image)
b[:, 5:] += math.log(0.6 / (m.nc - 0.99)) if cf is None else torch.log(cf / cf.sum()) # cls
to the new ones:
b.data[:, 4] += math.log(8 / (640 / s) ** 2) # obj (8 objects per 640 image)
b.data[:, 5:] += math.log(0.6 / (m.nc - 0.99)) if cf is None else torch.log(cf / cf.sum()) # cls
🐛 Bug
I got the error message below when trying to test the latest commit cff9263490fbf4b80dcc2d87914e087e6c07b6a0 in a new Docker image. I haven't pulled recently, so I'm not sure which commit introduced this error.
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.
To Reproduce (REQUIRED)
python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --nosave --cache
Output:
Expected behavior
Run normally
Environment
Additional context
It seems to run fine when I'm running from an old conda py37 environment with torch 1.6. I cannot reproduce this error on Google Colab. Could there be something wrong with Docker dependencies?
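One way to narrow down the difference would be to run a small diagnostic in both the working conda environment and the failing Docker image and diff the output; a sketch (not part of the repo):

import sys
import torch

# Print the pieces that usually differ between a working and a failing environment.
print("python:", sys.version.split()[0])
print("torch :", torch.__version__)             # 1.6 in the working conda env
print("cuda  :", torch.version.cuda)
print("cudnn :", torch.backends.cudnn.version())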