👋 Hello @cubrink, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.
If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.
For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.
Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:
$ pip install -r requirements.txt
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.
I did some further testing by checking out different commits. The first commit that produced the bug is this one.
@cubrink thanks for the bug report. Conda environments can sometimes cause problems; I would recommend you try one of our verified environments above, including the Docker image.
A few comments about your example:
@glenn-jocher, thank you for the advice, I'm aware of the auto-downloading features but the firewall on the machine in question blocks the download.
I've since run the script in the Docker container using DDP and have run into similar issues.
Edit: I can again confirm that the first commit to break the training script is this one. This was the same commit that broke train.py in my initial bug report.
It looks like the line that breaks train.py is:
if tb_writer:
    tb_writer.add_graph(torch.jit.trace(model, imgs, strict=False), [])  # add model graph
on line 333 of train.py. I was able to fix train.py by commenting out the previously mentioned lines.
Within Docker, using DDP: the training hangs indefinitely shortly after starting.
Within Docker, without DDP: the original error occurs (the tensor is printed, then the script exits).
Basic setup
# Volume added because autodownload is blocked by firewall settings.
# On a different machine the volume would not be needed.
# The data is from official sources, so it should work without issue.
sudo docker run --rm --ipc=host --gpus all -it -v <my_local_path>:/yolov5_resources ultralytics/yolov5:latest
# (Due to personal firewall settings, copy weights into ./weights and dataset into ../coco128)
(At the time of writing, the commit from ultralytics/yolov5:latest is 61ea23c.)
With DDP:
python -m torch.distributed.launch \
--nproc_per_node 4 \
train.py \
--weights weights/yolov5m.pt \
--cfg models/yolov5m.yaml \
--data data/coco128.yaml \
--device 0,1,2,3
Result: Training hangs after 2 batches. I've let this sit for about 10 minutes without progress.
(Note the warning that is raised before hanging: UserWarning: The input to trace is already a ScriptModule, tracing it is a no-op. Returning the object as is.)
Without DDP:
python train.py \
--weights weights/yolov5m.pt \
--cfg models/yolov5m.yaml \
--data data/coco128.yaml \
--device 0,1,2,3
Result: A tensor is printed by train.py, then the script exits. (This is the same bug that was originally reported.) The expected behavior was regular training on the coco128 dataset.
I have also encountered this error when training with my own dataset. Just to add some information about the problem, the stack trace printed before the full tensor is:
Traceback (most recent call last):
File "train.py", line 591, in <module>
train(hyp, opt, device, tb_writer)
File "train.py", line 374, in train
tb_writer.add_graph(torch.jit.trace(model, imgs, strict=False), []) # add model graph
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/jit/_trace.py", line 733, in trace
return trace_module(
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/jit/_trace.py", line 934, in trace_module
module._c._create_method_from_trace(
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/modules/module.py", line 725, in _call_impl
result = self._slow_forward(*input, **kwargs)
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/modules/module.py", line 709, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
return self.gather(outputs, self.output_device)
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient
Tensor:
[...]
Additionally, as @cubrink mentioned, with the line commented out the script runs correctly.
Thanks guys. This must be related to a recent PR https://github.com/ultralytics/yolov5/pull/3236 that re-added the TensorBoard graph for interactive model architecture viewing in TensorBoard. It seems to be failing in some environments as in your examples above though. I'll take a look at this today. Worst case scenario we can drop it into a try: except statement.
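For illustration, a minimal sketch of what such a try/except guard might look like (not an actual patch; tb_writer, model, and imgs follow the snippet quoted earlier in the thread and are assumed to come from train.py's surrounding scope):

```python
# Hedged sketch: wrap the TensorBoard graph export so a tracing failure
# cannot abort training. tb_writer, model and imgs are assumed to exist
# in train.py's surrounding scope, as in the snippet quoted above.
if tb_writer:
    try:
        tb_writer.add_graph(torch.jit.trace(model, imgs, strict=False), [])  # add model graph
    except Exception as e:
        print(f'TensorBoard graph visualization failure: {e}')  # warn and keep training
```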
@adrigrillo your error in particular seems to imply that a model with gradients might be causing the issue, or perhaps the nn/parallel lines imply that only DP or DDP models in particular are causing the error?
Now that you mention it, the problem could be caused by the use of DP mode, which I did not actually intend to use. I forgot to specify the GPU, as I wanted single-GPU training, and I guess that, by default, if two GPUs are available and no GPU flag is specified, training will use both in DP mode.
In any case, commenting out the line makes it work in DP mode, and with one GPU it also works. Therefore, the problem may be related to multi-GPU training rather than to the gradient message.
@adrigrillo ok thanks, that makes sense then. It's likely that the graph/script functions were just never intended for use with DP/DDP. I think we can use the same if is_parallel() statement used elsewhere in these cases:
https://github.com/ultralytics/yolov5/blob/407dc5008e47b1aad5ce69f0c91b4f1ec321dd7f/train.py#L393
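A rough sketch of that idea (illustrative only, not the exact patch that later landed; it assumes the is_parallel() helper lives in utils.torch_utils and that tb_writer, model, and imgs come from train.py's context):

```python
# Hedged sketch: unwrap a DataParallel/DistributedDataParallel model before
# tracing it for TensorBoard. is_parallel() is assumed to be the helper
# referenced in the linked train.py line.
from utils.torch_utils import is_parallel

if tb_writer:
    graph_model = model.module if is_parallel(model) else model  # de-parallelize if wrapped
    tb_writer.add_graph(torch.jit.trace(graph_model, imgs, strict=False), [])  # add model graph
```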
@adrigrillo @cubrink good news 😃! Your original issue may now be fixed ✅ in PR #3325. This PR de-parallelizes the model before passing it to the TensorBoard add_graph() function, which should resolve the original issue if it was only observed in multi-GPU trainings. There is currently a UserWarning when the graph is saved, but this is expected and should not cause any problems (UserWarning: The input to trace is already a ScriptModule, tracing it is a no-op. Returning the object as is.). To receive this update:

- Git: run git pull from within your yolov5/ directory, or git clone https://github.com/ultralytics/yolov5 again
- PyTorch Hub: model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
- Docker: sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
@glenn-jocher Thanks for your great work, but I found that the problem mentioned in https://github.com/ultralytics/yolov5/issues/3284#issuecomment-847270860 still exists. When I use Docker to run the code with DDP, the training hangs indefinitely.
@shufanwu hi thanks for the feedback. Can you confirm you are seeing this error in the latest Docker image? You can pull the latest image using the command below.
sudo docker pull ultralytics/yolov5:latest
to update your image.

> @glenn-jocher Thanks for your great work, but I found that the problem mentioned in #3284 (comment) still exists. When I use Docker to run the code with DDP, the training hangs indefinitely.

https://github.com/ultralytics/yolov5/pull/3325 solved for me
> @glenn-jocher Thanks for your great work, but I found that the problem mentioned in #3284 (comment) still exists. When I use Docker to run the code with DDP, the training hangs indefinitely.

#3325 solved for me. In addition, the develop (default) branch still has this issue, whereas the master branch was fixed.
@kanybekasanbekov @SkalskiP yes, as you noticed we have a new develop branch. The idea is to adopt a workflow closer to best practices, where we mostly update the develop branch and periodically merge to master on a new patch release:

master < develop < feature

Though bug fixes will take a different route:

master < fix
🐛 Bug
When using train.py, a tensor is printed to screen and then the script ends. No training occurs.

To Reproduce (REQUIRED)
Input:
Output: The training script starts normally and sits idle briefly. Then a tensor is printed to screen and the script ends.
After the script has stopped, runs/train/yolov5-bug is only partially filled. I'm unsure if this is relevant. runs/train/yolov5-bug contents: (.tfevents filename partially redacted for privacy reasons)

Expected behavior
Regular training on the coco128 test dataset.
Environment
Additional context
I've used older versions of YOLOv5 before without issue. I recently decided to update, and that is when I ran into this issue. I will not be able to access this machine again until Monday.