KeyError when running detect.py with MPS (M1/M2) + fix

Search before asking

[X] I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Detection

Bug

When running detection with --device=mps, I occasionally get a KeyError crash:

detect: weights=['yolov5n.pt'], source=/Users/reinoud/Desktop/00102.png, data=data/coco128.yaml, imgsz=[640, 640], conf_thres=0.1, iou_thres=0.45, max_det=1000, device=mps, view_img=False, save_txt=True, save_conf=True, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=/tmp/fullvideo2, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False, vid_stride=1
YOLOv5 🚀 v7.0-172-gc3c1304 Python-3.11.2 torch-2.0.1 MPS

Fusing layers...
YOLOv5n summary: 213 layers, 1867405 parameters, 0 gradients
Traceback (most recent call last):
  File "/Volumes/Work/megadetector/yolov5-original/detect.py", line 261, in <module>
    main(opt)
  File "/Volumes/Work/megadetector/yolov5-original/detect.py", line 256, in main
    run(**vars(opt))
  File "/Users/reinoud/venvs/yolov5-original/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/Work/megadetector/yolov5-original/detect.py", line 172, in run
    label = None if hide_labels else (names[c] if hide_conf else f'{names[c]} {conf:.2f}')
                                                                    ~~~~~^^^
KeyError: 518

It's looking for detection class 518, which does not exist.

Doing the detection on the CPU has no issue.

The error happens when only a single item is detected.

Environment

YOLOv5 🚀 v7.0-172-gc3c1304 Python-3.11.2 torch-2.0.1 CPU
macOS Ventura 13.4.1 (22F82)
MacBook Pro M2 Max

Minimal Reproducible Example

Run python detect.py --weights yolov5n.pt --source 00102.png --conf-thres 0.1 --device mps on the following image, 00102 (I'm not sure whether GitHub re-encodes the file; the sha1 should be 1365abbd15f0f5db0d028db9d5d14c87018e99bd).

However, as far as I have been able to determine, this fails on any detection run that produces exactly 1 result.

Additional

This bug also seems to have been mentioned in #9900, but there seems to be some other stuff going on there as well, so I'm making this its own bug.

Debugging and comparing results between the CPU and MPS, I found what I expect is the offending line:

detect.py line 163: for *xyxy, conf, cls in reversed(det):

Reversing det is done so that results are plotted from low to high confidence (commit 7875f4c). This works well on CPU: the reversal is always applied along dimension 0, so with a single detection nothing changes. On MPS, the tensor is also reversed along dimension 0 when there is more than one detection (which is what we want), but when there is only a single detection, the reversal is applied along dimension 1 instead.

I put a print(det) and print(reversed(det)) right above this line.

In the case of multiple detections (different image), it prints:

tensor([[3.00000e+00, 1.01000e+02, 4.04000e+02, 6.33000e+02, 1.43421e-01, 2.10000e+01],
        [1.55300e+03, 2.69000e+02, 1.82700e+03, 6.00000e+02, 1.03632e-01, 5.00000e+01]], device='mps:0')
tensor([[1.55300e+03, 2.69000e+02, 1.82700e+03, 6.00000e+02, 1.03632e-01, 5.00000e+01],
        [3.00000e+00, 1.01000e+02, 4.04000e+02, 6.33000e+02, 1.43421e-01, 2.10000e+01]], device='mps:0')

In the case of a single detection:

tensor([[5.18000e+02, 9.10000e+01, 6.09000e+02, 2.00000e+02, 1.20626e-01, 5.00000e+01]], device='mps:0')
tensor([[5.00000e+01, 1.20626e-01, 2.00000e+02, 6.09000e+02, 9.10000e+01, 5.18000e+02]], device='mps:0')

Hence the class it's predicting is actually the X coordinate.
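
To make the failure concrete, here is a minimal sketch in plain Python (no PyTorch needed) using the single-detection row printed above. The names dict is a hypothetical stand-in for the model's class-name mapping and holds only one placeholder entry; unpack is an illustrative helper that mirrors the unpacking pattern of the loop in detect.py.

```python
# Single detection as printed on MPS: x1, y1, x2, y2, conf, cls
row = [518.0, 91.0, 609.0, 200.0, 0.120626, 50.0]

# Hypothetical stand-in for the model's class-name mapping (one placeholder entry).
names = {50: "class_50"}


def unpack(detection):
    """Unpack one detection row the same way the loop in detect.py does."""
    *xyxy, conf, cls = detection
    return xyxy, conf, int(cls)


# Expected behaviour: the row is left untouched, so cls is the real class index.
_, _, good_cls = unpack(row)

# Behaviour seen on MPS: the row itself is flipped (dimension 1), so the x1
# coordinate lands in the cls position.
_, _, bad_cls = unpack(row[::-1])

print(names[good_cls])   # valid lookup
print(bad_cls in names)  # the flipped "class" is not a known class index
```

With the flipped row, the value in the class position is the x1 coordinate (518), which is why the lookup in names fails.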

~~It feels to me that this is a bug in pytorch, but I really don't know enough to claim it is.~~ Update: yes it's a bug in PyTorch (pytorch/pytorch#96558), fixed in the nightly (pytorch/pytorch@c95bcb669492805bd5cb73f40958ccafbfc096a3).

Anyway, the fix in YOLOv5 seems easy (hacky, but easy):

                for *xyxy, conf, cls in det if len(det) == 1 else reversed(det):

More than happy to make a PR for this, but it feels like a current dev can probably fix it 100x faster than me making a PR :). However let me know!

I just ran a test on a 50k-frame video. Usually the difference (in the output reported through --save-txt) between CPU and MPS is 0. When there is a difference, it is most commonly on the order of 1e-6 in the confidence. About 1 in 100 lines has a small (< 1e-4) difference in the coordinates, which is about 1/10th of a pixel assuming an image size of 1000x1000. I did find one frame with a 1e-2 difference in coordinates, which is 10 pixels at that image size; however, visually comparing the locations, I couldn't say which one was more correct (I expect that in this specific case a tiny change in confidence meant that a different box was selected during NMS). All in all, for me this shows that MPS is usable for detection.
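
For anyone who wants to repeat this kind of comparison, here is a rough sketch of how two label files could be diffed. It is not the script used for the test above. The function name max_abs_diffs is hypothetical, the example lines are made up, and the sketch assumes the "cls x y w h conf" line layout written with --save-txt --save-conf, with both runs listing the same detections in the same order.

```python
def max_abs_diffs(cpu_lines, mps_lines):
    """Compare two label files line by line.

    Each line is assumed to be "cls x y w h conf" with normalized box values.
    Returns (max coordinate diff, max confidence diff).
    Assumes both runs list the same detections in the same order.
    """
    max_coord, max_conf = 0.0, 0.0
    for cpu_line, mps_line in zip(cpu_lines, mps_lines):
        cpu_vals = [float(v) for v in cpu_line.split()]
        mps_vals = [float(v) for v in mps_line.split()]
        # Fields 1..4 are the normalized box, field 5 is the confidence.
        for a, b in zip(cpu_vals[1:5], mps_vals[1:5]):
            max_coord = max(max_coord, abs(a - b))
        max_conf = max(max_conf, abs(cpu_vals[5] - mps_vals[5]))
    return max_coord, max_conf


# Made-up example lines, not taken from the actual run.
cpu = ["50 0.25 0.5 0.125 0.25 0.5"]
mps = ["50 0.25 0.5 0.125 0.375 0.25"]
coord_diff, conf_diff = max_abs_diffs(cpu, mps)
print(coord_diff, conf_diff)

# A normalized diff multiplied by the image size gives the diff in pixels.
print(coord_diff * 1000)
```

If NMS selects a different box in one of the runs, the line-by-line pairing no longer matches like with like, so large differences are worth inspecting by hand.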

Are you willing to submit a PR?

[X] Yes I'd like to help by submitting a PR!

Thank you for providing a detailed explanation of the issue you are encountering when running YOLOv5 with MPS on Apple M1 and M2 chips. From the information you provided, it seems that you have identified a specific edge case where reversing the tensor is not working as expected when using MPS, especially when there is only one detection.

Your proposed solution of adding a condition to check the length of the det tensor before reversing it is a practical approach to addressing this issue.

Here is how you can clean up the code a little:

dets_to_iterate = det if len(det) == 1 else reversed(det)
for *xyxy, conf, cls in dets_to_iterate:

This ensures that for single detection cases, the tensor isn't reversed, avoiding the unexpected behavior with MPS.
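
As a minimal sketch, the same guard can be wrapped in a small helper so it can be exercised outside detect.py. The name low_to_high is hypothetical and not part of YOLOv5; plain lists stand in for the (n, 6) detection tensor here, on the assumption that anything supporting len() and reversed() behaves the same way.

```python
def low_to_high(det):
    """Return the detection rows in reversed order.

    Works on any sequence that supports len() and reversed(), e.g. a list of
    rows or a 2-D tensor. A single detection is returned as-is, so the
    reversal is never called for that case.
    """
    return det if len(det) == 1 else reversed(det)


# Plain lists standing in for the detection tensor (values from the report).
single = [[518.0, 91.0, 609.0, 200.0, 0.120626, 50.0]]
multi = [
    [3.0, 101.0, 404.0, 633.0, 0.143421, 21.0],
    [1553.0, 269.0, 1827.0, 600.0, 0.103632, 50.0],
]

for *xyxy, conf, cls in low_to_high(single):
    print(int(cls), conf)

for *xyxy, conf, cls in low_to_high(multi):
    print(int(cls), conf)
```

The helper only changes behavior for the one-row case, where reversing along dimension 0 would be a no-op anyway, so the plotting order for multiple detections is unaffected.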

Since you have also mentioned that there is an update in PyTorch that addresses the issue, it is worth noting that updating the PyTorch version to the latest one which includes the fix can be another solution. However, adding the check in the code can serve as a safeguard.

I would encourage you to go ahead and submit a PR to the YOLOv5 repository. Although you mentioned that a current developer might be able to do it faster, contributing to open source is always a good practice and it helps the community. Make sure to explain the issue and how your changes fix it in the PR description. Also, if possible, add any tests or examples that demonstrate the fix.

Good luck with your PR! And thank you for contributing to the improvement of YOLOv5 and the broader AI community.

This issue should be fixed with the PyTorch update 2.0.0, according to the PyTorch issue linked above.

@reinhrst if you'd still like to make that PR for people running less than 2.0.0, that would be great, as I'm also pretty certain devs use pull requests to make changes to the repo anyway.

This issue should indeed be resolved with the PyTorch update to version 2.0.0, as mentioned in the linked PyTorch issue. @democat3457, if you would like to contribute by making a pull request to address the issue for users running versions below 2.0.0, it would be greatly appreciated. Pull requests are typically used by developers to propose changes to a repository, making them an excellent way to contribute to the project.

Thank you for your interest in improving YOLOv5 and potentially providing a fix for this issue. Your contribution would be valuable to the community.
