ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com

Negative-width bounding boxes when running on M1 (mps) HW #12654

Closed: agentmorris closed this issue 6 months ago

agentmorris commented 8 months ago

Search before asking

YOLOv5 Component

Validation

Bug

(At Glenn's suggestion, transferring from issue 12645, which I originally filed as a question.)

I am running a trained YOLOv5x6 model using val.py with the --save-json option. For a few images, the resulting .json file includes one or more boxes with negative width values (not negative x or y values, which seem normal and are discussed in other issues, but negative widths). I have only observed this behavior when running on M1 hardware (with --device mps). The issue occurs in the current YOLOv5 Python environment on M1 HW, does not occur with at least one older YOLOv5 environment on M1 HW, and AFAIK does not occur with any YOLOv5 environment on CUDA/x86 HW.
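
In case it's useful, here is a minimal sketch for flagging the affected detections. It assumes the COCO-style [x, y, w, h] boxes that --save-json writes, and uses a placeholder path for the *_predictions.json file:

import json

# Placeholder path: point this at the *_predictions.json file written by val.py --save-json
with open("predictions.json") as f:
    detections = json.load(f)

# COCO-style bbox is [x, y, w, h]; width and height should never be negative
bad = [d for d in detections if d["bbox"][2] < 0 or d["bbox"][3] < 0]
for d in bad:
    print(d["image_id"], d["bbox"], d["score"])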

Environment

Hardware/OS environment

Python environment

Both environments were created via Miniforge.
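
In case it helps with reproduction, a minimal sketch for recording the relevant parts of each environment (the torch.backends.mps attributes assume PyTorch >= 1.12):

import platform

import torch

print(platform.platform())                                   # OS and architecture (arm64 on M1)
print("torch:", torch.__version__)
print("MPS built:", torch.backends.mps.is_built())           # requires PyTorch >= 1.12
print("MPS available:", torch.backends.mps.is_available())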

Minimal Reproducible Example

I am now able to share images that reproduce this behavior, and I also have some new data that might get us closer to a root cause:

The command I am running on the M1 VM is:

python val.py --task test --data "/Users/ec2-user/images/dataset.yaml" --weights "/Users/ec2-user/md_v5a.0.0.pt" --batch-size 1 --imgsz 1280 --conf-thres 0.001 --device "mps" --save-json --project "/Users/ec2-user/yolo-results/yolo-new-no-aug" --name "yolo_results" --exist-ok --save-txt --save-conf

Additional

No response

Are you willing to submit a PR?

glenn-jocher commented 8 months ago

@agentmorris hello! Thank you for the detailed report and for following up on the previous issue. It's quite intriguing that you're observing negative-width bounding boxes exclusively on M1 hardware with MPS. This could be related to differences in the MPS backend or a specific library version incompatibility.

To help us diagnose and address this issue, could you please:

  1. Confirm that you're using the latest commit from the YOLOv5 repository.
  2. Test with different versions of PyTorch, especially the one used in your "old YOLOv5" environment where the issue does not occur.
  3. If possible, isolate the issue by running inference on a single image that produces a negative-width bounding box and share the verbose output (for example, with a side-by-side CPU vs. MPS run like the sketch below).
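
For step 3, a minimal sketch along these lines would work; the image path is a placeholder, and the torch.hub "custom" entry point is the standard way to load a YOLOv5 checkpoint:

import torch

# Load the custom checkpoint via the YOLOv5 hub entry point
model = torch.hub.load("ultralytics/yolov5", "custom", path="md_v5a.0.0.pt")
img = "offending_image.jpg"  # placeholder: an image that produces a negative-width box

for device in ("cpu", "mps"):
    model.to(device)
    results = model(img, size=1280)
    # Detections in xyxy format (x1, y1, x2, y2, conf, class); x2 should never be < x1
    print(device, results.xyxy[0])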

Your cooperation is much appreciated! We'll look into this as soon as we have more information. Meanwhile, for further guidance, please refer to our documentation at https://docs.ultralytics.com/yolov5/.

Thank you for your contribution to improving YOLOv5! 🚀

agentmorris commented 8 months ago

I have a more self-contained repro now, from a brand new AWS mac2.metal VM...

# Install miniforge
brew install wget
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh --no-check-certificate
chmod a+x Miniforge3-MacOSX-arm64.sh
./Miniforge3-MacOSX-arm64.sh
source ~/.zshrc

# Get the model weights, dataset file, and test image
mkdir ~/images
wget https://github.com/agentmorris/MegaDetector/releases/download/v5.0/md_v5a.0.0.pt -O ~/images/md_v5a.0.0.pt --no-check-certificate
wget http://dmorris.net/misc/tmp/m1-yolo-issue/n7_2019-03-19_07-25-00.JPG -O ~/images/n7_2019-03-19_07-25-00.JPG
wget http://dmorris.net/misc/tmp/m1-yolo-issue/dataset.yaml -O ~/images/dataset.yaml

# Check out both YOLOv5 versions ("new" and "old") to separate folders
git clone https://github.com/ultralytics/yolov5 yolov5-new

git clone https://github.com/ultralytics/yolov5 yolov5-old
cd yolov5-old
git checkout c23a441c9df7ca9b1f275e8c8719c949269160d1

# Create Python environments
mamba create -n yolov5-new python=3.11 pip -y
cd ~/yolov5-new && mamba activate yolov5-new
pip install -r requirements.txt

mamba create -n yolov5-old python=3.8 pip -y
cd ~/yolov5-old && mamba activate yolov5-old
pip install -r requirements.txt

# The old YOLOv5 requirements.txt file specifies numpy>=1.18.5, which resolves to
# numpy 1.24.4 as of 2024.01.21. That results in "AttributeError: module 'numpy'
# has no attribute 'int'" (the np.int alias was removed in NumPy 1.24). So we roll
# numpy back to 1.21.4, which still satisfies requirements.txt.
pip uninstall -y numpy && pip install numpy==1.21.4

# Test
cd ~/yolov5-new && mamba activate yolov5-new
python val.py --task test --data "/Users/ec2-user/images/dataset.yaml" --weights "/Users/ec2-user/images/md_v5a.0.0.pt" --batch-size 1 --imgsz 1280 --conf-thres 0.001 --device "mps" --save-json --project "/Users/ec2-user/yolo-results/yolo-new" --name "yolo_results" --exist-ok --save-txt --save-conf

cat ~/yolo-results/yolo-new/yolo_results/md_v5a.0.0_predictions.json

# [{"image_id": "n7_2019-03-19_07-25-00", "category_id": 0, "bbox": [1252.152, 994.286, -745.002, 257.866], "score": 0.96902}]

cd ~/yolov5-old && mamba activate yolov5-old
python val.py --task test --data "/Users/ec2-user/images/dataset.yaml" --weights "/Users/ec2-user/images/md_v5a.0.0.pt" --batch-size 1 --imgsz 1280 --conf-thres 0.001 --device "mps" --save-json --project "/Users/ec2-user/yolo-results/yolo-old" --name "yolo_results" --exist-ok --save-txt --save-conf

cat ~/yolo-results/yolo-old/yolo_results/md_v5a.0.0_predictions.json

# [{"image_id": "n7_2019-03-19_07-25-00", "category_id": 0, "bbox": [135.414, 994.286, 371.736, 257.866], "score": 0.96902}]

glenn-jocher commented 8 months ago

@agentmorris, fantastic work on creating a self-contained reproducible example! This will greatly assist in debugging the issue. The negative-width bounding box in the new environment versus the correct output in the old environment suggests a regression or incompatibility introduced in the newer software stack.

Given the detailed steps you've provided, we will:

  1. Replicate your environment and run the provided commands.
  2. Investigate any changes between the old and new versions of YOLOv5, as well as differences in dependency versions, particularly those related to MPS support.
  3. Look into the MPS backend processing to identify any potential source of the negative-width bounding box issue.

Your thorough testing and reporting are invaluable to the YOLOv5 community and the Ultralytics team. We'll update you as soon as we have more insights or require further information.

Thank you for your dedication to improving YOLOv5! 🌟

github-actions[bot] commented 7 months ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see our documentation at https://docs.ultralytics.com/yolov5/.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

agentmorris commented 7 months ago

Hopefully the GHA bot isn't going to automatically close this issue? This seems like a fairly severe problem: the negative boxes are just the easy-to-detect manifestation; the underlying issue is better described as "large discrepancies between M1 results and results on other hardware".

If I can ignore the GHA bot, you can ignore this comment. :)

glenn-jocher commented 7 months ago

@agentmorris, rest assured, we'll ensure this issue remains open and actively investigated given its significance. The discrepancies you've highlighted, especially with the MPS backend on M1 hardware, are indeed critical to address for ensuring consistent and reliable model performance across different platforms.

Your findings and the effort you've put into documenting this issue are greatly appreciated. We'll prioritize looking into this and keep you updated on our progress. Please feel free to add any further observations or data you may gather as we work towards a resolution.

Thank you for your patience and for contributing to the robustness of YOLOv5! 🛠️

github-actions[bot] commented 6 months ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see our documentation at https://docs.ultralytics.com/yolov5/.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

agentmorris commented 6 months ago

@glenn-jocher Were you able to assess the scope of this issue before closing? The negative-width bounding boxes were just the symptom that let us find this issue; the fact that results are incorrect on M1 HW at all seems like a possibly-big deal, unless there's something specific about this repro that limits the scope. Any ideas?

glenn-jocher commented 6 months ago

@agentmorris, absolutely, your concern is valid and recognized. I've reviewed the scope, and the issue does extend beyond negative-width bounding boxes; the broader discrepancies in results on M1 hardware are what matter. We're digging deeper to understand the root cause and its implications, and we'll update you once we have more clarity on the specific conditions or factors involved. Your insight has been invaluable in surfacing this; rest assured, we're on it! 🚀

agentmorris commented 6 months ago

Thanks. The github-actions bot tricked me again. :)

glenn-jocher commented 6 months ago

@agentmorris, haha, those bots can be quite sneaky! 😄 If there's anything more you need help with or any more insights you gather, feel free to share. We're all ears and here to support. Happy coding!