ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com

Negative-width bounding boxes when running on M1 (mps) HW #12654

Closed: agentmorris closed this issue 6 months ago

agentmorris commented 8 months ago

Search before asking

YOLOv5 Component

Validation

Bug

(At Glenn's suggestion, transferring from issue 12645, which I originally filed as a question.)

I am running a trained YOLOv5x6 model using val.py with the --save-json option. For a few images, the resulting .json file includes one or more boxes with negative width values (not negative x or y values, which seem normal and are discussed in other issues, but negative widths). I have only observed this behavior when running on M1 hardware (with --device mps). The issue occurs in the current YOLOv5 Python environment on M1 HW, does not occur with at least one older YOLOv5 environment on M1 HW, and AFAIK does not occur with any YOLOv5 environment on CUDA/x86 HW.
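
In case it's useful, here is a minimal sketch for flagging the affected detections. It assumes the COCO-style [x, y, w, h] boxes that --save-json writes, and uses a placeholder path for the *_predictions.json file:

import json

# Placeholder path: point this at the *_predictions.json file written by val.py --save-json
with open("predictions.json") as f:
    detections = json.load(f)

# COCO-style bbox is [x, y, w, h]; width and height should never be negative
bad = [d for d in detections if d["bbox"][2] < 0 or d["bbox"][3] < 0]
for d in bad:
    print(d["image_id"], d["bbox"], d["score"])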

Environment

Hardware/OS environment

Python environment

Both environments were created via Miniforge.
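
In case it helps with reproduction, a minimal sketch for recording the relevant parts of each environment (the torch.backends.mps attributes assume PyTorch >= 1.12):

import platform

import torch

print(platform.platform())                                   # OS and architecture (arm64 on M1)
print("torch:", torch.__version__)
print("MPS built:", torch.backends.mps.is_built())           # requires PyTorch >= 1.12
print("MPS available:", torch.backends.mps.is_available())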

Minimal Reproducible Example

I am now able to share images that reproduce this behavior, and I also have some new data that might get us closer to a root cause:

The command I am running on the M1 VM is:

python val.py --task test --data "/Users/ec2-user/images/dataset.yaml" --weights "/Users/ec2-user/md_v5a.0.0.pt" --batch-size 1 --imgsz 1280 --conf-thres 0.001 --device "mps" --save-json --project "/Users/ec2-user/yolo-results/yolo-new-no-aug" --name "yolo_results" --exist-ok --save-txt --save-conf

Additional

No response

Are you willing to submit a PR?

glenn-jocher commented 8 months ago

@agentmorris hello! Thank you for the detailed report and for following up on the previous issue. It's quite intriguing that you're observing negative-width bounding boxes exclusively on M1 hardware with MPS. This could be related to differences in the MPS backend or a specific library version incompatibility.

To help us diagnose and address this issue, could you please:

  1. Confirm that you're using the latest commit from the YOLOv5 repository.
  2. Test with different versions of PyTorch, especially the one used in your "old YOLOv5" environment where the issue does not occur.
  3. If possible, isolate the issue by running inference on a single image that produces a negative-width bounding box and share the verbose output (for example, with a side-by-side CPU vs. MPS run like the sketch below).
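
For step 3, a minimal sketch along these lines would work; the image path is a placeholder, and the torch.hub "custom" entry point is the standard way to load a YOLOv5 checkpoint:

import torch

# Load the custom checkpoint via the YOLOv5 hub entry point
model = torch.hub.load("ultralytics/yolov5", "custom", path="md_v5a.0.0.pt")
img = "offending_image.jpg"  # placeholder: an image that produces a negative-width box

for device in ("cpu", "mps"):
    model.to(device)
    results = model(img, size=1280)
    # Detections in xyxy format (x1, y1, x2, y2, conf, class); x2 should never be < x1
    print(device, results.xyxy[0])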

Your cooperation is much appreciated! We'll look into this as soon as we have more information. Meanwhile, for further guidance, please refer to our documentation at https://docs.ultralytics.com/yolov5/.

Thank you for your contribution to improving YOLOv5! 🚀

agentmorris commented 8 months ago

I have a more self-contained repro now, from a brand new AWS mac2.metal VM...

# Install miniforge
brew install wget
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh --no-check-certificate
chmod a+x Miniforge3-MacOSX-arm64.sh
./Miniforge3-MacOSX-arm64.sh
source ~/.zshrc

# Get the model weights, dataset file, and test image
mkdir ~/images
wget https://github.com/agentmorris/MegaDetector/releases/download/v5.0/md_v5a.0.0.pt -O ~/images/md_v5a.0.0.pt --no-check-certificate
wget http://dmorris.net/misc/tmp/m1-yolo-issue/n7_2019-03-19_07-25-00.JPG -O ~/images/n7_2019-03-19_07-25-00.JPG
wget http://dmorris.net/misc/tmp/m1-yolo-issue/dataset.yaml -O ~/images/dataset.yaml

# Check out both YOLOv5 versions ("new" and "old") to separate folders
git clone https://github.com/ultralytics/yolov5 yolov5-new

git clone https://github.com/ultralytics/yolov5 yolov5-old
cd yolov5-old
git checkout c23a441c9df7ca9b1f275e8c8719c949269160d1

# Create Python environments
mamba create -n yolov5-new python=3.11 pip -y
cd ~/yolov5-new && mamba activate yolov5-new
pip install -r requirements.txt

mamba create -n yolov5-old python=3.8 pip -y
cd ~/yolov5-old && mamba activate yolov5-old
pip install -r requirements.txt

# The old YOLOv5 requirements.txt file specifies numpy>=1.18.5, which resolves to
# numpy 1.24.4 as of 2024.01.21. That results in "AttributeError: module 'numpy'
# has no attribute 'int'" (the np.int alias was removed in NumPy 1.24). So we roll
# numpy back to 1.21.4, which still satisfies requirements.txt.
pip uninstall -y numpy && pip install numpy==1.21.4

# Test
cd ~/yolov5-new && mamba activate yolov5-new
python val.py --task test --data "/Users/ec2-user/images/dataset.yaml" --weights "/Users/ec2-user/images/md_v5a.0.0.pt" --batch-size 1 --imgsz 1280 --conf-thres 0.001 --device "mps" --save-json --project "/Users/ec2-user/yolo-results/yolo-new" --name "yolo_results" --exist-ok --save-txt --save-conf

cat ~/yolo-results/yolo-new/yolo_results/md_v5a.0.0_predictions.json

# [{"image_id": "n7_2019-03-19_07-25-00", "category_id": 0, "bbox": [1252.152, 994.286, -745.002, 257.866], "score": 0.96902}]

cd ~/yolov5-old && mamba activate yolov5-old
python val.py --task test --data "/Users/ec2-user/images/dataset.yaml" --weights "/Users/ec2-user/images/md_v5a.0.0.pt" --batch-size 1 --imgsz 1280 --conf-thres 0.001 --device "mps" --save-json --project "/Users/ec2-user/yolo-results/yolo-old" --name "yolo_results" --exist-ok --save-txt --save-conf

cat ~/yolo-results/yolo-old/yolo_results/md_v5a.0.0_predictions.json

# [{"image_id": "n7_2019-03-19_07-25-00", "category_id": 0, "bbox": [135.414, 994.286, 371.736, 257.866], "score": 0.96902}]

glenn-jocher commented 8 months ago

@agentmorris, fantastic work on creating a self-contained reproducible example! This will greatly assist in debugging the issue. The negative-width bounding box in the new environment versus the correct output in the old environment suggests a regression or incompatibility introduced in the newer software stack.

Given the detailed steps you've provided, we will:

  1. Replicate your environment and run the provided commands.
  2. Investigate any changes between the old and new versions of YOLOv5, as well as differences in dependency versions, particularly those related to MPS support.
  3. Look into the MPS backend processing to identify any potential source of the negative-width bounding box issue.

Your thorough testing and reporting are invaluable to the YOLOv5 community and the Ultralytics team. We'll update you as soon as we have more insights or require further information.

Thank you for your dedication to improving YOLOv5! 🌟

github-actions[bot] commented 7 months ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see our documentation at https://docs.ultralytics.com/yolov5/.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

agentmorris commented 7 months ago

Hopefully the GHA bot isn't going to automatically close this issue? This seems like a fairly severe problem: the negative boxes are just the easy-to-detect manifestation; the underlying issue is better described as "large discrepancies between M1 results and results on other hardware".

If I can ignore the GHA bot, you can ignore this comment. :)

glenn-jocher commented 7 months ago

@agentmorris, rest assured, we'll ensure this issue remains open and actively investigated given its significance. The discrepancies you've highlighted, especially with the MPS backend on M1 hardware, are indeed critical to address for ensuring consistent and reliable model performance across different platforms.

Your findings and the effort you've put into documenting this issue are greatly appreciated. We'll prioritize looking into this and keep you updated on our progress. Please feel free to add any further observations or data you may gather as we work towards a resolution.

Thank you for your patience and for contributing to the robustness of YOLOv5! 🛠️

github-actions[bot] commented 6 months ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see our documentation at https://docs.ultralytics.com/yolov5/.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

agentmorris commented 6 months ago

@glenn-jocher Were you able to assess the scope of this issue before closing? The negative-width bounding boxes were just the symptom that let us find this issue; the fact that results are incorrect on M1 HW at all seems like a possibly-big deal, unless there's something specific about this repro that limits the scope. Any ideas?

glenn-jocher commented 6 months ago

@agentmorris, absolutely, your concern is valid and recognized. I've reviewed the scope, and the issue does extend beyond negative-width bounding boxes; the broader discrepancies in results on M1 hardware are what matter. We're digging deeper to understand the root cause and its implications, and we'll update you once we have more clarity on the specific conditions or factors involved. Your insight has been invaluable in surfacing this; rest assured, we're on it! 🚀

agentmorris commented 6 months ago

Thanks. The github-actions bot tricked me again. :)

glenn-jocher commented 6 months ago

@agentmorris, haha, those bots can be quite sneaky! 😄 If there's anything more you need help with or any more insights you gather, feel free to share. We're all ears and here to support. Happy coding!