
Issues when using val.py on testing dataset #6637

Closed · fyang5 closed this issue 8 months ago

fyang5 commented 9 months ago

Search before asking

YOLOv8 Component

Val, Predict

Bug

I get different results in detect mode from predict.py and val.py with the same conf threshold and iou threshold (based on 5433 images, single-class detection). Here are the details:

  1. Using predict.py: yolo detect predict model=./ultralytics/runs/detect/train5/weights/best.pt source=./datasets/splitteddata/images/test/positive//*jpg device=0 save_txt=True imgsz=640 save_conf=True save=True max_det=300 iou=0.45 conf=0.1
  2. Using val.py: yolo detect val model=./ultralytics/runs/detect/train5/weights/best.pt data=./ultralytics/cfg/datasets/mydataset.yaml split=test imgsz=640 batch=1 save_hybrid=False conf=0.1 iou=0.45 max_det=300 device=0. In mydataset.yaml, the test entry is: "... test: ./datasets/splitteddata/images/test/positive".

However, with the first command, 4145 of the 5433 images got detected objects (counted as in the quick sketch below), while with the second command the console reports:

Class Images Instances Box(P R mAP50 mAP50-95): all 5433 5433 0.832 0.692 0.796 0.4725

The bug is that the confusion matrix shows a total of 5433 + 327 images used in val mode (see the attached image).
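
A quick sketch of how such a count can be checked (paths are illustrative for my run, assuming predict was called with save_txt=True):

```python
# Count how many images produced a prediction label file after `predict`.
from pathlib import Path

images = sorted(Path("./datasets/splitteddata/images/test/positive").glob("*.jpg"))
labels = sorted(Path("./ultralytics/runs/detect/predict/labels").glob("*.txt"))

print(f"{len(labels)} of {len(images)} images got at least one detection")
```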

Additional information: if I change the batch size to 16 for val.py, the confusion matrix shows 5433 + 720 images, and val.py reports P = 1 and R = 1, which is impossible in the real world.

Thanks!

Environment

Ubuntu 22.04, YOLOv8, Spyder, NVIDIA 3070 Ti GPU

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

glenn-jocher commented 9 months ago

@fyang5 hello! Thank you for reporting this issue and providing a detailed description. I understand that you're seeing a discrepancy in detection results between using predict.py and val.py on your test dataset, along with some confusion matrix anomalies.

The differences you're observing likely stem from the distinct purposes of these two modes. predict.py runs inference on any input source and saves the predictions (optionally with their confidence scores), while val.py is tailored for model validation and performance evaluation against a labelled dataset, reporting metrics such as mAP (mean Average Precision).
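
If it helps, here is a minimal sketch of running both modes through the Python API with identical thresholds, using the paths from your commands as placeholders:

```python
from ultralytics import YOLO

model = YOLO("./ultralytics/runs/detect/train5/weights/best.pt")

# Inference only: writes predictions (and optional txt/conf files); no labels needed.
results = model.predict(
    source="./datasets/splitteddata/images/test/positive",
    imgsz=640, conf=0.1, iou=0.45, max_det=300,
    save=True, save_txt=True, save_conf=True, device=0,
)

# Validation: matches predictions against ground-truth labels and reports P, R, mAP.
metrics = model.val(
    data="./ultralytics/cfg/datasets/mydataset.yaml",
    split="test", imgsz=640, batch=1,
    conf=0.1, iou=0.45, max_det=300, device=0,
)
print(metrics.box.map50)  # mAP@0.5 on the test split
```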

Regarding the predict and val comparisons, make sure the test datasets referenced in both commands precisely match. Check for any variations, such as possible symbolic links, different image lists, or caches that might affect the dataset loaded in each case.
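
A quick, non-authoritative way to sanity-check this is to compare the file lists directly. The two paths below are placeholders for your predict source and the test: entry resolved from mydataset.yaml (they should normally be identical):

```python
# Confirm both commands see the same files and that nothing is duplicated.
from collections import Counter
from pathlib import Path

IMG_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}

def list_images(root):
    return [p.name for p in Path(root).rglob("*") if p.suffix.lower() in IMG_EXTS]

predict_imgs = list_images("./datasets/splitteddata/images/test/positive")
val_imgs = list_images("./datasets/splitteddata/images/test/positive")

dupes = [name for name, n in Counter(val_imgs).items() if n > 1]
print("predict / val image counts:", len(predict_imgs), len(val_imgs))
print("duplicated filenames in val split:", dupes[:10])
print("in predict source but not in val split:",
      sorted(set(predict_imgs) - set(val_imgs))[:10])
```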

The confusion matrix anomaly, showing more images than in your dataset, could hint at possible duplicates in your validation loader or an issue with the test split definition in your dataset YAML file. Verify your dataset configuration and ensure no overlapping or repeated image references that could inflate the numbers.

The unexpected perfect precision and recall you mentioned when changing the batch size are indeed unrealistic, and they may indicate a critical bug. It's important to dissect this by eliminating potential side effects such as caching issues or misconfigurations.

Lastly, I would recommend pulling the latest update from our repository and trying again, as we continually fix bugs and improve the codebase, which might resolve your problem.

If the issue persists after trying these suggestions, please document your findings and share any additional information that could help isolate the bug. Open a GitHub issue with details, and we'll look into it. Our goal is to ensure that all functionalities, including validation and prediction, work seamlessly for our users.

Thank you for your contribution to the YOLO community! 🌟

fyang5 commented 9 months ago

Hi @glenn-jocher, I checked my files and confirm that the test files are the same for predict.py and val.py. I deleted the .cache file each time before running the test. Also, I found that with detect.py, when I changed iou_thresh from 0.45 to 0.20, the total number of detected bounding boxes decreased slightly, but in theory it should be the opposite. In YOLOv5 this works as expected, but not in YOLOv8.

Thanks!

glenn-jocher commented 9 months ago

Hello again, @fyang5!

It's great to hear that you've double-checked your files and cleared the cache to rule out those potential sources of discrepancy. The issue of a lower IoU threshold producing fewer detected bounding boxes, when in theory it should increase detections, is indeed peculiar.

In principle, a lower IoU threshold should decrease the strictness for matching a detection with a ground truth, which typically results in more detections being counted as true positivesβ€”though with a trade-off of potentially more false positives as well. If you're witnessing the opposite behavior, it suggests there may be a deeper issue, possibly with the IoU calculation or threshold application logic.

Would you be able to conduct a thorough check of the IoU threshold's behavior by running a series of tests where you incrementally adjust the IoU threshold from a very low value (e.g., 0.1) up to a high value (e.g., 0.7) and observe the changes in detection count? This could offer a more detailed view of how the IoU threshold is affecting detections.
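
As a rough sketch of such a sweep on the prediction side (model and source paths taken from your earlier commands, assuming the standard Python API):

```python
# Sweep the NMS IoU threshold and count the total number of predicted boxes.
from ultralytics import YOLO

model = YOLO("./ultralytics/runs/detect/train5/weights/best.pt")
source = "./datasets/splitteddata/images/test/positive"

for iou in (0.1, 0.2, 0.3, 0.45, 0.6, 0.7):
    # stream=True yields results one image at a time to keep memory usage low
    results = model.predict(source=source, conf=0.1, iou=iou, imgsz=640,
                            max_det=300, device=0, stream=True, verbose=False)
    n_boxes = sum(len(r.boxes) for r in results)
    print(f"iou={iou:.2f} -> {n_boxes} total boxes")
```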

Considering that YOLOv5 and YOLOv8 share similarities but also have distinct differences, it's important to narrow down if this behavior is specific to YOLOv8. We appreciate that you provided feedback on how it works in YOLOv5, which serves as a valuable reference point.

It's crucial for us to ensure that the tools we provide function as intended, so your insights are very valuable in helping us identify and rectify potential issues. I would also encourage you to add this new information to the GitHub issue so that the Ultralytics team can replicate and investigate this anomaly in more depth.

We greatly appreciate your collaborative approach and are committed to resolving this issue efficiently. Thank you for your efforts in improving YOLOv8! πŸ› πŸ’‘

github-actions[bot] commented 8 months ago

πŸ‘‹ Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.


Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO πŸš€ and Vision AI ⭐

naveenvj25 commented 7 months ago

I, too, have the same issue: there is a difference in the detection results. I am passing images from the same directory to both scripts, but when I cross-checked the predicted images there was a considerable difference. Can I use the validation code for testing just to get the evaluation metrics? Does the model use the passed labels for detection, or are they used for evaluation only?

glenn-jocher commented 7 months ago

@naveenvj25 hello!

Yes, you can use the validation code (val.py) to get evaluation metrics on your test set. The model does not use the passed labels for detection; it only uses them for evaluation purposes to compare the predicted results against the ground truth and calculate metrics like precision, recall, and mAP.
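
For reference, a minimal sketch of a metrics-only evaluation (paths are placeholders; save_txt/save_conf here assume the validator writes prediction txt files, as in recent versions):

```python
# Metrics-only evaluation on the test split; the ground-truth labels are used
# to score predictions, never to generate them.
from ultralytics import YOLO

model = YOLO("best.pt")
metrics = model.val(data="mydataset.yaml", split="test",
                    conf=0.1, iou=0.45, save_txt=True, save_conf=True)

print("precision:", metrics.box.mp)     # mean precision over classes
print("recall:   ", metrics.box.mr)     # mean recall over classes
print("mAP50:    ", metrics.box.map50)
print("mAP50-95: ", metrics.box.map)
```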

If you're observing differences in detection results, ensure that the same configuration, including confidence and IoU thresholds, is used in both cases. Discrepancies can sometimes arise from different settings or data handling between scripts.

Thank you for your input, and we're here to help if you have further questions!

naveenvj25 commented 7 months ago

Thank you for the information. I will use the validation script for evaluation and use the predicted txt files it outputs to draw bounding boxes with my own script.
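
This is roughly the kind of script I have in mind, assuming each txt line follows YOLO's normalized "class x_center y_center width height [conf]" format (paths below are hypothetical):

```python
# Draw boxes from a YOLO-format prediction txt file onto its image.
import cv2

def draw_yolo_txt(image_path, txt_path, out_path):
    img = cv2.imread(image_path)
    h, w = img.shape[:2]
    with open(txt_path) as f:
        for line in f:
            parts = line.split()
            cls = parts[0]
            xc, yc, bw, bh = map(float, parts[1:5])           # normalized coords
            conf = float(parts[5]) if len(parts) > 5 else None
            x1, y1 = int((xc - bw / 2) * w), int((yc - bh / 2) * h)
            x2, y2 = int((xc + bw / 2) * w), int((yc + bh / 2) * h)
            cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
            label = cls if conf is None else f"{cls} {conf:.2f}"
            cv2.putText(img, label, (x1, max(y1 - 5, 12)),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cv2.imwrite(out_path, img)

draw_yolo_txt("images/test/positive/0001.jpg",
              "runs/detect/val/labels/0001.txt",
              "0001_with_boxes.jpg")
```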

glenn-jocher commented 7 months ago

@naveenvj25 you're welcome! Using the validation script for evaluation and leveraging the output text files for custom bounding box drawing sounds like a solid plan. If you need further assistance, feel free to reach out. Happy coding! πŸš€