ultralytics / yolov5

YOLOv5 πŸš€ in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Different txt result between test and detect #1656

Closed fishawd closed 3 years ago

fishawd commented 3 years ago


πŸ› Bug

Different txt results between test and detect when using the --save-txt flag.

To Reproduce (REQUIRED)

Image: coco2017val 000000581100.jpg

detect:

python detect.py --weights ./weights/yolov5x.pt --img 640 --conf 0.01 --iou 0.45 --save-txt --save-conf --source ./coco/images/val/000000581100.jpg

txt result:

0 0.844531 0.33125 0.0109375 0.0291667 0.0103378
19 0.800781 0.504167 0.0484375 0.0916667 0.0133362
19 0.0148437 0.415625 0.0296875 0.01875 0.0134354
19 0.0757812 0.4125 0.0234375 0.0208333 0.0136642
14 0.0257812 0.405208 0.0140625 0.0229167 0.0177917
19 0.490625 0.561458 0.2625 0.552083 0.0180817
0 0.0257812 0.405208 0.0140625 0.0229167 0.0305786
19 0.0078125 0.413542 0.015625 0.01875 0.0319214
17 0.490625 0.561458 0.2625 0.552083 0.0356445
19 0.0234375 0.40625 0.01875 0.025 0.0516052
14 0.00703125 0.69375 0.0140625 0.025 0.0806885
45 0.00703125 0.692708 0.0140625 0.0270833 0.300049
19 0.234375 0.559375 0.16875 0.152083 0.651367
23 0.65 0.58125 0.125 0.429167 0.916504
23 0.496094 0.563542 0.264062 0.539583 0.93457

test: coco-custom.yaml contains

val: ./coco/images/val (only 000000581100.jpg in this folder)

python test.py --weights ./weights/yolov5x.pt --data data/coco-custom.yaml --img 640 --conf 0.01 --iou 0.45 --save-txt --save-conf

txt result:

23 0.494141 0.564974 0.271094 0.534635 0.931641
23 0.650391 0.579687 0.126563 0.428125 0.919922
19 0.235156 0.559505 0.170312 0.152865 0.538574
14 0.0256226 0.40625 0.0145752 0.0229167 0.032196
45 0.00708008 0.692187 0.0132813 0.0239583 0.0220184
14 0.00688477 0.69375 0.0133789 0.0239583 0.0187378
19 0.0256226 0.40625 0.0145752 0.0229167 0.0167847
14 0.598047 0.51875 0.0273438 0.0208333 0.0137939
14 0.00786133 0.413542 0.0154297 0.01875 0.0122833
19 0.800781 0.504297 0.0484375 0.0903646 0.0120468
14 0.0770264 0.411198 0.0225098 0.0223958 0.0103607

As shown above, not only are the txt results different, but some results below the threshold are also saved to the txt file.


glenn-jocher commented 3 years ago

@fishawd thanks for the bug report! I will try to reproduce.

glenn-jocher commented 3 years ago

@fishawd there are differences between the two methods in your example. First, you should run test.py with --batch 1, as detect.py does, and then set pad = 0.5 to copy the detect.py dataloader settings. Once you do this the results are nearly identical. Also note that results are not guaranteed to appear in the same order, so there is no expectation of row-for-row correspondence.
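For a quick check of agreement that ignores row order, something like the following could be used (a rough sketch, not part of the repo; the file paths and tolerance are illustrative only):

```python
# Rough sketch: compare two YOLO-format txt outputs independently of row order,
# with a small tolerance for coordinate rounding.
import numpy as np

def load_txt(path):
    # each row: class x_center y_center width height conf (normalized coordinates)
    return np.loadtxt(path, ndmin=2)

def same_detections(path_a, path_b, tol=5e-3):
    a, b = load_txt(path_a), load_txt(path_b)
    if a.shape != b.shape:
        return False
    # sort both sets by confidence (last column) before comparing row by row
    a = a[a[:, -1].argsort()]
    b = b[b[:, -1].argsort()]
    return bool(np.allclose(a, b, atol=tol))

# hypothetical file names
print(same_detections('runs/detect/exp/labels/000000000025.txt',
                      'runs/test/exp/labels/000000000025.txt'))
```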

On COCO128 I get these two results with the above steps for 000000000025.jpg:

23 0.762001 0.49254 0.347197 0.692352 0.801417
23 0.187066 0.873637 0.221204 0.183953 0.676943
23 0.00876289 0.908577 0.0175258 0.148263 0.0224365
23 0.156741 0.918328 0.162197 0.121073 0.0183627
23 0.178919 0.83573 0.132281 0.184333 0.0125276
23 0.178906 0.835681 0.132812 0.183099 0.0125276
23 0.15625 0.91784 0.1625 0.122066 0.0183627
23 0.00859375 0.908451 0.0171875 0.150235 0.0224365
23 0.1875 0.873239 0.221875 0.183099 0.676943
23 0.7625 0.491784 0.346875 0.692488 0.801417

glenn-jocher commented 3 years ago

Also, all confidences are above the threshold; confidence is the last column (each row is class x_center y_center width height conf, with box coordinates normalized to image width and height). I don't see any bugs.

fishawd commented 3 years ago

@glenn-jocher Sorry about the threshold. I misread the txt columns because the order is different from what I'm used to. I ran test.py again but got the same result. pad=0.5 is the default in test.py's dataloader; I did not modify this parameter.

python test.py --batch 1 --weights ./weights/yolov5x.pt --data data/coco-custom.yaml --img 640 --conf 0.01 --iou 0.45 --save-txt --save-conf
23 0.494141 0.564974 0.271094 0.534635 0.931641
23 0.650391 0.579687 0.126563 0.428125 0.919922
19 0.235156 0.559505 0.170312 0.152865 0.538574
14 0.0256226 0.40625 0.0145752 0.0229167 0.032196
45 0.00708008 0.692187 0.0132813 0.0239583 0.0220184
14 0.00688477 0.69375 0.0133789 0.0239583 0.0187378
19 0.0256226 0.40625 0.0145752 0.0229167 0.0167847
14 0.598047 0.51875 0.0273438 0.0208333 0.0137939
14 0.00786133 0.413542 0.0154297 0.01875 0.0122833
19 0.800781 0.504297 0.0484375 0.0903646 0.0120468
14 0.0770264 0.411198 0.0225098 0.0223958 0.0103607

I also test 000000000025.jpg.

detect

23 0.00703125 0.879108 0.0140625 0.07277 0.0114975
23 0.0078125 0.907277 0.015625 0.138498 0.0186157
23 0.182031 0.899061 0.207813 0.13615 0.82959
23 0.771094 0.49061 0.354688 0.694836 0.943359

test

23 0.771094 0.489583 0.353125 0.699237 0.936523
23 0.181787 0.899648 0.208301 0.134977 0.855957
23 0.00737305 0.869718 0.0147461 0.0492958 0.0108719

glenn-jocher commented 3 years ago

Different dataloaders, set padding=0.0 to match and batch=1

bonorico commented 3 years ago

Hi Glenn, thanks for the support. I noticed that 'pad' is not an argument of def test. If this pad is necessary to obtain comparable test and detect results, it might be worth considering declaring 'pad' as an argument of def test...?
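For concreteness, that suggestion might look roughly like this (a hypothetical sketch assuming the test() signature of that era; not an actual patch):

```python
# Hypothetical sketch: thread a `pad` argument from test() down to the dataloader,
# keeping the current behaviour (0.5) as the default.
def test(data, weights=None, batch_size=32, imgsz=640,
         conf_thres=0.001, iou_thres=0.6, pad=0.5, **kwargs):
    # ... model loading and other setup elided ...
    # the val dataloader call would then simply forward the argument, e.g.:
    # dataloader = create_dataloader(path, imgsz, batch_size, stride, opt,
    #                                pad=pad, rect=True)[0]
    pass

# calling with pad=0.0 and batch_size=1 would then mirror detect.py:
# test('data/coco-custom.yaml', weights='weights/yolov5x.pt', batch_size=1, pad=0.0)
```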

Best

glenn-jocher commented 3 years ago

@bonorico sure! If you'd like please submit a PR with your proposed updates.

fishawd commented 3 years ago

@glenn-jocher Thanks! The results are the same with pad=0.0. So why is pad=0.5 the default in test.py?

glenn-jocher commented 3 years ago

@fishawd this adds padding around the image to reduce edge effects from repeated k=3 zero-padded convolutions, which helps COCO mAP slightly.
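Roughly speaking, in the rectangular-inference dataloader the pad value adds a fraction of a stride to each batch shape before rounding up to a stride multiple. A small illustration of that arithmetic (a sketch based on the rect-dataloader logic, not output from the repo):

```python
import numpy as np

def rect_shape(ratio_hw, img_size=640, stride=32, pad=0.5):
    # mirrors the batch-shape formula ceil(ratio * img_size / stride + pad) * stride,
    # where ratio_hw is the image aspect (h, w) with the longer side normalized to 1
    return np.ceil(np.array(ratio_hw) * img_size / stride + pad).astype(int) * stride

# e.g. a 480x640 (h x w) image:
print(rect_shape((0.75, 1.0), pad=0.5))  # [512 672] -> one extra stride (32 px) per dimension
print(rect_shape((0.75, 1.0), pad=0.0))  # [480 640] -> tight, stride-aligned fit
```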

bonorico commented 3 years ago

Why is padding specified in test.py and not in detect.py (at least from a surface reading of the code)? Why are test and detect not executing the same operations? In the end one would still expect test and detect to do basically the same thing: namely, to produce the (same) predictions from the trained model. I will share a reproducible example as soon as possible.

glenn-jocher commented 3 years ago

@bonorico detect.py and test.py source images from completely different dataloaders with different purposes, one providing labels and the other not.

fishawd commented 3 years ago

@glenn-jocher I see. Thanks!

bonorico commented 3 years ago

Dear Glenn, thanks a lot for your answer. I still find it very confusing (maybe that is my fault):

1) "...one providing labels and the other not.": Actually both test and detect provide labels with --save-txt (and no ground-truth labels). In which sense do you mean that test.py does not provide labels (if I understood correctly)?

2) "...different dataloaders with different purposes": I'm still puzzled why test and detect would have different purposes and functionalities. To be clear: my goal is to train the model, test it with the estimated weights, and use those weights to detect objects in video, expecting the same predictions as in test. Am I saying something strange? If so, how am I supposed to use detect to do mp4 object detection (i.e. prediction) in a way that reliably reflects test performance? Should I just run test.py frame by frame to obtain reliable mp4 detections?

I hope I'm making my point clear. To facilitate this, and also to come back to fishawd's question, I'm sharing a fully reproducible example showing the big difference between test and detect output. Maybe this example can provide the basis to clarify a correct usage of the respective modules, one that yields detect results consistent with test predictions. If this criterion is not fulfilled, I find it difficult to trust the detection results. Best

glenn-jocher commented 3 years ago

@bonorico test and detect perform similar functions on the surface, and at one point I was thinking of merging the two into one file, but they serve significantly different tasks and use different dataloaders and parameter defaults.

autolabelling is provided by both test.py and detect.py, but is not the primary function of either.

bonorico commented 3 years ago

Dear Glenn, thanks a lot for your answer.

It seems to me you are saying that test.py is designed to produce performance metrics under (optimal) conditions that might not be encountered in real-world usage?

I have extended the example on Colab to use detect.py under default settings (--conf 0.25 --iou 0.45). This yields many fewer FPs and roughly the same TPs as test.py, but confidence values are now much lower in detect.py than in test.

My conclusions after executing similar experiments on other data are the following so far:

1) Detection does not show the same diagnostic characteristics declared under test on the same piece of data. Typically detect.py has lower mAP.

2) Detection mAP keeps degrading as we let --conf approach the confidence value predicted under test (which is typically high).

For instance, setting detect.py to --conf 0.5 yields up to 15% lower mAP than test.py, and it keeps decreasing for --conf > 0.5. Here detect.py is producing more FNs, confirming that detect.py's sensitivity is not the one declared by test.

Hence, the predictor does not work in practice as expected. I don't know how customary this set-up is, but I feel it would confuse a number of people with backgrounds other than computer science, say statisticians.

One question: would a medical doctor be confident about these results when using this method to diagnose cancer severity from cancer-cell counts? If she were, as an extreme example, she might well send a patient home with cancer, telling them they have none.

Going back to the Colab example: "bed" is predicted with high confidence (90%) under test, implying the method is highly sensitive for "bed", which is good if you think of the cancer-cell example (we don't want to miss any there). However, using the method in practice (detect.py) yields a confidence for "bed" of 74%. This may still seem high, but think again of a cancer-cell application where we would set --conf as high as 80% (or the certified 90%) to reduce as many FPs as possible. This would now cause us to miss "bed" entirely, increasing the FNs.

It is not sufficient to have a tool that seems to perform well if its performance cannot be certified in a consistent manner. My humble suggestion is to declare precisely under which conditions test.py metrics are produced, and to enforce those conditions in detect.py by default, so as to produce consistent results between the two modules. Unrelated tasks such as auto-labelling should be kept clearly separated.

As a footnote, I never pass --augment in my examples, so I assume the same dataloader is used for test and detect?

If I can provide any help let me know.

FP = false positive, TP = true positive, FN = false negative

glenn-jocher commented 3 years ago

@bonorico the current setup follows standard practice. mAP is an all-encompassing metric that informs a user of the range of metrics that are achievable.

It is then up to the user to implement their own inference profile based on their domain knowledge and their domain priorities for FP reduction vs FN reduction.

In any case, none of the 'confidences' output by any vision AI system today follow statistical norms for 1-sigma etc. confidence bounds. The closest you might be able to get would be to create your own a posteriori uncertainty profiles using Monte Carlo methods on empirical results.
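One possible reading of that last suggestion, sketched as a simple bootstrap over per-image results (all numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical per-image precision measured at your chosen operating threshold
per_image_precision = np.array([0.91, 0.88, 0.95, 0.79, 0.90, 0.86, 0.93, 0.84])

# bootstrap resampling gives an empirical uncertainty band around the mean
boot = [rng.choice(per_image_precision, size=per_image_precision.size, replace=True).mean()
        for _ in range(10_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"mean {per_image_precision.mean():.3f}, 95% bootstrap interval [{lo:.3f}, {hi:.3f}]")
```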

> As a footnote, I never pass --augment in my examples, so I assume the same dataloader is used for test and detect?

The same dataloader is definitely not used, see previous post.

bonorico commented 3 years ago

Hi Glenn, thanks a lot for your answer. I hope I know what mAP is; I just did not know it was standard practice for a detection method to have different performance metrics than those declared under testing. I hope you realize that if I have to additionally tune detect.py to get high mAP, it implies a different optimization task on top of the one just done during training.

Which brings me to the main point of this unresolved discussion: I find it simply odd that detect.py does not have the same prediction characteristics obtained with train.py. I guess, for current usage, I will just bypass test.py and do my own independent mAP computations on the detect predictions, to really know what the detector's performance is in practice. Best

glenn-jocher commented 3 years ago

@bonorico I think you're not understanding the concept. mAP informs a user of achievable deployment metrics across all inference thresholds.

In production it is up to the domain expert to use this PR curve to select one set of inference thresholds. It is not possible to measure mAP at a single confidence threshold; the idea is meaningless.
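To make that concrete, one common way to pick a single operating point from the PR curve is to maximize F1 over a sweep of confidence thresholds. A toy sketch with made-up numbers:

```python
import numpy as np

# hypothetical precision/recall measured over a sweep of confidence thresholds
conf_thresholds = np.linspace(0.05, 0.95, 19)
precision = np.linspace(0.60, 0.98, 19)   # tends to rise as the threshold rises
recall    = np.linspace(0.95, 0.40, 19)   # tends to fall as the threshold rises

f1 = 2 * precision * recall / (precision + recall + 1e-16)
best = conf_thresholds[f1.argmax()]
print(f"best F1 {f1.max():.3f} at --conf {best:.2f}")  # one domain-agnostic operating point
```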

xuke96 commented 2 years ago

> Different dataloaders, set padding=0.0 to match and batch=1

I think this issue will help me a lot, so I already tried to reproduce it, but I have a question about this step. The difficulty I met is how to 'copy the detect.py dataloader settings'. Does it mean changing the dataloader settings in test.py to match those in detect.py? Looking forward to your reply. Thank you!

glenn-jocher commented 2 years ago

@xuke96 πŸ‘‹ Hello, thanks for asking about the differences between train.py, detect.py and val.py in YOLOv5 πŸš€.

These 3 files are designed for different purposes and utilize different dataloaders with different settings. train.py dataloaders are designed for a speed-accuracy compromise, val.py is designed to obtain the best mAP on a validation dataset, and detect.py is designed for best real-world inference results. A few important aspects of each:

train.py

val.py

detect.py

YOLOv5 PyTorch Hub Inference

YOLOv5 PyTorch Hub models are AutoShape() instances used for image loading, preprocessing, inference and NMS. For more info see the YOLOv5 PyTorch Hub Tutorial and https://github.com/ultralytics/yolov5/blob/7ee5aed0b3fa6a805afb0a820c40bbcbf29960de/models/common.py#L276-L282
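For example, a minimal Hub inference call looks roughly like this (the conf/iou attributes shown are the AutoShape NMS settings; the values and image path are just for illustration):

```python
import torch

# load a pretrained model via PyTorch Hub (downloads weights on first use)
model = torch.hub.load('ultralytics/yolov5', 'yolov5x')

# AutoShape exposes NMS settings as attributes
model.conf = 0.25   # confidence threshold
model.iou = 0.45    # NMS IoU threshold

# inference accepts a path, URL, PIL image, numpy array, etc.
results = model('coco/images/val/000000581100.jpg')
results.print()           # summary to console
boxes = results.xyxy[0]   # (n, 6) tensor: x1, y1, x2, y2, conf, class
```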

Good luck πŸ€ and let us know if you have any other questions!

bug-xns commented 2 years ago

> @xuke96 πŸ‘‹ Hello, thanks for asking about the differences between train.py, detect.py and val.py in YOLOv5 πŸš€. […]

Hi, thanks for your work. I'm confused: if test.py and detect.py only load the data in different ways, why are the results so different, even though they use the same detection model?