@fishawd thanks for the bug report! I will try to reproduce.
@fishawd there are differences in the two methods according to your example. First, you should run test.py with --batch 1, since that is how detect.py runs, and then you should set pad = 0.0 to again copy the detect.py dataloader settings. Once you do this the results are nearly identical. You also need to realize that results are not guaranteed to appear in the same order, so you should not expect row-by-row correspondence.
On COCO128 I get these two results with the above steps for 000000000025.jpg:
23 0.762001 0.49254 0.347197 0.692352 0.801417
23 0.187066 0.873637 0.221204 0.183953 0.676943
23 0.00876289 0.908577 0.0175258 0.148263 0.0224365
23 0.156741 0.918328 0.162197 0.121073 0.0183627
23 0.178919 0.83573 0.132281 0.184333 0.0125276
23 0.178906 0.835681 0.132812 0.183099 0.0125276
23 0.15625 0.91784 0.1625 0.122066 0.0183627
23 0.00859375 0.908451 0.0171875 0.150235 0.0224365
23 0.1875 0.873239 0.221875 0.183099 0.676943
23 0.7625 0.491784 0.346875 0.692488 0.801417
Also all confidences are above threshold. Confidence is the last column. I don't see any bugs.
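For reference, the change is in the test.py dataloader call. A minimal sketch, assuming a create_dataloader() helper like the one in utils/datasets.py (argument names and defaults vary across YOLOv5 versions, and `path` here is a hypothetical image directory, so treat this only as a guide):

```python
# Sketch: point test.py's dataloader at detect.py-style settings.
# Assumes the YOLOv5 repo is on the Python path.
from utils.datasets import create_dataloader

imgsz, gs = 640, 32            # inference size and max model stride
path = 'coco/images/val2017'   # hypothetical val image directory

dataloader = create_dataloader(path, imgsz, batch_size=1, stride=gs,
                               pad=0.0,   # test.py defaults to 0.5; detect.py adds no extra padding
                               rect=True  # rectangular inference, as in detect.py letterboxing
                               )[0]
```

Check your local utils/datasets.py for the exact signature before copying this.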
@glenn-jocher Sorry about the threshold. I misread the txt columns because the order differs from what I'm used to. I ran test.py again but got the same result. pad=0.5 is the default in test.py's dataloader; I did not modify this parameter.
python test.py --batch 1 --weights ./weights/yolov5x.pt --data data/coco-custom.yaml --img 640 --conf 0.01 --iou 0.45 --save-txt --save-conf
23 0.494141 0.564974 0.271094 0.534635 0.931641
23 0.650391 0.579687 0.126563 0.428125 0.919922
19 0.235156 0.559505 0.170312 0.152865 0.538574
14 0.0256226 0.40625 0.0145752 0.0229167 0.032196
45 0.00708008 0.692187 0.0132813 0.0239583 0.0220184
14 0.00688477 0.69375 0.0133789 0.0239583 0.0187378
19 0.0256226 0.40625 0.0145752 0.0229167 0.0167847
14 0.598047 0.51875 0.0273438 0.0208333 0.0137939
14 0.00786133 0.413542 0.0154297 0.01875 0.0122833
19 0.800781 0.504297 0.0484375 0.0903646 0.0120468
14 0.0770264 0.411198 0.0225098 0.0223958 0.0103607
I also tested 000000000025.jpg.
detect
23 0.00703125 0.879108 0.0140625 0.07277 0.0114975
23 0.0078125 0.907277 0.015625 0.138498 0.0186157
23 0.182031 0.899061 0.207813 0.13615 0.82959
23 0.771094 0.49061 0.354688 0.694836 0.943359
test
23 0.771094 0.489583 0.353125 0.699237 0.936523
23 0.181787 0.899648 0.208301 0.134977 0.855957
23 0.00737305 0.869718 0.0147461 0.0492958 0.0108719
Different dataloaders, set padding=0.0 to match and batch=1
Hi Glenn, thanks for the support. I noticed that 'pad' is not an argument of def test(). If this pad is necessary to obtain comparable test and detect results, would it perhaps be worth declaring 'pad' as an argument of def test()?
Best
@bonorico sure! If you'd like please submit a PR with your proposed updates.
@glenn-jocher Thanks! The results are the same when pad=0.0. So why is pad=0.5 the default in test.py?
@fishawd this adds padding around the image to reduce edge effects from repeated k=3 zero-padded convolutions, which helps COCO mAP slightly.
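For intuition, here is a rough sketch of the batch-shape computation used by the rect dataloader (based on LoadImagesAndLabels in utils/datasets.py; the exact expression may differ slightly between versions). The example image size is illustrative:

```python
import numpy as np

stride, img_size = 32, 640
shape = np.array([[480 / 640, 1.0]])  # (h/w, 1) for an example 640x480 landscape image

# Scale to the target size, add `pad` stride units of headroom, then round up
# to a stride multiple; the letterbox later distributes this as border.
for pad in (0.5, 0.0):
    print(pad, np.ceil(shape * img_size / stride + pad).astype(int) * stride)
# 0.5 -> [[512 672]]  (extra border around the image, the test.py default)
# 0.0 -> [[480 640]]  (no extra border, matching detect.py for this example)
```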
Why is padding specified in test.py and not in detect.py (at least on the face of the code)? Why are test and detect not executing the same operations? In the end one would still expect test and detect to do basically the same thing: namely, to produce the (same) predictions from the trained model. I will share a reproducible example as soon as possible.
@bonorico detect.py and test.py source images from completely different dataloaders with different purposes, one providing labels and the other not.
@glenn-jocher I see. Thanks!
Dear Glenn, thanks a lot for your answer. I still find it very confusing (maybe it is my fault):
1) "...one providing labels and the other not.": Actually both test and detect provide labels with --save-txt (and no ground-truth labels). In what sense do you mean that test.py does not provide labels (if I understood correctly)?
2) "...different dataloaders with different purposes": I'm still puzzled why test and detect would have different purposes and functionalities. To be clear, my goal is to train the model, test it under the estimated weights, and use those weights to detect objects in video, expecting the same predictions as in testing. Am I saying something strange? If so, how am I supposed to use detect to do mp4 object detection (i.e. prediction) in a way that reliably reflects test performance? Should I just use test.py on a frame-by-frame basis to obtain reliable mp4 detections, then?
I hope I'm making my point clear here. To facilitate this, and also to come back to fishawd's question, I'm sharing a fully reproducible example showing the big difference between test and detect output. Maybe this example can provide the basis for clarifying the correct usage of the respective modules, such that detect results are consistent with test predictions. If this criterion is not fulfilled, I find it difficult to trust detection results. Best
@bonorico test and detect perform similar functions on the surface, and at one point I was thinking of merging the two into one file, but on the other hand they serve significantly different tasks, and they use different dataloaders and different parameter defaults.
autolabelling is provided by both test.py and detect.py, but is not the primary function of either.
Dear Glenn, thanks a lot for your answer.
It seems to me you are saying that test.py is designed to produce performance metrics under (optimal) conditions that might not be encountered in real-world usage?
I have extended the example on Colab to use detect.py under default settings (--conf 0.25 --iou 0.45). This yields many fewer FPs and roughly the same TPs as test.py, but confidence values are now much lower in detect.py than in test.
My conclusions so far, after running similar experiments on other data, are the following:
1) Detection does not show the same diagnostic characteristics declared under test on the same data. Typically detect.py has lower mAP.
2) Detection mAP keeps degrading as we let --conf approach the confidence values predicted under test (which are typically high).
For instance, setting detect.py to --conf 0.5 yields up to 15% less mAP than test.py, and it keeps dropping for --conf > 0.5. Here detect.py is increasing FNs, confirming that detect.py's sensitivity is not the one declared by test.
Hence, the predictor does not work in practice as expected. I don't know how customary this set-up is, but I feel it would confuse a number of people with a background other than computer science, say statisticians.
One question: would a medical doctor be confident in these results when using this method to diagnose cancer severity from cancer-cell counts? If she were, as an extreme example, she might well send patients home with cancer, telling them they have none.
Going back to the Colab example: "bed" is predicted with high confidence (90%) under test, saying the method is highly sensitive for "bed", which is good if you think of the cancer-cell example (we don't want to miss any here). However, using the method in practice (detect.py) yields a confidence for "bed" of 74%. This may still seem high, but think again of the cancer-cell application, where we would set --conf as high as 80% (or the certified 90%) to reduce as many FPs as possible. This would now cause us to squarely miss "bed", increasing the FNs.
It is not sufficient to have a tool that seems to perform well if its performance cannot be certified in a consistent manner. My humble suggestion is to declare precisely under which conditions test.py metrics are produced, and to enforce those conditions in detect.py by default, so as to produce consistent results between the two modules. Unrelated tasks such as auto-labeling should be kept clearly separated.
As a footnote, I never pass --augment in my examples, so I assume the same dataloader is used for test and detect?
If I can provide any help let me know.
FP = false positive, TP = true positive, FN = false negative
@bonorico the current setup follows standard practice. mAP is an all-encompassing metric that informs a user of the range of metrics that are achievable.
It is then up to the user to implement their own inference profile based on their domain knowledge and their domain priorities for FP reduction vs FN reduction.
In any case, none of the 'confidences' output by any vision AI system today follow statistical norms for 1-sigma etc. confidence bounds. The nearest you might be able to approach this would be to create your own a posteriori uncertainty profiles using Monte Carlo methods on empirical results.
> As a footnote, I never pass --augment in my examples, so I assume the same dataloader is used for test and detect?
The same dataloader is definitely not used, see previous post.
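On the empirical-uncertainty point above, here is a minimal sketch (not part of YOLOv5) that collects per-class confidence distributions from a Hub model over a folder of images; the folder path is hypothetical, and the resulting statistics are only a starting point for building your own uncertainty profile:

```python
from collections import defaultdict
from pathlib import Path

import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5x')  # AutoShape() model
model.conf, model.iou = 0.001, 0.45                      # low conf to keep the full score range

conf_by_class = defaultdict(list)
for img in sorted(Path('datasets/coco128/images/train2017').glob('*.jpg')):  # hypothetical path
    for *_xywh, conf, cls in model(str(img)).xywhn[0].tolist():  # columns: x, y, w, h, conf, cls
        conf_by_class[int(cls)].append(conf)

for cls, confs in sorted(conf_by_class.items()):
    t = torch.tensor(confs)
    print(f'class {cls}: n={t.numel()}, mean conf={t.mean().item():.3f}, std={t.std().item():.3f}')
```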
Hi Glenn, thanks a lot for your answer. I do hope I know what mAP is; I just did not know it was standard practice for a detection method to have different performance metrics than those declared under testing. I hope you realize that, if I have to additionally tune detect.py to get high mAP, it implies a different optimization task on top of the one just performed during training. Which brings us to the main point of this unresolved discussion: I find it just weird that detect.py does not have the same prediction characteristics obtained with train.py. I guess, for the current usage, I will have to simply bypass test.py and do my own independent mAP computations on detect predictions, to really know what the detector's performance is in practice. Best
@bonorico I think you're not understanding the concept. mAP informs a user of achievable deployment metrics across all inference thresholds.
In production it is up to the domain expert to use this PR curve to select one set of inference thresholds. It is not possible to measure mAP at a single confidence threshold; the idea is meaningless.
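As a hedged illustration of picking one operating point from a PR curve (this is not YOLOv5 code; matching detections to ground truth is assumed to have been done already), one could sweep confidence thresholds and keep the one maximizing F1:

```python
import numpy as np

def best_conf_threshold(confidences, is_tp, n_gt):
    """Return the (threshold, F1) pair maximizing F1 over all detections.

    confidences: per-detection confidence scores
    is_tp:       1 if the detection matched a ground-truth box, else 0
    n_gt:        total number of ground-truth boxes
    """
    conf = np.asarray(confidences, dtype=float)
    tp_flag = np.asarray(is_tp, dtype=float)
    order = np.argsort(-conf)             # sort detections by descending confidence
    tp = np.cumsum(tp_flag[order])
    fp = np.cumsum(1.0 - tp_flag[order])
    precision = tp / (tp + fp)
    recall = tp / max(n_gt, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-16)
    i = int(f1.argmax())
    return float(conf[order][i]), float(f1[i])

# Toy example with made-up numbers:
print(best_conf_threshold([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 1], n_gt=4))  # -> (0.3, 0.75)
```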
> Different dataloaders, set padding=0.0 to match and batch=1
I think this issue will help me a lot, so I already tried to reproduce it, but I have a question about this step. The difficulty I met is how to "copy the detect.py dataloader settings". Does it mean changing the dataloader settings in test.py to the dataloader settings used in detect.py? Looking forward to your reply. Thank you!
@xuke96 Hello, thanks for asking about the differences between train.py, detect.py and val.py in YOLOv5.
These 3 files are designed for different purposes and utilize different dataloaders with different settings. train.py dataloaders are designed for a speed-accuracy compromise, val.py is designed to obtain the best mAP on a validation dataset, and detect.py is designed for best real-world inference results. A few important aspects of each:
train.py
- trainloader LoadImagesAndLabels(): designed to load train dataset images and labels. Augmentation capable and enabled. https://github.com/ultralytics/yolov5/blob/7ee5aed0b3fa6a805afb0a820c40bbcbf29960de/train.py#L210-L213
- val_loader LoadImagesAndLabels(): designed to load val dataset images and labels. Augmentation capable but disabled. https://github.com/ultralytics/yolov5/blob/7ee5aed0b3fa6a805afb0a820c40bbcbf29960de/train.py#L220-L223
- image size: 640
- rectangular inference: False
- confidence threshold: 0.001
- iou threshold: 0.6
- multi-label: True
- padding: None
val.py
- dataloader LoadImagesAndLabels(): designed to load train, val, test dataset images and labels. Augmentation capable but disabled. https://github.com/ultralytics/yolov5/blob/7ee5aed0b3fa6a805afb0a820c40bbcbf29960de/val.py#L152-L153
- image size: 640
- rectangular inference: True
- confidence threshold: 0.001
- iou threshold: 0.6
- multi-label: True
- padding: 0.5 * maximum stride
detect.py
- dataloaders (multiple): designed for loading multiple types of media (images, videos, globs, directories, streams). https://github.com/ultralytics/yolov5/blob/7ee5aed0b3fa6a805afb0a820c40bbcbf29960de/detect.py#L120-L128
- image size: 640
- rectangular inference: True
- confidence threshold: 0.25
- iou threshold: 0.45
- multi-label: False
- padding: None
YOLOv5 PyTorch Hub Inference
- YOLOv5 PyTorch Hub models are AutoShape() instances used for image loading, preprocessing, inference and NMS. For more info see the YOLOv5 PyTorch Hub Tutorial. https://github.com/ultralytics/yolov5/blob/7ee5aed0b3fa6a805afb0a820c40bbcbf29960de/models/common.py#L276-L282
- image size: 640
- rectangular inference: True
- confidence threshold: 0.25
- iou threshold: 0.45
- multi-label: False
- padding: None
Good luck and let us know if you have any other questions!
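For completeness, a minimal Hub usage sketch with the detect.py-style defaults listed above (attribute names as exposed by the AutoShape wrapper; adjust if your version differs):

```python
import torch

# Load an AutoShape() model from PyTorch Hub (downloads weights on first use)
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')

# Inference settings matching the detect.py defaults listed above
model.conf = 0.25   # confidence threshold
model.iou = 0.45    # NMS IoU threshold

results = model('https://ultralytics.com/images/zidane.jpg')  # path, URL, PIL image, ndarray, ...
results.print()          # summary to stdout
print(results.xyxy[0])   # per-image tensor with columns x1, y1, x2, y2, conf, cls
```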
Hi, thanks for your work. I'm confused: test.py and detect.py only load the datasets in different ways, yet the results are quite different even though they use the same detection model. Why is that?
Bug
Different txt results between test and detect when using the --save-txt command.
To Reproduce (REQUIRED)
image: coco2017val 000000581100.jpg
detect:
txt result:
test: coco-custom.yaml
txt result:
As shown above, not only are the txt results different, but some results below the threshold are also saved into the txt files.
Environment