Closed · @khalidw closed this issue 3 years ago
@khalidw mAP is automatically computed using test.py after every epoch during training. See the Train Custom Data tutorial to get started.
@glenn-jocher Thanks for your response, but it seems my query was unclear.
I am hoping that the custom-trained model will have a better mAP score than the pretrained model on my custom dataset.
@khalidw you can use any model you want with test.py by passing it with the --weights argument:
python test.py --data your_data.yaml --weights any_model.pt
If the model was trained on the data, or has intersecting classes with your dataset, then you should get a nonzero mAP result.
@glenn-jocher this works perfectly fine with my custom-trained models, but I also want to get mAP for the pretrained models so that a comparison can be made.
Although mAP values for the pretrained models are published, they are for the VOC and COCO datasets. I want to generate mAP for the pretrained models on my custom dataset.
When I tried to run test.py with a pretrained model on my custom dataset (boats), I ran into an error. I wasn't expecting this, since my custom dataset and the pretrained models have an intersecting class (boat).
I added an entry in localDataset/localDataset.yaml, test: localDataset/test.txt, which points to the location of the test images.
I am sharing the results for both the custom-trained model (error-free) and the pretrained model (with an error).
!python test.py --weights yolov5x.pt --data localDataset/localDataset.yaml --img 640
Namespace(augment=False, batch_size=32, conf_thres=0.001, data='localDataset/localDataset.yaml', device='', exist_ok=False, img_size=640, iou_thres=0.6, name='exp', project='runs/test', save_conf=False, save_hybrid=False, save_json=False, save_txt=False, single_cls=False, task='val', verbose=False, weights=['yolov5x.pt'])
YOLOv5 v4.0-20-ge8a41e8 torch 1.7.0+cu101 CUDA:0 (Tesla T4, 15109.75MB)
Fusing layers...
Model Summary: 476 layers, 87730285 parameters, 0 gradients, 218.8 GFLOPS
val: Scanning 'localDataset/labels.cache' for images and labels... 100 found, 0 missing, 0 empty, 0 corrupted: 100% 100/100 [00:00<00:00, 731990.23it/s]
Class Images Targets P R mAP@.5 mAP@.5:.95: 0% 0/4 [00:01<?, ?it/s]
Traceback (most recent call last):
File "test.py", line 321, in <module>
save_conf=opt.save_conf,
File "test.py", line 182, in test
confusion_matrix.process_batch(pred, torch.cat((labels[:, 0:1], tbox), 1))
File "/gdrive/My Drive/object_detection/YOLOv5/utils/metrics.py", line 146, in process_batch
self.matrix[gc, detection_classes[m1[j]]] += 1 # correct
IndexError: index 8 is out of bounds for axis 1 with size 2
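A minimal sketch of why this indexing fails, assuming the custom localDataset.yaml declares nc: 1 (boat only) while the COCO-pretrained yolov5x.pt predicts boat as class index 8 in the 80-class COCO indexing (both the nc value and the class index are assumptions, not taken from the logs above):

import numpy as np

nc = 1                                # classes in the custom dataset yaml (assumed: 'boat' only)
matrix = np.zeros((nc + 1, nc + 1))   # confusion matrix is (nc + 1) x (nc + 1), background included

gt_class = 0          # 'boat' in the custom single-class dataset
detection_class = 8   # 'boat' in the 80-class COCO indexing used by yolov5x.pt

try:
    matrix[gt_class, detection_class] += 1   # same style of update as process_batch in metrics.py
except IndexError as e:
    print(e)   # index 8 is out of bounds for axis 1 with size 2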
# yolov5x
!python test.py --weights runs/train/exp3/weights/best.pt --data localDataset/localDataset.yaml --img 640
Namespace(augment=False, batch_size=32, conf_thres=0.001, data='localDataset/localDataset.yaml', device='', exist_ok=False, img_size=640, iou_thres=0.6, name='exp', project='runs/test', save_conf=False, save_hybrid=False, save_json=False, save_txt=False, single_cls=False, task='val', verbose=False, weights=['runs/train/exp3/weights/best.pt'])
YOLOv5 v4.0-20-ge8a41e8 torch 1.7.0+cu101 CUDA:0 (Tesla T4, 15109.75MB)
Fusing layers...
Model Summary: 476 layers, 87198694 parameters, 0 gradients, 217.1 GFLOPS
val: Scanning 'localDataset/labels.cache' for images and labels... 100 found, 0 missing, 0 empty, 0 corrupted: 100% 100/100 [00:00<00:00, 1061849.11it/s]
Class Images Targets P R mAP@.5 mAP@.5:.95: 100% 4/4 [00:04<00:00, 1.03s/it]
all 100 304 0.731 0.947 0.897 0.463
Speed: 19.7/2.4/22.0 ms inference/NMS/total per 640x640 image at batch-size 32
Results saved to runs/test/exp2
@khalidw you can only test models on datasets with identical classes. You test a COCO trained model on COCO only.
Why doesn't the sum add up to 1 for the predicted scores in the confusion matrix?
@Akhp888 the columns are normalized in the confusion matrix, not the rows.
There is also a PR #2114 open with a slightly different confusion matrix implementation that you may want to look at.
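For illustration, a minimal sketch of that column normalization with made-up counts (the epsilon value is an assumption, not copied from the repo):

import numpy as np

# raw counts: rows = true class (plus background), columns = predicted class (plus background)
raw = np.array([[50.,  5.],
                [10., 35.]])

col_norm = raw / (raw.sum(0, keepdims=True) + 1e-9)   # divide each predicted-class column by its sum

print(col_norm.sum(axis=0))   # each column sums to ~1
print(col_norm.sum(axis=1))   # rows generally do not, which is why they don't add up to 1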
Thanks for the reply @glenn-jocher. Even taking the normalization you mentioned into account, I still find it hard to interpret the background FN/FP values; can you help me generalize the scores? I also tried PR #2114, where I see the axis of the calculation has reversed, but it is still not something I could relate to.
Thanks
@Akhp888 yeah, don't worry, the confusion matrix is certainly pretty confusing. In general everyone is used to seeing classification confusion matrices, which are simpler due to the lack of the background class we have here.
I'm not sure about row and column normalization both at the same time. Is that even possible? i.e. is there a closed-form solution for that?
You can try testing at a few different confidence levels to understand how the confusion matrix works (e.g. 0.001, 0.1, 0.9).
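For example, reusing the command from earlier in this thread (paths are the ones shown above; adjust weights and data to your own):

python test.py --weights runs/train/exp3/weights/best.pt --data localDataset/localDataset.yaml --img 640 --conf-thres 0.001
python test.py --weights runs/train/exp3/weights/best.pt --data localDataset/localDataset.yaml --img 640 --conf-thres 0.1
python test.py --weights runs/train/exp3/weights/best.pt --data localDataset/localDataset.yaml --img 640 --conf-thres 0.9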
I was trying to figure out a way to get the same normalization for the ground-truth and predicted values when I noticed that the built-in xyxy2xywh and xywh2xyxy give me different values than expected. E.g. for xyxy = [5994.9658203125, 1397.2547607421875, 6290.80908203125, 1770.0487060546875] with gn = [32579, 2048, 32579, 2048] I got xywh = [0.1885536015033722, 0.7732674479484558, 0.009080796502530575, 0.18202829360961914],
whereas by manual calculation I expected [0.1840131931708309, 0.6822533011436462, 0.009080796271179288, 0.18202829360961914]. Notice the difference in the y value?
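For reference, assuming the usual YOLOv5 convention that xywh stores the box center rather than the top-left corner, the numbers above are consistent with the built-in conversion; a quick check in plain Python (no repo code):

x1, y1, x2, y2 = 5994.9658203125, 1397.2547607421875, 6290.80908203125, 1770.0487060546875
gw, gh = 32579, 2048                 # normalization gain (image width, height)

xc = (x1 + x2) / 2 / gw              # 0.18855... matches the xyxy2xywh output above
yc = (y1 + y2) / 2 / gh              # 0.77327... matches the xyxy2xywh output above
w = (x2 - x1) / gw
h = (y2 - y1) / gh
print(xc, yc, w, h)

print(x1 / gw, y1 / gh)              # 0.18401..., 0.68225... -- the 'manual' values, i.e. the top-left corner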
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@khalidw you can only test models on datasets with identical classes. You test a COCO trained model on COCO only.
So does this mean that yolov5 cannot be used for unseen data from another totally different dataset?
@edervishaj YOLO models can be trained on any detection dataset and applied to any images. It's your responsibility to ensure that your deployment image space and your training image space are sufficiently similar to achieve good generalization.
Thank you for your reply.
YOLO models can be trained on any detection dataset and applied to any images
Just like OP, I tried running test.py using pretrained weights on a dataset with intersecting classes but ran into an error similar to the one shown by @khalidw.
@edervishaj I don't understand what you're asking. For directions on creating a proper dataset with train, val, and test subsets for training and testing with this repo, please see the Train Custom Data tutorial to get started.
YOLOv5 Tutorials
- Train Custom Data 🚀 RECOMMENDED
- Tips for Best Training Results ☘️ RECOMMENDED
- Weights & Biases Logging 🌟 NEW
- Supervisely Ecosystem 🌟 NEW
- Multi-GPU Training
- PyTorch Hub ⭐ NEW
- ONNX and TorchScript Export
- Test-Time Augmentation (TTA)
- Model Ensembling
- Model Pruning/Sparsity
- Hyperparameter Evolution
- Transfer Learning with Frozen Layers ⭐ NEW
- TensorRT Deployment
@glenn-jocher I think what they mean is that they have a set of test images (say with 2 classes, car and truck, both of which are included in the COCO dataset). They want to run test.py, but instead of using a custom model trained on these two classes, they want to obtain the mAP on their test images using the default yolov5s.pt, which was trained on all 80 classes.
@glenn-jocher I have created a custom model for 2 classes (good and defective). In all, I have split my dataset into 3 parts: train, val and test.
When I use
python /content/yolov5/test.py --weights /content/yolov5/runs/train/exp3/weights/best.pt --data coco128.yaml --img 640 --augment --half --conf-thres 0.5 --device 0
and my coco128.yaml is
# Train/val/test sets as 1) dir: path/to/imgs, 2) file: path/to/imgs.txt, or 3) list: [path/to/imgs1, path/to/imgs2, ..]
path: /content/drive/MyDrive/Spacer_crop_copy # dataset root dir
train: images/train # train images (relative to 'path') 128 images
val: images/val # val images (relative to 'path') 128 images
test: /content/drive/MyDrive/test # test images (optional)
# Classes
nc: 2 # number of classes
names: [ 'Defective', 'Good' ] # class names
Now the script only takes the val folder into consideration, whereas I want detailed metrics for the test folder. Please help.
@Ankit-Vohra Simply replace test with val to generate metrics for the test data.
@Ankit-Vohra you can use the --task argument with test.py to point it to the split you are interested in evaluating:
https://github.com/ultralytics/yolov5/blob/d204a61834d0f6b2e73c1f43facf32fbadb6b284/test.py#L315
i.e.
python test.py --task test
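For example, combined with the command shown earlier in this thread (paths as given above):

python /content/yolov5/test.py --weights /content/yolov5/runs/train/exp3/weights/best.pt --data coco128.yaml --img 640 --task test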
The test.py file seems to have been removed from the repository. Is it alright to use val.py with --task test instead to obtain evaluation metrics on the test set?
Yes
val.py doesn't seem to accept --task test?
@Latzi yes, you are correct. The val.py script does not have a --task test argument. However, you can still use val.py to evaluate your model on the test set by running the script without any additional arguments. The script will automatically detect the test set based on your dataset configuration file (--data argument) and generate evaluation metrics for the test set.
I did run it by adding --task val. I assume the set evaluated was the dataset in the val folder? In my yaml file I have specified train, val and test. So if I understand you correctly, if I simply run !python val.py --weights runs/train/exp/weights/best.pt --data Person.yaml, the evaluated set will be the data specified in the yaml under the test entry? Is that correct?
@Latzi yes, that is correct. If you run val.py without any additional arguments, it will automatically detect the test set based on your dataset configuration file (--data argument) and generate evaluation metrics for that test set. In your case, since you have specified train, val, and test subsets in your YAML file, running the command python val.py --weights runs/train/exp/weights/best.pt --data Person.yaml will evaluate the model on the test set specified in your YAML file.
Hi @glenn-jocher . Thanks for your answer. It is all clear now :-)
@Latzi you're welcome! I'm glad I could help clarify things for you. If you have any further questions or need any more assistance, feel free to ask. Good luck with your project!
Hi @glenn-jocher. Something's up; it doesn't make sense. I ran val.py without --task test as we discussed and it ran fine, but when I look at the output I got results on 864 images with 82 instances, which are exactly the number of images and annotated class instances I have in the val folder. The test folder has 975 images with 102 instances. So by the looks of it, val.py ran on the images in the val folder, not the test folder? Running the !python val.py --weights runs/train/exp/weights/best.pt --data Person.yaml line returns results from the val folder. My yaml file looks like this:
train: ../train_data_0123_4/images/train/ # train images
val: ../train_data_0123_4/images/val/ # val images
test: ../train_data_0123_4/test/
nc: 1 # number of classes
names: ['Person']
Please help me out with this one
@glenn-jocher If I swap val with test in the yaml file, then val.py evaluates the test folder, as I can see the correct number of images and instances now. Also, the results obtained by running val.py on the test folder are significantly higher than when I run it on the val folder. Really big difference?
For the test folder I get:
val: Scanning /content/train_data_4012_3/test/labels.cache... 975 images, 873 backgrounds, 0 corrupt: 100% 975/975 [00:00<?, ?it/s]
Class Images Instances P R mAP50 mAP50-95: 100% 31/31 [00:07<00:00, 4.00it/s]
all 975 107 0.999 0.991 0.995 0.963
Speed: 0.1ms pre-process, 2.3ms inference, 0.7ms NMS per image at shape (32, 3, 640, 640)
Results saved to runs/val/exp3
while for the val folder the results are:
val: Scanning /content/train_data_4012_3/labels/val.cache... 864 images, 782 backgrounds, 0 corrupt: 100% 864/864 [00:00<?, ?it/s]
Class Images Instances P R mAP50 mAP50-95: 100% 27/27 [00:06<00:00, 4.35it/s]
all 864 82 0.823 0.415 0.509 0.298
Speed: 0.1ms pre-process, 2.2ms inference, 0.8ms NMS per image at shape (32, 3, 640, 640)
Is that normal? Or am I making some fundamental mistake? The files in the test folder were never seen by the model during training. (I am running a 5-fold cross-validation experiment and the 5th fold is used for the test folder.)
@Latzi hi there! It seems you're experiencing a difference in the evaluation metrics when running val.py on the "val" and "test" folders using your YAML file. Just to clarify, the "test" folder contains images that were not seen by the model during training.
The reason for the discrepancy in results could be because the model has not been exposed to the test images during training, so it may not have learned to generalize well on this specific unseen data. This can lead to lower performance in terms of precision, recall, and mAP.
Furthermore, please note that evaluation metrics can vary based on the distribution and complexity of the data in the respective folders. It's possible that the test data contains more challenging instances or different scenarios compared to the validation data, resulting in lower performance metrics.
It's essential to evaluate the model on unseen data to gauge its performance on new and unseen samples. The results obtained from evaluating the "test" folder provide a better indication of how the model will perform in real-world scenarios.
If you have any further questions or need additional assistance, please feel free to ask.
@glenn-jocher The dataset has been split into 5 equal parts, totally at random. One of these parts is called test; it has been loaded into the Colab environment and the yaml file points to this folder. This test folder has images and labels folders inside. It appears that when I test the model I get higher performance metrics (which I am not complaining about :-) ), except that I still don't understand why there is this huge difference. The files I ran the val.py script on (the val folder) are the images that were used during training as the validation set, yet I am getting much lower mAP values, as you can see above. I don't quite understand why that is. Also, the only way to get val.py to analyze the files in the test folder (unseen during training) is to swap the val and test paths... That makes val.py look inside the test folder instead.
I just want to double, triple, quadruple check that the files in the test folder are not used for training or validation purposes, right? Even if they are inside the train_data folder? The training script only uses the train and val folders, right?
@Latzi the test folder should indeed contain images and labels that were not used during training or validation. It's important for evaluation purposes to have a separate set of unseen data to assess the model's performance on new samples. Swapping the paths in the YAML file (putting the test path in the val field) allows val.py to analyze the files in the test folder.
Regarding the difference in performance metrics between the val and test folders, this can be influenced by various factors. The test data might contain different instances or scenarios compared to the validation data, leading to varying results. The model might not have seen the specific patterns or variations present in the test set, resulting in lower mAP values.
To further investigate, you can examine the specific instances where the model is struggling in the test set. Analyzing false positives and false negatives can provide insights into potential areas for improvement.
Rest assured that the training script only uses the train and val folders for training and validation purposes, respectively. The test folder remains completely unseen by the training process, which aligns with the standard evaluation setup.
If you have any additional questions or concerns, please let me know.
@glenn-jocher Thank you very much for your answers. The test set performs fantastically, while running val.py on the validation set gives much lower results. Really weird. Also, in val.py I am assuming that a series of evaluations is performed on the images, the detections are compared to the annotations, and the numbers are compiled at the end to form the values that are the output of val.py. The only question is: during these evaluations, what confidence level is the model set to? Whatever the F1 value was during training? It is surely not arbitrary. That is the last question, I promise :-). I really appreciate your time and help, it is awesome!
@Latzi the test set performing better than the validation set is indeed an interesting observation. There can be various reasons for this difference, such as the test set containing more challenging instances or different scenarios compared to the validation set. It could also be due to the model not being exposed to the specific patterns or variations present in the test set during training.
Regarding your question about the confidence level during evaluations in val.py, the confidence threshold used for evaluating the model's detections is not based on the F1 value during training. By default, val.py uses a confidence threshold of 0.001, which can be adjusted using the --conf-thres argument. This threshold determines the minimum confidence level required for a detection to be considered valid during evaluation.
I'm glad I could be of help, and I appreciate your kind words. If you have any more questions or need further assistance, please feel free to ask.
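For example, a higher threshold can be passed explicitly (weights and data paths mirror the command earlier in this thread; 0.45 is only an illustrative value):

python val.py --weights runs/train/exp/weights/best.pt --data Person.yaml --conf-thres 0.45

Note that mAP is normally computed at a very low threshold, so raising --conf-thres mainly makes the reported P and R reflect deployment behaviour and will typically lower the reported mAP.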
Hi @glenn-jocher. Yes, I saw the conf value of 0.001 in the val.py code, but that confuses me even further, as it seems super low. Wouldn't that mean the model will see objects of interest where there are no such objects, increasing the false positives through the roof? During training I had an F1 maxing out at, let's say, a confidence of 0.45. So in my mind, if I set the conf factor to 0.45 I should get a realistic picture of what the model's performance would be. If I ran detect.py with a confidence of 0.001 I'd get an avalanche of detections, most of which would be false positives.
@Latzi the confidence threshold used during evaluation in val.py is a value that determines the minimum confidence level required for a detection to be considered valid. A lower threshold, such as 0.001, means that even detections with very low confidence scores will be considered valid.
Setting the confidence threshold too low can indeed result in an increase in false positives, as the model may detect objects where there are none or where the confidence score is very low. It's important to strike a balance when choosing the confidence threshold based on your specific requirements and the desired trade-off between false positives and false negatives.
If you set the confidence factor to 0.45, as the F1 score suggests, it would be a more realistic threshold to consider the model's performance. This would filter out detections with confidence scores lower than 0.45 and provide a clearer picture of the model's accuracy.
I hope this clarifies the role of the confidence threshold in evaluating the model's performance. If you have any further questions, please feel free to ask.
@glenn-jocher That is what I was saying. Setting the confidence threshold to a more realistic value, closer to what the F1 curve suggests, is a better way of testing the performance than leaving it at the default 0.001. I am doing a 5-fold cross-validation exercise, and if I run val.py at 0.001 everything turns into an amorphous mass. Running val.py without setting a realistic conf value is where the confusion came from. In the Colab file, the val.py line doesn't have a confidence parameter, and I wrongly assumed that the confidence would be whatever the F1 curve suggested, as that would give a realistic result. That said, if that were the case, running val.py with previously trained models (if the session is still active) would create another issue, because for that particular model the F1-optimal confidence might be very different. So for this 5-fold validation exercise I will probably work out the average F1 for the 5 models, run val.py for each model on the test set (data not seen during training), then compare the results and work out the P, R and mAPs. The data I have is homogeneous (the same chunk, randomized) and was divided into 5 equal parts, with 4 parts used for training and validation and the 5th for testing. Given the homogeneous nature of the data I expect similar model performance. But it is all good now. Thank you for clarifying, and for your time and patience.
I kept referring to the F1 score, which is not really the right way to determine the best confidence level, but for unbalanced-class datasets (which is my case) I have found in past projects that a confidence factor near the value where the F1 score maxes out is always a good starting point. In any case, the n-fold validation results should be compared at the same confidence threshold to get meaningful results. All is clear, thank you again :-).
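As an aside, a minimal sketch of picking the confidence that maximizes F1 from a precision/recall sweep; the arrays are illustrative numbers, not output from val.py:

import numpy as np

# Illustrative sweep: precision rises and recall falls as the confidence threshold increases
conf      = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85])
precision = np.array([0.40, 0.55, 0.65, 0.75, 0.82, 0.88, 0.92, 0.95, 0.97])
recall    = np.array([0.95, 0.92, 0.88, 0.83, 0.76, 0.66, 0.52, 0.35, 0.18])

f1 = 2 * precision * recall / (precision + recall + 1e-16)
best = int(f1.argmax())
print(f"best F1 = {f1[best]:.3f} at confidence {conf[best]:.2f}")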
@Latzi hi there,
Setting a confidence threshold that aligns with the F1 score is indeed a good starting point, especially for unbalanced class datasets. It can provide a more realistic assessment of the model's performance. In your case, running the validation script with a confidence threshold closer to the F1 score max can help generate meaningful results.
Comparing n-fold validation results using the same confidence threshold is a valid approach to evaluate and compare the performance of different models. It ensures consistency in the evaluation process and allows for a fair comparison.
I'm glad I could help clarify the confusion, and if you have any further questions or need assistance, feel free to ask.
Thank you for reaching out and have a great day!
❔Question
Hi! I have a custom dataset of boats (only one class). I have labelled the data myself. I was wondering how I can generate mAP metrics for this data using the pretrained weights (yolov5x.pt).
I understand that test.py can be used to generate mAP metrics, but I cannot figure out how this can be done for my own labelled dataset.