ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Confusion matrix and other detailed analyses on a test set #1402

Closed yasindagasan closed 3 years ago

yasindagasan commented 3 years ago

🚀 Feature

  1. Confusion matrix outlining class confusions
  2. Error analyser
  3. Plot of objects inside labels in feature space

Motivation

I am using object detection on a custom dataset. We labelled quite a number of images, and we are interested in detecting the same type of object, i.e. fractures on bones. We set some rules to assign classes to the objects based on how they look. However, some classes have quite a number of borderline examples that are easy for the model to confuse. I thought a confusion matrix would be pretty handy to assess the model performance and update my labels. I would be very keen to have these features in the codebase if possible:

1. Confusion matrix: I would like to know which classes are getting confused so that I can correct my labels or merge some of the confused classes into a single class.

2. Error analyser: It would also be good to further analyse my labels and obtain potential improvements (i.e. revising the bounding boxes to better define the object boundaries etc.). Some of the object boundaries are not clear and different people can label the boundaries differently.

3. Plotting objects in feature space for similarity and outlier analysis: I would also be very interested in identifying problematic labels in the dataset. I was thinking maybe we could use the trained weights to extract features and then do dimensionality reduction with UMAP, colored by class.

Let me know what you think. I am happy to discuss it further!

glenn-jocher commented 3 years ago

@yasindagasan thanks for the suggestions! The fastest way to introduce some of these features would be to try them out yourself and submit a PR with your suggested updates. They are good ideas, but unfortunately we are quite saturated at the moment maintaining the repo, and we will soon be backpropagating recent updates to ultralytics/yolov3.

  1. I've seen confusion matrices requested before, but I'm not sure if people realize that these are generally only created for classification tasks. It's not clear to me how this would extend to object detection. Do you have any references for confusion matrices applied explicitly to object detection results?

  2. This one may be more in the domain of labelling tools, though we have also had requests before for dataset visualization tools, which is definitely a missing feature. I wonder if we could use something like a Plotly dashboard to put together an interactive visualizer. Updating and modifying the labels would be a bit above and beyond this, but I agree a viewer at the minimum is needed.

  3. I don't quite follow, could you show some examples of this?

yasindagasan commented 3 years ago

@glenn-jocher thanks very much for the reply! Yes I agree and can definitely understand your workload!

  1. I agree with you that confusion matrices are more suitable for classification tasks. Although we might not obtain exactly the same kind of information as from a classification confusion matrix, I was thinking it could still be useful. I have come across the following code that attempts to build a confusion matrix for object detection:

Code 1 (TensorFlow), Code 2

I will create a PR if I happen to have something out of these.

  2. A Plotly dashboard seems like a good idea. We could also use other available tools, but I am not sure how easy they would be to integrate. I am currently using labelImg and CVAT for labelling datasets. CVAT is quite nice and has an option to plug in a model for auto-labelling. In terms of setup difficulty, labelImg is pretty straightforward, while CVAT needs a bit of time to configure.

  3. Sorry, it was not clear. Maybe I can explain better with an example.

This part was more about uncertain labels. I sometimes work on image classification tasks where the labels were created by people from the domain. Although the labellers understand the domain, due to the complexity of the objects in an image they can still make mistakes. Some images are on the borderline between two classes and can be difficult to differentiate with the naked eye. These examples are easy for labellers to confuse.

What we generally do to handle such cases is train a model (e.g. a ResNet-50) and use the trained weights for feature extraction. We remove the last layer of the network, and for a given image we obtain a feature vector of length 512 or 2048. To visualise similarity and detect outliers, we then reduce the dimensionality to 2 or 3 dimensions using UMAP. Points that are close together in this feature space are expected to be more similar than points that are far apart. We then colour the points by the provided labels, spot images that were labelled wrongly, and make the necessary corrections on the labels. The image below is an example of such a feature space.

[image: example feature-space plot, points coloured by class]
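
For reference, a minimal sketch of this pipeline, assuming a torchvision ResNet-50 as the feature extractor and the umap-learn package; the crops/ folder layout (one sub-folder per class) and the output filename are placeholder choices:

```python
# Sketch: extract penultimate-layer features with a pretrained ResNet-50,
# then project them to 2-D with UMAP and colour the points by label.
import torch
import torchvision.models as models
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
import umap
import matplotlib.pyplot as plt

# ResNet-50 with the final fc layer replaced by Identity -> 2048-d feature vectors
backbone = models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

transform = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                       T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
dataset = ImageFolder("crops/", transform=transform)  # one sub-folder per class
loader = DataLoader(dataset, batch_size=64, shuffle=False)

features, labels = [], []
with torch.no_grad():
    for imgs, lbls in loader:
        features.append(backbone(imgs))
        labels.append(lbls)
features = torch.cat(features).numpy()
labels = torch.cat(labels).numpy()

# 2048-d -> 2-d embedding; same-class points should cluster together,
# and isolated or misplaced points are candidates for label review
embedding = umap.UMAP(n_components=2).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=4, cmap="tab10")
plt.colorbar(label="class id")
plt.savefig("feature_space.png", dpi=200)
```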

I was wondering if you think such an approach could somehow be applicable to object detection. I previously attempted this but did not have time to experiment further. Using the bounding boxes predicted by YOLO, I cropped the objects out and saved them as separate images at a standard size using resizing and padding. I then trained a fastai model to obtain weights, did dimensionality reduction, and clustered with DBSCAN.

One of the problems I encountered was that the objects come in varying aspect ratios, and resizing them can sometimes crop them or distort the aspect ratio. I am not sure about its applicability to object detection, but I just wanted to discuss it and get your thoughts.
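
One way around the aspect-ratio problem is to pad each crop onto a square canvas instead of stretching it, similar in spirit to YOLOv5's letterboxing. A rough sketch, assuming OpenCV; the function name, 224-pixel target size, and grey padding value are arbitrary choices:

```python
import cv2
import numpy as np

def crop_and_letterbox(img, xyxy, size=224, pad_value=114):
    """Crop a detected box from an image and resize it onto a square canvas
    without changing its aspect ratio (padding instead of stretching)."""
    x1, y1, x2, y2 = map(int, xyxy)
    crop = img[max(y1, 0):y2, max(x1, 0):x2]
    h, w = crop.shape[:2]
    scale = size / max(h, w)                                   # fit the longer side
    resized = cv2.resize(crop, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.full((size, size, 3), pad_value, dtype=np.uint8)
    top = (size - resized.shape[0]) // 2
    left = (size - resized.shape[1]) // 2
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas

# e.g. crops = [crop_and_letterbox(image, box) for box in detections[:, :4]]
```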

glenn-jocher commented 3 years ago

@yasindagasan interesting. I've raised an issue on https://github.com/kaanakan/object_detection_confusion_matrix/issues/6 to ask the author if he might help with integration.

As you've noticed in 3, aspect ratio modifications from stretching and other considerations complicate box extraction and classification/detection interoperability.

We are actually working on a YOLOv5 classifier though, so this may be suitable for that. The classifier is very easy to build: it's simply a YOLOv5 backbone with a Classify() head: https://github.com/ultralytics/yolov5/blob/8d2d6d2349cc4732667888435e9f01912d80a4ba/models/common.py#L227-L237
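
For anyone curious about the general pattern (this is an illustrative sketch, not a copy of the linked module): globally average-pool the backbone feature map, apply a 1x1 convolution to the number of classes, and flatten.

```python
import torch.nn as nn

class ClassifyHead(nn.Module):
    """Sketch of a classification head on top of a detection backbone.
    Illustrative only; see models/common.py in the repo for the real Classify module."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                        # (b, c, h, w) -> (b, c, 1, 1)
        self.conv = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.flatten = nn.Flatten()                                # (b, n, 1, 1) -> (b, n)

    def forward(self, x):
        return self.flatten(self.conv(self.pool(x)))

# e.g. logits = ClassifyHead(in_channels=1024, num_classes=5)(backbone_features)
```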

Ownmarc commented 3 years ago

Your ideas 2 and 3 are in line with the suggestion I made here: https://github.com/ultralytics/yolov5/issues/895

I really think this deserves some thought. I understand this would not be of much use for increasing mAP on COCO, since you can't really change the labels, but the labels for custom models on custom datasets are almost always the place to start when you want to get better results from your YOLOv5 models.

glenn-jocher commented 3 years ago

@yasindagasan @Ownmarc 1 and 2 are definitely more closed-ended, feasible ideas that could be implemented. 1 in particular could perhaps be plotted at the same time as the PR curve, using the same TP and FP vectors. BTW, I recently updated a few plots in https://github.com/ultralytics/yolov5/pull/1432 and https://github.com/ultralytics/yolov5/pull/1428, including the PR curves and the labels plots, for better introspection.

3 is a bit more open ended, but I understand the desire for better failure mode analysis and post training introspection tools. This is somewhat in the same direction as active learning, or adapting your labels based on training feedback. I'll have to think about it.

One update for post training analysis is that you can use a confidence slider on Weights & Biases results to help you determine a best confidence threshold for deployment. This is rather new and useful, but mainly suitable for just that one task of determining a best real-world confidence threshold to use. You can see an example here (click the gear on the Media panel): https://wandb.ai/glenn-jocher/yolov5_tutorial/reports/YOLOv5-COCO128-Tutorial-Results--VmlldzozMDI5OTY

yasindagasan commented 3 years ago

@Ownmarc yes, this is very much related to your #895 suggestions. Sorry, I was not aware of that thread. I have also found that labels on custom datasets have a huge impact on success. Any improvement there would be very helpful.

@glenn-jocher I have just seen your recent updates. I like the PR curve colored by class in #1428; it is very useful!

The classifier sounds suitable, actually. I will be experimenting with it soon.

glenn-jocher commented 3 years ago

@yasindagasan @Ownmarc I've integrated a confusion matrix now into test.py. See PR https://github.com/ultralytics/yolov5/pull/1474

There's some unfortunate overlap between the computations inside the confusion matrix class and the mAP computation code, in particular that they both compute IoU matrices separately (duplication of effort), but this will have to do for now. The confusion matrix adds about 5-10 seconds of wall-clock time to test.py, i.e. a typical YOLOv5m COCO test.py run will now take 1:25, up 10 seconds from 1:15 before.
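
For readers who want the gist of how such a detection confusion matrix can be built, here is a simplified sketch of the general recipe (not the repository's implementation): match predictions to ground truths greedily by IoU, and route any unmatched boxes to an extra background row/column.

```python
import numpy as np

def box_iou(a, b):
    """IoU between two sets of (x1, y1, x2, y2) boxes -> (len(a), len(b)) matrix."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = np.maximum(a[:, None, :2], b[None, :, :2])
    rb = np.minimum(a[:, None, 2:], b[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def detection_confusion_matrix(preds, pred_cls, gts, gt_cls, num_classes, iou_thr=0.45):
    """Rows = predicted class, columns = true class; the extra last index is 'background'.
    Unmatched ground truths become background FNs, unmatched predictions background FPs."""
    m = np.zeros((num_classes + 1, num_classes + 1), dtype=int)
    iou = box_iou(preds, gts) if len(preds) and len(gts) else np.zeros((len(preds), len(gts)))
    matched_gt, matched_pred = set(), set()
    # greedy one-to-one matching, highest IoU first
    for pi, gi in sorted(zip(*np.where(iou >= iou_thr)), key=lambda x: -iou[x]):
        if pi in matched_pred or gi in matched_gt:
            continue
        matched_pred.add(pi); matched_gt.add(gi)
        m[pred_cls[pi], gt_cls[gi]] += 1
    for pi in range(len(preds)):
        if pi not in matched_pred:
            m[pred_cls[pi], num_classes] += 1   # predicted object with no matching GT
    for gi in range(len(gts)):
        if gi not in matched_gt:
            m[num_classes, gt_cls[gi]] += 1     # GT object missed entirely
    return m
```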

[image: confusion matrix generated by test.py]

yasindagasan commented 3 years ago

this is awesome thanks @glenn-jocher!

glenn-jocher commented 3 years ago

@yasindagasan I'll leave this issue open as https://github.com/ultralytics/yolov5/pull/1474 only partially satisfies the feature additions.

After considering the results a bit, I think unfortunately the conclusions you can draw from the confusion matrices in object detection may be somewhat limited, as it seems that by far the largest cross-class confusion is simply class {x} to background, regardless of x.

Still, any extra information should help everyone understand their results a bit better :)

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mansi-aggarwal-2504 commented 3 years ago

I was trying to obtain the confusion matrix using test.py after referring to this PR, and I'm using Colab. How can I obtain the matrix for my custom dataset? Do I have to run the test.py script and pass a parameter explicitly?

Edit: if the matrix is produced automatically at the end of training as mentioned here, how can I save it in case I can't visualise it in Colab? Also, I would want to produce the matrix for the test data that I pass to detect.py. I have the predictions and I have the ground truth; can someone guide me on how to do that?

glenn-jocher commented 3 years ago

@mansi-aggarwal-2504 test.py automatically generates confusion matrices. Results are logged to the directory indicated, i.e. runs/test/exp

mansi-aggarwal-2504 commented 3 years ago

@mansi-aggarwal-2504 test.py automatically generates confusion matrices. Results are logged to the directory indicated, i.e. runs/test/exp

Thank you @glenn-jocher for such a prompt response! I must have been missing something earlier; I've found it now. This must be the matrix for the validation data, and I want to do the same thing for my test data with predictions and ground truth. Shall I make a yaml file and point it at this test dataset, or is there a more efficient way? Also, I want to understand how to interpret the matrix. I have a single class, i.e. flower. [image: confusion_matrix] I understand the first column, but I don't get the second column, i.e. background FP mapped to flower with value 1.0. What does the second column mean?

glenn-jocher commented 3 years ago

@mansi-aggarwal-2504 you can run test.py on any split of your dataset (train, val, test) using the --task flag: https://github.com/ultralytics/yolov5/blob/7b36e38cf8f3d3c08e973b18913ae8e41ff970b2/test.py#L297
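
For example, with a dataset yaml that defines a test: split (custom.yaml here is a placeholder name), the command would look something like:

```bash
# evaluate on the split named under 'test:' in your dataset yaml
python test.py --weights runs/train/exp/weights/best.pt --data custom.yaml --task test
# the confusion matrix and other plots are saved to runs/test/exp*
```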

glenn-jocher commented 3 years ago

The matrix indicates that 100% of your background FPs are caused by the flower category.

mansi-aggarwal-2504 commented 3 years ago

@glenn-jocher thank you very much. I will use the --task flag.

Also:

The matrix indicates that 100% of your background FPs are caused by the flower category.

Got it, thanks!

mansi-aggarwal-2504 commented 3 years ago

@mansi-aggarwal-2504 you can run test.py on any split of your dataset (train, val, test) using the --task flag:

https://github.com/ultralytics/yolov5/blob/7b36e38cf8f3d3c08e973b18913ae8e41ff970b2/test.py#L297

I set the task to test and uploaded the ground truth for the test set. I received one confusion matrix. So is this aggregated over all the images in my test set?

[image: confusion matrix for the test split]

Is there a way to get a separate matrix for all images?

glenn-jocher commented 3 years ago

@mansi-aggarwal-2504 that is correct, a set of images generates one confusion matrix.

mansi-aggarwal-2504 commented 3 years ago

@mansi-aggarwal-2504 that is correct, a set of images generates one confusion matrix.

Is there a way to get a separate matrix for all images?

glenn-jocher commented 3 years ago

@mansi-aggarwal-2504 the confusion matrix already applies to all images.

mansi-aggarwal-2504 commented 3 years ago

@glenn-jocher but the resultant matrix is like an average over all the images in the test set, right?

glenn-jocher commented 3 years ago

@mansi-aggarwal-2504 yes one confusion matrix is generated for the entire dataset.

Mohamed-Elredeny commented 3 years ago

[image: confusion matrix] Can someone explain this to me?

glenn-jocher commented 3 years ago

@Mohamed-Elredeny see https://en.wikipedia.org/wiki/Confusion_matrix

mansi-aggarwal-2504 commented 3 years ago

I noticed that when I run test.py, the total number of objects detected is 4718, i.e. TP + FP. [image: confusion matrix]

But the total number of objects detected by detect.py was 4605. I also ran detect.py with the same conf_thres and iou_thres values as the test.py defaults, i.e. --iou-thres 0.6 --conf-thres 0.001, but the count is still different. How should I change the parameters in detect.py to get the same results as test.py? I want the same object count and TP count as test.py gives.

EDIT: 300 flowers in all images when the parameters are kept the same as in test.py.

[image: screenshot]

I also tried running test.py with --iou-thres=0.45 --conf-thres=0.25, which are the defaults for detect.py, but there is still a difference in the number of objects detected.

glenn-jocher commented 3 years ago

@mansi-aggarwal-2504 see test.py and metrics.py for TP and FP computation.

priyankadank commented 2 years ago

@mansi-aggarwal-2504 test.py automatically generates confusion matrices. Results are logged to the directory indicated, i.e. runs/test/exp

Thank you @glenn-jocher for such a prompt response! I must have been missing something earlier; I've found it now. This must be the matrix for the validation data, and I want to do the same thing for my test data with predictions and ground truth. Shall I make a yaml file and point it at this test dataset, or is there a more efficient way? Also, I want to understand how to interpret the matrix. I have a single class, i.e. flower. [image: confusion_matrix] I understand the first column, but I don't get the second column, i.e. background FP mapped to flower with value 1.0. What does the second column mean?

I am getting a similar confusion matrix. I have the following questions:

  1. I did not understand the fourth value, i.e. bottom right. Can you please explain it?
  2. How can I decrease the third value (top right), i.e. 1.0?
  3. How can I increase the fourth value, i.e. bottom right?

glenn-jocher commented 2 years ago

@priyankadank 👋 Hello! The columns are normalized (each column is divided by its total). Thanks for asking about improving YOLOv5 🚀 training results.
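
To make the normalization concrete, here is a toy single-class illustration (the counts are made up purely to show the arithmetic, they are not your data):

```python
import numpy as np

# Rows = predicted (flower, background), columns = true (flower, background).
counts = np.array([[90., 30.],   # 90 true flowers detected, 30 false detections on background
                   [10.,  0.]])  # 10 true flowers missed (predicted as background)

normalized = counts / counts.sum(axis=0, keepdims=True)  # divide each column by its total
print(normalized)
# [[0.9 1. ]
#  [0.1 0. ]]
# Flower column: 90% of true flowers were detected, 10% were missed.
# Background column: every background false positive was predicted as 'flower', hence 1.0.
# The bottom-right cell stays 0 because "background correctly predicted as background"
# is not a countable event for a detector (there are no background boxes to get right).
```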

Most of the time good results can be obtained with no changes to the models or training settings, provided your dataset is sufficiently large and well labelled. If at first you don't get good results, there are steps you might be able to take to improve, but we always recommend users first train with all default settings before considering any changes. This helps establish a performance baseline and spot areas for improvement.

If you have questions about your training results we recommend you provide the maximum amount of information possible if you expect a helpful response, including results plots (train losses, val losses, P, R, mAP), PR curve, confusion matrix, training mosaics, test results and dataset statistics images such as labels.png. All of these are located in your project/name directory, typically yolov5/runs/train/exp.

We've put together a full guide for users looking to get the best results on their YOLOv5 trainings below.

Dataset

[image: COCO analysis]

Model Selection

Larger models like YOLOv5x and YOLOv5x6 will produce better results in nearly all cases, but have more parameters, require more CUDA memory to train, and are slower to run. For mobile deployments we recommend YOLOv5s/m, for cloud deployments we recommend YOLOv5l/x. See our README table for a full comparison of all models.

[image: YOLOv5 model comparison]

Training Settings

Before modifying anything, first train with default settings to establish a performance baseline. A full list of train.py settings can be found in the train.py argparser.

Further Reading

If you'd like to know more a good place to start is Karpathy's 'Recipe for Training Neural Networks', which has great ideas for training that apply broadly across all ML domains: http://karpathy.github.io/2019/04/25/recipe/

Good luck 🍀 and let us know if you have any other questions!