ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Confusion matrix and other detailed analyses on a test set #1402

Closed yasindagasan closed 3 years ago

yasindagasan commented 3 years ago

🚀 Feature

  1. Confusion matrix outlining class confusions
  2. Error analyser
  3. Plot of objects inside labels in feature space

Motivation

I am using object detection on a custom dataset. We labelled quite a number of images, and we are interested in detecting the same type of object, i.e. fractures on bones. We set some rules to assign classes to the objects based on how they look. However, some classes have quite a number of borderline examples that are easy for the model to confuse. I thought a confusion matrix would be pretty handy to assess the model performance and update my labels. I would be very keen to have these features in the codebase if possible:

1. Confusion matrix: I would like to know which classes are getting confused so that I can correct my labels or merge some of the confused classes into a single class.

2. Error analyser: It would also be good to further analyse my labels and obtain potential improvements (i.e. revising the bounding boxes to better define the object boundaries etc.). Some of the object boundaries are not clear and different people can label the boundaries differently.

3. Plotting objects in feature space for similarity and outlier analysis: I would also be very interested in identifying problematic labels in the dataset. I was thinking maybe we could use the trained weights to extract features and then do dimensionality reduction with UMAP, colored by class.

Let me know what you think. I am happy to discuss it further!

glenn-jocher commented 3 years ago

@yasindagasan thanks for the suggestions! The fastest way to introduce some of these features would be to try them out yourself and submit a PR with your suggested updates. They are good ideas, but unfortunately we are quite saturated at the moment maintaining the repo, and we will soon be backpropagating recent updates to ultralytics/yolov3.

  1. I've seen confusion matrices requested before, but I'm not sure if people realize that these are generally only created for classification tasks. It's not clear to me how this would extend to object detection. Do you have any references for confusion matrices applied explicitly to object detection results?

  2. This one may be more in the domain of labelling tools, though we have also had requests before for dataset visualization tools, which is definitely a missing feature. I wonder if we could use something like a Plotly dashboard to put together an interactive visualizer. Updating and modifying the labels would be a bit above and beyond this, but I agree a viewer at the minimum is needed.

  3. I don't quite follow, could you show some examples of this?

yasindagasan commented 3 years ago

@glenn-jocher thanks very much for the reply! Yes I agree and can definitely understand your workload!

  1. I agree with you that confusion matrices are more suitable for classification tasks. Although we might not obtain exactly the same kind of information as from a classification confusion matrix, I was thinking it could still be useful. I have come across the following code that attempts to build a confusion matrix for object detection:

Code 1 (TensorFlow), Code 2

I will create a PR if I happen to have something out of these.

  2. A Plotly dashboard seems like a good idea. We could also use other available tools, but I am not sure how easy they would be to integrate. I am currently using labelImg and CVAT for labelling datasets. CVAT is quite nice and has an option to plug in a model for auto-labelling. In terms of setup difficulty, labelImg is pretty straightforward, while CVAT needs a bit of time to configure.

  3. Sorry, it was not clear. Maybe I can explain better with an example.

This part was more about uncertain labels. I sometimes work on image classification tasks where the labels were created by people from the domain. Although the labellers understand the domain, due to the complexity of the objects in an image they can still make mistakes. Some images are on the borderline between two classes and can be difficult to differentiate with the naked eye. These examples are easy for labellers to confuse.

What we generally do to handle such cases is train a model (e.g. a ResNet-50) and use the trained weights for feature extraction. We remove the last layer of the network, and for a given image we obtain a feature vector of length 512 or 2048. To visualise similarity and detect outliers, we then reduce the dimensionality to 2 or 3 dimensions using UMAP. Points that are close together in this feature space are expected to be more similar than points that are far apart. We then colour the points by the provided labels, spot images that were labelled wrongly, and make the necessary corrections on the labels. The image below is an example of such a feature space.

[image: example feature-space plot, points coloured by class]
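
For reference, a minimal sketch of this pipeline, assuming a torchvision ResNet-50 as the feature extractor and the umap-learn package; the crops/ folder layout (one sub-folder per class) and the output filename are placeholder choices:

```python
# Sketch: extract penultimate-layer features with a pretrained ResNet-50,
# then project them to 2-D with UMAP and colour the points by label.
import torch
import torchvision.models as models
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
import umap
import matplotlib.pyplot as plt

# ResNet-50 with the final fc layer replaced by Identity -> 2048-d feature vectors
backbone = models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

transform = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                       T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
dataset = ImageFolder("crops/", transform=transform)  # one sub-folder per class
loader = DataLoader(dataset, batch_size=64, shuffle=False)

features, labels = [], []
with torch.no_grad():
    for imgs, lbls in loader:
        features.append(backbone(imgs))
        labels.append(lbls)
features = torch.cat(features).numpy()
labels = torch.cat(labels).numpy()

# 2048-d -> 2-d embedding; same-class points should cluster together,
# and isolated or misplaced points are candidates for label review
embedding = umap.UMAP(n_components=2).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=4, cmap="tab10")
plt.colorbar(label="class id")
plt.savefig("feature_space.png", dpi=200)
```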

I was wondering if you think such an approach could somehow be applicable to object detection. I previously attempted this but did not have time to experiment further. Using the bounding boxes predicted by YOLO, I cropped the objects out and saved them as separate images at a standard size using resizing and padding. I then trained a fastai model to obtain weights, did dimensionality reduction, and clustered with DBSCAN.

One of the problems I encountered was that the objects come in varying aspect ratios, and resizing them can sometimes crop them or distort the aspect ratio. I am not sure about its applicability to object detection, but I just wanted to discuss it and get your thoughts.
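
One way around the aspect-ratio problem is to pad each crop onto a square canvas instead of stretching it, similar in spirit to YOLOv5's letterboxing. A rough sketch, assuming OpenCV; the function name, 224-pixel target size, and grey padding value are arbitrary choices:

```python
import cv2
import numpy as np

def crop_and_letterbox(img, xyxy, size=224, pad_value=114):
    """Crop a detected box from an image and resize it onto a square canvas
    without changing its aspect ratio (padding instead of stretching)."""
    x1, y1, x2, y2 = map(int, xyxy)
    crop = img[max(y1, 0):y2, max(x1, 0):x2]
    h, w = crop.shape[:2]
    scale = size / max(h, w)                                   # fit the longer side
    resized = cv2.resize(crop, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.full((size, size, 3), pad_value, dtype=np.uint8)
    top = (size - resized.shape[0]) // 2
    left = (size - resized.shape[1]) // 2
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas

# e.g. crops = [crop_and_letterbox(image, box) for box in detections[:, :4]]
```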

glenn-jocher commented 3 years ago

@yasindagasan interesting. I've raised an issue on https://github.com/kaanakan/object_detection_confusion_matrix/issues/6 to ask the author if he might help with integration.

As you've noticed in 3, aspect ratio modifications from stretching and other considerations complicate box extraction and classification/detection interoperability.

We are actually working on a YOLOv5 classifier though, so this may be suitable for that. The classifier is very easy to build: it's simply a YOLOv5 backbone with a Classify() head: https://github.com/ultralytics/yolov5/blob/8d2d6d2349cc4732667888435e9f01912d80a4ba/models/common.py#L227-L237
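
For anyone curious about the general pattern (this is an illustrative sketch, not a copy of the linked module): globally average-pool the backbone feature map, apply a 1x1 convolution to the number of classes, and flatten.

```python
import torch.nn as nn

class ClassifyHead(nn.Module):
    """Sketch of a classification head on top of a detection backbone.
    Illustrative only; see models/common.py in the repo for the real Classify module."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                        # (b, c, h, w) -> (b, c, 1, 1)
        self.conv = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.flatten = nn.Flatten()                                # (b, n, 1, 1) -> (b, n)

    def forward(self, x):
        return self.flatten(self.conv(self.pool(x)))

# e.g. logits = ClassifyHead(in_channels=1024, num_classes=5)(backbone_features)
```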

Ownmarc commented 3 years ago

Your ideas 2 and 3 are in line with the suggestion I made here: https://github.com/ultralytics/yolov5/issues/895

I really think this deserves some thought. I understand this would not be of much use for increasing mAP on COCO, since you can't really change the labels, but the labels for custom models on custom datasets are almost always the place to start when you want to get better results from your YOLOv5 models.

glenn-jocher commented 3 years ago

@yasindagasan @Ownmarc 1 and 2 are definitely more closed-ended, feasible ideas that could be implemented. 1 in particular could perhaps be plotted at the same time as the PR curve, using the same TP and FP vectors. BTW, I recently updated a few plots in https://github.com/ultralytics/yolov5/pull/1432 and https://github.com/ultralytics/yolov5/pull/1428, including the PR curves and the labels plots, for better introspection.

3 is a bit more open ended, but I understand the desire for better failure mode analysis and post training introspection tools. This is somewhat in the same direction as active learning, or adapting your labels based on training feedback. I'll have to think about it.

One update for post training analysis is that you can use a confidence slider on Weights & Biases results to help you determine a best confidence threshold for deployment. This is rather new and useful, but mainly suitable for just that one task of determining a best real-world confidence threshold to use. You can see an example here (click the gear on the Media panel): https://wandb.ai/glenn-jocher/yolov5_tutorial/reports/YOLOv5-COCO128-Tutorial-Results--VmlldzozMDI5OTY

yasindagasan commented 3 years ago

@Ownmarc yes, this is very much related to your #895 suggestions. Sorry, I was not aware of that thread. I have also found that labels on custom datasets have a huge impact on success. Any improvement there would be very helpful.

@glenn-jocher I have just seen your recent updates. I like the PR curve colored by class in #1428; it is very useful!

The classifier sounds suitable, actually. I will be experimenting with it soon.

glenn-jocher commented 3 years ago

@yasindagasan @Ownmarc I've integrated a confusion matrix now into test.py. See PR https://github.com/ultralytics/yolov5/pull/1474

There's some unfortunate overlap between the computations inside the confusion matrix class and the mAP computation code, in particular that they both compute IoU matrices separately (duplication of effort), but this will have to do for now. The confusion matrix adds about 5-10 seconds of wall-clock time to test.py, i.e. a typical YOLOv5m COCO test.py run will now take 1:25, up 10 seconds from 1:15 before.
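
For readers who want the gist of how such a detection confusion matrix can be built, here is a simplified sketch of the general recipe (not the repository's implementation): match predictions to ground truths greedily by IoU, and route any unmatched boxes to an extra background row/column.

```python
import numpy as np

def box_iou(a, b):
    """IoU between two sets of (x1, y1, x2, y2) boxes -> (len(a), len(b)) matrix."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = np.maximum(a[:, None, :2], b[None, :, :2])
    rb = np.minimum(a[:, None, 2:], b[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def detection_confusion_matrix(preds, pred_cls, gts, gt_cls, num_classes, iou_thr=0.45):
    """Rows = predicted class, columns = true class; the extra last index is 'background'.
    Unmatched ground truths become background FNs, unmatched predictions background FPs."""
    m = np.zeros((num_classes + 1, num_classes + 1), dtype=int)
    iou = box_iou(preds, gts) if len(preds) and len(gts) else np.zeros((len(preds), len(gts)))
    matched_gt, matched_pred = set(), set()
    # greedy one-to-one matching, highest IoU first
    for pi, gi in sorted(zip(*np.where(iou >= iou_thr)), key=lambda x: -iou[x]):
        if pi in matched_pred or gi in matched_gt:
            continue
        matched_pred.add(pi); matched_gt.add(gi)
        m[pred_cls[pi], gt_cls[gi]] += 1
    for pi in range(len(preds)):
        if pi not in matched_pred:
            m[pred_cls[pi], num_classes] += 1   # predicted object with no matching GT
    for gi in range(len(gts)):
        if gi not in matched_gt:
            m[num_classes, gt_cls[gi]] += 1     # GT object missed entirely
    return m
```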

[image: confusion matrix generated by test.py]

yasindagasan commented 3 years ago

this is awesome thanks @glenn-jocher!

glenn-jocher commented 3 years ago

@yasindagasan I'll leave this issue open as https://github.com/ultralytics/yolov5/pull/1474 only partially satisfies the feature additions.

After considering the results a bit, I think unfortunately the conclusions you can draw from the confusion matrices in object detection may be somewhat limited, as it seems that by far the largest cross-class confusion is simply class {x} to background, regardless of x.

Still, any extra information should help everyone understand their results a bit better :)

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mansi-aggarwal-2504 commented 3 years ago

I was trying to obtain the confusion matrix using test.py after referring to this PR, and I'm using Colab. How can I obtain the matrix for my custom dataset? Do I have to run the test.py script and pass a parameter explicitly?

Edit: if the matrix is produced automatically at the end of training as mentioned here, how can I save it in case I can't visualise it in Colab? Also, I would want to produce the matrix for the test data that I pass to detect.py. I have the predictions and I have the ground truth; can someone guide me on how to do that?

glenn-jocher commented 3 years ago

@mansi-aggarwal-2504 test.py automatically generates confusion matrices. Results are logged to the directory indicated, i.e. runs/test/exp

mansi-aggarwal-2504 commented 3 years ago

@mansi-aggarwal-2504 test.py automatically generates confusion matrices. Results are logged to the directory indicated, i.e. runs/test/exp

Thank you @glenn-jocher for such a prompt response! I must have been missing something earlier; I've found it now. This must be the matrix for the validation data, and I want to do the same thing for my test data with predictions and ground truth. Shall I make a yaml file and point it at this test dataset, or is there a more efficient way? Also, I want to understand how to interpret the matrix. I have a single class, i.e. flower. [image: confusion_matrix] I understand the first column, but I don't get the second column, i.e. background FP mapped to flower with value 1.0. What does the second column mean?

glenn-jocher commented 3 years ago

@mansi-aggarwal-2504 you can run test.py on any split of your dataset (train, val, test) using the --task flag: https://github.com/ultralytics/yolov5/blob/7b36e38cf8f3d3c08e973b18913ae8e41ff970b2/test.py#L297
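
For example, with a dataset yaml that defines a test: split (custom.yaml here is a placeholder name), the command would look something like:

```bash
# evaluate on the split named under 'test:' in your dataset yaml
python test.py --weights runs/train/exp/weights/best.pt --data custom.yaml --task test
# the confusion matrix and other plots are saved to runs/test/exp*
```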

glenn-jocher commented 3 years ago

The matrix indicates that 100% of your background FPs are caused by the flower category.

mansi-aggarwal-2504 commented 3 years ago

@glenn-jocher thank you very much. I will use the --task flag.

Also:

The matrix indicates that 100% of your background FPs are caused by the flower category.

Got it, thanks!

mansi-aggarwal-2504 commented 3 years ago

@mansi-aggarwal-2504 you can run test.py on any split of your dataset (train, val, test) using the --task flag:

https://github.com/ultralytics/yolov5/blob/7b36e38cf8f3d3c08e973b18913ae8e41ff970b2/test.py#L297

I set the task to test and uploaded the ground truth for the test set. I received one confusion matrix. So is this aggregated over all the images in my test set?

[image: confusion matrix for the test split]

Is there a way to get a separate matrix for all images?

glenn-jocher commented 3 years ago

@mansi-aggarwal-2504 that is correct, a set of images generates one confusion matrix.

mansi-aggarwal-2504 commented 3 years ago

@mansi-aggarwal-2504 that is correct, a set of images generates one confusion matrix.

Is there a way to get a separate matrix for all images?

glenn-jocher commented 3 years ago

@mansi-aggarwal-2504 the confusion matrix already applies to all images.

mansi-aggarwal-2504 commented 3 years ago

@glenn-jocher but the resultant matrix is like an average over all the images in the test set, right?

glenn-jocher commented 3 years ago

@mansi-aggarwal-2504 yes one confusion matrix is generated for the entire dataset.

Mohamed-Elredeny commented 3 years ago

[image: confusion matrix] Can someone explain this to me?

glenn-jocher commented 3 years ago

@Mohamed-Elredeny see https://en.wikipedia.org/wiki/Confusion_matrix

mansi-aggarwal-2504 commented 3 years ago

I noticed that when I run test.py, the total number of objects detected is 4718, i.e. TP + FP. [image: confusion matrix]

But the total number of objects detected by detect.py was 4605. I also ran detect.py with the same conf_thres and iou_thres values as the test.py defaults, i.e. --iou-thres 0.6 --conf-thres 0.001, but the count is still different. How should I change the parameters in detect.py to get the same results as test.py? I want the same object count and TP count as test.py gives.

EDIT: 300 flowers in all images when the parameters are kept the same as in test.py.

[image: screenshot]

I also tried running test.py with --iou-thres=0.45 --conf-thres=0.25, which are the defaults for detect.py, but there is still a difference in the number of objects detected.

glenn-jocher commented 3 years ago

@mansi-aggarwal-2504 see test.py and metrics.py for TP and FP computation.

priyankadank commented 2 years ago

@mansi-aggarwal-2504 test.py automatically generates confusion matrices. Results are logged to the directory indicated, i.e. runs/test/exp

Thank you @glenn-jocher for such a prompt response! I must have been missing something earlier; I've found it now. This must be the matrix for the validation data, and I want to do the same thing for my test data with predictions and ground truth. Shall I make a yaml file and point it at this test dataset, or is there a more efficient way? Also, I want to understand how to interpret the matrix. I have a single class, i.e. flower. [image: confusion_matrix] I understand the first column, but I don't get the second column, i.e. background FP mapped to flower with value 1.0. What does the second column mean?

I am getting a similar confusion matrix. I have the following questions:

  1. I did not understand the fourth value, i.e. bottom right. Can you please explain it?
  2. How can I decrease the third value (top right), i.e. 1.0?
  3. How can I increase the fourth value, i.e. bottom right?

glenn-jocher commented 2 years ago

@priyankadank 👋 Hello! The columns are normalized (each column is divided by its total). Thanks for asking about improving YOLOv5 🚀 training results.
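
To make the normalization concrete, here is a toy single-class illustration (the counts are made up purely to show the arithmetic, they are not your data):

```python
import numpy as np

# Rows = predicted (flower, background), columns = true (flower, background).
counts = np.array([[90., 30.],   # 90 true flowers detected, 30 false detections on background
                   [10.,  0.]])  # 10 true flowers missed (predicted as background)

normalized = counts / counts.sum(axis=0, keepdims=True)  # divide each column by its total
print(normalized)
# [[0.9 1. ]
#  [0.1 0. ]]
# Flower column: 90% of true flowers were detected, 10% were missed.
# Background column: every background false positive was predicted as 'flower', hence 1.0.
# The bottom-right cell stays 0 because "background correctly predicted as background"
# is not a countable event for a detector (there are no background boxes to get right).
```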

Most of the time good results can be obtained with no changes to the models or training settings, provided your dataset is sufficiently large and well labelled. If at first you don't get good results, there are steps you might be able to take to improve, but we always recommend users first train with all default settings before considering any changes. This helps establish a performance baseline and spot areas for improvement.

If you have questions about your training results we recommend you provide the maximum amount of information possible if you expect a helpful response, including results plots (train losses, val losses, P, R, mAP), PR curve, confusion matrix, training mosaics, test results and dataset statistics images such as labels.png. All of these are located in your project/name directory, typically yolov5/runs/train/exp.

We've put together a full guide for users looking to get the best results on their YOLOv5 trainings below.

Dataset

[image: COCO analysis]

Model Selection

Larger models like YOLOv5x and YOLOv5x6 will produce better results in nearly all cases, but have more parameters, require more CUDA memory to train, and are slower to run. For mobile deployments we recommend YOLOv5s/m, for cloud deployments we recommend YOLOv5l/x. See our README table for a full comparison of all models.

[image: YOLOv5 model comparison]

Training Settings

Before modifying anything, first train with default settings to establish a performance baseline. A full list of train.py settings can be found in the train.py argparser.

Further Reading

If you'd like to know more a good place to start is Karpathy's 'Recipe for Training Neural Networks', which has great ideas for training that apply broadly across all ML domains: http://karpathy.github.io/2019/04/25/recipe/

Good luck 🍀 and let us know if you have any other questions!