ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com

PRECISION-RECALL CURVE #898

Closed glenn-jocher closed 3 years ago

glenn-jocher commented 4 years ago

πŸš€ Feature

Precision-recall curves may be plotted by uncommenting the code here when running test.py: https://github.com/ultralytics/yolov3/blob/1dc1761f45fe46f077694e1a70472cd7eb788e0c/utils/utils.py#L171

python3 test.py --weights yolov3-spp-ultralytics.pt --cfg yolov3-spp --conf 0.001

For yolov3-spp-ultralytics.pt on COCO, the curves for all 80 classes look like this: PR_curve

For a single class 0, or person, the curve looks like this. During testing we evaluate the area under the curve as average precision, AP. The curve should ideally go from P=1, R=0 in the top left towards P=0, R=1 at the bottom right to capture the full AP (area under the curve). By varying conf-thres you can select a single point on the curve to run your model at. Depending on your application, you may prioritize precision over recall, or vice versa. PR_curve (1)
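
As a rough illustration of the area-under-the-curve idea, here is a minimal sketch with hypothetical precision/recall points (not the repository's exact ap_per_class code):

    import numpy as np

    # Hypothetical points sampled from a PR curve, ordered by increasing recall.
    recall = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
    precision = np.array([1.0, 0.95, 0.9, 0.8, 0.6, 0.3])

    # Average precision approximated as the area under the PR curve
    # via the trapezoidal rule.
    ap = np.trapz(precision, recall)
    print(f'AP ~ {ap:.3f}')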

glenn-jocher commented 4 years ago

@TheophileBlard I'm thinking that perhaps we should plot P, R and mAP at separate --conf-thres. mAP would naturally be computed near zero (i.e. 0.001), but P and R would perhaps be reported at 0.5 --conf-thres. This would be similar to Google AutoML reported results. https://cloud.google.com/vision/automl/docs/beginners-guide?authuser=1#how_do_i_interpret_the_precision-recall_curves

0.1 confidence: [screenshot]

0.5 confidence: [screenshot]

0.9 confidence: [screenshot]
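
For intuition, here is a minimal sketch of how precision and recall shift when detections are filtered at a single confidence threshold (hypothetical detections and counts, not the repo's code):

    import numpy as np

    # Hypothetical per-detection confidences and TP flags (1 = true positive),
    # plus the total number of ground-truth objects.
    conf = np.array([0.95, 0.9, 0.8, 0.6, 0.4, 0.2, 0.1])
    is_tp = np.array([1, 1, 1, 0, 1, 0, 0])
    n_gt = 5

    for thres in (0.1, 0.5, 0.9):
        keep = conf >= thres
        n_det = keep.sum()
        tp = is_tp[keep].sum()
        precision = tp / max(n_det, 1)   # of the detections kept, how many are correct
        recall = tp / n_gt               # of the ground truths, how many are found
        print(f'conf-thres={thres}: P={precision:.2f}, R={recall:.2f}')
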
TheophileBlard commented 4 years ago

@glenn-jocher Sounds great! Current P&R curves are quite misleading, as the 0.001 threshold is defined in the code.

glenn-jocher commented 4 years ago

@TheophileBlard all done in feea9c1a65c73475803847c83545b5e7ee6c528c. Thanks for raising the issue, I think this update will help everyone! Here is a before and after run of the coco64img.data tutorial. Let me know if you see any other problems. results

rightly0716 commented 4 years ago

I may be misunderstanding the precision and recall reported at the training stage. The plot below is what I got during training (using my custom data, which has two classes: stop sign and yield sign; I used the default settings to split the data into train/val). You can see that precision, recall and mAP are very poor (I used the default conf).

results

However, when I run the test code on all the data together, as below: python3 test.py --data data/stopsigns.data --cfg cfg/yolov3-spp-stopsigns.cfg --weights weights/yolov3-spp-ultralytics-stopsigns.pt I get these results:

           Class    Images   Targets         P         R   mAP@0.5        F1
             all       554       543     0.979     0.947     0.991     0.963
        stopsign       554       276      0.97     0.938      0.99     0.954
       yieldsign       554       267     0.988     0.955     0.992     0.971

Does this mean the model overfits the dataset a lot? But when I use the model to predict on some random street pictures downloaded from the internet, the performance seems okay.

glenn-jocher commented 4 years ago

@rightly0716 testing on your training data is only useful as a sanity check. It serves no purpose in terms of checking for generalization, which is what the test set is for. Your P and R don't matter, as you select these yourself.

mAP is the metric that matters. If your training results are not to your liking, then it's time for you to experiment on ways to improve them.

rightly0716 commented 4 years ago

I see. I have only ~500 labelled images, and am wondering whether that could be a reason. I will do a deeper analysis and see.

Thanks!

glenn-jocher commented 4 years ago

@rightly0716 definitely more data would help. Also make sure you are training at an appropriate image size, and check your train.jpg and test.jpg images for correct labeling.

tinothy22 commented 4 years ago

I uncommented the plotting code in ap_per_class in utils.py as follows:

    # Plot
    fig, ax = plt.subplots(1, 1, figsize=(5, 5))
    ax.plot(recall, precision)
    ax.set_xlabel('Recall')
    ax.set_ylabel('Precision')
    ax.set_xlim(0, 1.01)
    ax.set_ylim(0, 1.01)
    fig.tight_layout()
    fig.savefig('PR_curve.png', dpi=300)

There are two classes in my dataset, but only one class appears in the PR curve graph. How can I solve this? PR_curve

glenn-jocher commented 4 years ago

@tinothy22 ah yes, I see what you mean. The plotting code is inside the for loop, so it plots one graph per class and saves it, overwriting the previous one. If you want to overlay all of your classes you must modify the plotting code a bit: create the figure before the loop, plot inside the loop as is, and then save the figure after the loop.
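
A minimal sketch of that structure, with hypothetical per-class recall/precision arrays standing in for the ones computed inside ap_per_class:

    import matplotlib.pyplot as plt
    import numpy as np

    # Hypothetical per-class PR data; in ap_per_class these would be the
    # `recall` and `precision` arrays built inside the per-class loop.
    pr_by_class = {
        'class 0': (np.linspace(0, 1, 50), np.linspace(1, 0.7, 50)),
        'class 1': (np.linspace(0, 1, 50), np.linspace(1, 0.5, 50)),
    }

    # Create the figure once, before the loop ...
    fig, ax = plt.subplots(1, 1, figsize=(5, 5))

    # ... plot every class inside the loop ...
    for name, (recall, precision) in pr_by_class.items():
        ax.plot(recall, precision, label=name)

    # ... and save a single figure after the loop.
    ax.set_xlabel('Recall')
    ax.set_ylabel('Precision')
    ax.set_xlim(0, 1.01)
    ax.set_ylim(0, 1.01)
    ax.legend()
    fig.tight_layout()
    fig.savefig('PR_curve.png', dpi=300)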

tinothy22 commented 4 years ago

Thank you, I'll try to change the code.

glenn-jocher commented 4 years ago

@tinothy22 we definitely want to add this to tensorboard output in the future, for now unfortunately this is the only way to do it.

tinothy22 commented 4 years ago

That's great! Thank you for your guidance, I have got the PR curve.

jas-nat commented 4 years ago

Hello thank you for the clear explanation. I just want to clarify my understanding of precision and recall curve threshold, as I have been reading this over and over again.

  1. Is it true that the threshold can vary for each label?
  2. In feea9c1 the PR threshold was changed to 0.5, but when I check the current code it is 0.1. Why? And where should we specify the threshold for drawing the precision-recall curve?
  3. Is it the same if this line is changed to precision = tpc / n_p? https://github.com/ultralytics/yolov3/blob/82f653b0f579db97f8908800d45e8f5287f79bd3/utils/utils.py#L177

Thanks and would like to hear your answers!

glenn-jocher commented 4 years ago

@jas-nat the curve has no threshold, it is plotted for all thresholds.
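
To make "plotted for all thresholds" concrete, here is a hedged sketch of how a PR curve is traced out by sorting detections by confidence and accumulating TP/FP counts (hypothetical data; the real ap_per_class works on matched detections per class):

    import numpy as np

    # Hypothetical detections: confidence scores, TP flags and ground-truth count.
    conf = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3])
    is_tp = np.array([1, 1, 0, 1, 0, 1, 0])
    n_gt = 6

    # Sort by descending confidence, then accumulate cumulative TP/FP counts.
    order = np.argsort(-conf)
    tpc = np.cumsum(is_tp[order])
    fpc = np.cumsum(1 - is_tp[order])

    precision = tpc / (tpc + fpc)   # one point per detection, i.e. per threshold
    recall = tpc / n_gt

    for c, p, r in zip(conf[order], precision, recall):
        print(f'threshold {c:.1f}: P={p:.2f}, R={r:.2f}')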

jas-nat commented 4 years ago

@glenn-jocher Got it. Thanks for answering!

risemeup commented 4 years ago

Hello, thank you for the clear explanation. The curves with a slider seem useful; they could help me select conf-thres. I want to know how to do this.

glenn-jocher commented 4 years ago

@risemeup you'd need to code up an interactive version of the plot above, with something like a Plotly dashboard maybe. Let me know if you come up with a solution!
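
One possible starting point, sketched with matplotlib's Slider widget (hypothetical conf/precision/recall arrays; a Plotly dashboard would work similarly):

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.widgets import Slider

    # Hypothetical arrays, e.g. dumped from ap_per_class (descending confidence).
    conf = np.linspace(0.99, 0.01, 200)
    recall = np.linspace(0.0, 1.0, 200)
    precision = 1.0 - 0.6 * recall ** 2

    fig, ax = plt.subplots(figsize=(6, 6))
    fig.subplots_adjust(bottom=0.2)
    ax.plot(recall, precision)
    point, = ax.plot([], [], 'ro')   # marker for the currently selected threshold
    ax.set_xlabel('Recall')
    ax.set_ylabel('Precision')

    # Slider that selects a conf-thres and highlights the matching curve point.
    s_ax = fig.add_axes([0.2, 0.06, 0.6, 0.03])
    slider = Slider(s_ax, 'conf-thres', 0.0, 1.0, valinit=0.5)

    def update(thres):
        i = np.argmin(np.abs(conf - thres))   # nearest confidence on the curve
        point.set_data([recall[i]], [precision[i]])
        fig.canvas.draw_idle()

    slider.on_changed(update)
    update(0.5)
    plt.show()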

jas-nat commented 4 years ago

> Hello, thank you for the clear explanation. The curves with a slider seem useful; they could help me select conf-thres. I want to know how to do this.

Hi, I want to ask how we can know the best threshold from the curve? Is it from results.txt or somewhere else?

risemeup commented 4 years ago

@jas-nat There does not seem to be such information in results.txt. The optimal threshold is near the turning point of the PR curve, which has both high precision and high recall. You can add some code in the ap_per_class function to write out every confidence along the PR curve and find the best conf-thres. So it would be very convenient if we could plot the curve with a slider.
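
For reference, a minimal sketch of writing out those (conf, precision, recall) triples so a working point can be picked by inspection (hypothetical arrays standing in for the per-class values inside ap_per_class):

    import numpy as np

    # Hypothetical per-class arrays: confidences (descending) with the
    # precision/recall reached at each confidence.
    conf = np.array([0.95, 0.90, 0.75, 0.60, 0.40, 0.20])
    precision = np.array([1.00, 1.00, 0.95, 0.90, 0.80, 0.65])
    recall = np.array([0.20, 0.35, 0.50, 0.65, 0.80, 0.90])

    # Dump every (conf, P, R) triple to a CSV for manual inspection.
    rows = np.stack([conf, precision, recall], axis=1)
    np.savetxt('pr_points.csv', rows, delimiter=',',
               header='conf,precision,recall', comments='')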

jas-nat commented 4 years ago

@risemeup I am trying to implement it. Can you guide me on how to find the best conf-thres?

I followed this tutorial, but it applied precision_recall_curve from scikit-learn. I am a little confused about finding the corresponding variables in utils.py.

Can a high F1 score indicate the best conf-thres?

glenn-jocher commented 4 years ago

@risemeup @jas-nat there is no "optimal" or "best" threshold. It is up to the user to set this however they like, depending on the compromise they desire between increasing recall and reducing FPs.
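
For those who still want a single working point, here is a minimal sketch of the max-F1 heuristic asked about above (hypothetical arrays; this is one common convention, not an "optimal" value):

    import numpy as np

    # Hypothetical per-threshold arrays, e.g. dumped from ap_per_class.
    conf = np.array([0.95, 0.90, 0.75, 0.60, 0.40, 0.20])
    precision = np.array([1.00, 1.00, 0.95, 0.90, 0.80, 0.65])
    recall = np.array([0.20, 0.35, 0.50, 0.65, 0.80, 0.90])

    # F1 is the harmonic mean of precision and recall; its peak is a
    # commonly used compromise point between the two.
    f1 = 2 * precision * recall / (precision + recall + 1e-16)
    best = int(np.argmax(f1))
    print(f'max-F1 point: conf-thres={conf[best]:.2f}, '
          f'P={precision[best]:.2f}, R={recall[best]:.2f}, F1={f1[best]:.2f}')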

risemeup commented 4 years ago

@jas-nat I visited this tutorial. It used accuracy to find the corresponding conf-thres, but there is no accuracy metric in object detection. It says "a certain binary classification metric" at the beginning of the article, so this method is not suitable. I don't think the best threshold is calculated by some formula; it depends on your project. For example, some projects need high recall and precision isn't very important, while other projects may require the opposite. So the threshold should be appropriate for your own project. @glenn-jocher thank you for your reply!

jas-nat commented 4 years ago

@glenn-jocher @risemeup Thank you for the replies!

jas-nat commented 4 years ago

Sorry, I am still trying to understand the code. https://github.com/ultralytics/yolov3/blob/bdf546150df5aaeacd1eb415b5dc830096079880/utils/utils.py#L188 In that line, as far as I understand, it creates a new interpolation point, using conf[i] for the x-axis and precision[:, 0] or recall[:, 0] for the y-axis. Am I right?

I have 2 questions:

  1. I don't see where pr_score changes in the code. Doesn't the np.interp() function need the new points as its first argument to draw the interpolation? If I missed something, let me know.
  2. When I try to print p[ci] it only shows one value. Is that the generated interpolated value?

For your information, I only train for 1 label.

risemeup commented 4 years ago

@jas-nat I will try to explain the two questions from my understanding. If I make a mistake, let me know.

  1. pr_score is set to a fixed parameter. We get a whole set of precision, recall and conf values when drawing the PR curve, but we only need one precision value to describe the current training status, so we select the precision at the point where conf equals pr_score. https://github.com/ultralytics/yolov3/blob/8241bf67bb0cc1c11634bdb4cc76e06ac072192b/utils/utils.py#L167

  2. Yes, p[ci] is generated by interpolation, as explained above.

I hope this helps. If there is anything wrong, please point it out.
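
As a concrete illustration of that interpolation step (hypothetical arrays; the negation is needed because np.interp expects increasing x values while conf is sorted in descending order):

    import numpy as np

    # Hypothetical curve data: confidences sorted in descending order with the
    # recall/precision values reached at each confidence.
    conf = np.array([0.9, 0.7, 0.5, 0.3, 0.1])
    recall = np.array([0.2, 0.4, 0.6, 0.8, 0.9])
    precision = np.array([1.0, 0.95, 0.85, 0.7, 0.5])

    pr_score = 0.1   # the fixed score P and R are reported at

    # Negate both pr_score and conf so the x values are increasing for np.interp,
    # mirroring the ap_per_class lines discussed above.
    r_at_score = np.interp(-pr_score, -conf, recall)
    p_at_score = np.interp(-pr_score, -conf, precision)
    print(p_at_score, r_at_score)   # single P and R values at conf = pr_score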

github-actions[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

goldwater668 commented 4 years ago

@glenn-jocher

  1. According to what you said above, do the P, R, mAP and F1 obtained from training on my own data have no reference value? Is there no value in getting P, R, mAP and F1 from the test? How should I evaluate the quality of the trained model?

  2. Why is conf-thres in test.py set to 0.01?

  3. I only use one category; do I need to set --single-cls?

thank you!

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

abhiagwl4262 commented 2 years ago

@glenn-jocher

  1. Why is the last row of the recall variable in the following line not returned as the recall value? Why calculate r in the next line? https://github.com/ultralytics/yolov3/blob/0f80f2f9054dd06d34c51e73ea1bc5ba808fed18/utils/metrics.py#L59-L60

  2. Why is the last row of the precision variable in the following line not returned as the precision value? Why calculate p in the next line? https://github.com/ultralytics/yolov3/blob/0f80f2f9054dd06d34c51e73ea1bc5ba808fed18/utils/metrics.py#L63-L64

glenn-jocher commented 2 years ago

@abhiagwl4262 you can put breakpoints here to see what these variables are, but recall and precision there are just very long lists of values; they need to be evaluated at a specific point, which is what the interp functions are doing.

abhiagwl4262 commented 2 years ago

@glenn-jocher I think there is an issue. Since the predictions are sorted by object confidence score, mAP comes out high even when there are many false positives, because the false positives fall at the end of the PR graph, where, due to the low sampling rate, only very few of them are considered in the AP calculation.

And what do you mean by "they need to be evaluated at a specific point"?

I have created a new issue for this discussion https://github.com/ultralytics/yolov3/issues/1890

glenn-jocher commented 2 years ago

@abhiagwl4262 P and R are returned at a specific confidence.

AjanthaAravind commented 6 months ago

@glenn-jocher I have the results of two models and I want to show a comparison of their PR curves. How can I get a comparison curve for these models, like the YOLO-generated graph?

glenn-jocher commented 6 months ago

@AjanthaAravind hello! For comparing the PR curves of two models directly within the same graph, you'll need to slightly modify the plotting code. Generally, after you run tests for both models and obtain their precision-recall data, you'll plot them using a library like Matplotlib.

Here's a streamlined approach:

  1. Save the precision and recall values for each model after running the test. You can modify test.py to output these values into a file, or use them directly if you're running interactively (see the save/load sketch after this list).

  2. Use Matplotlib to plot both curves in one graph. Here's a simple template:

    import matplotlib.pyplot as plt
    
    # Assuming pr1 and pr2 are tuples/lists containing precision and recall
    # values for model 1 and model 2 respectively: (precision, recall)
    pr1 = (precision_model1, recall_model1)
    pr2 = (precision_model2, recall_model2)
    
    plt.figure(figsize=(10, 7))
    plt.plot(pr1[1], pr1[0], label='Model 1')
    plt.plot(pr2[1], pr2[0], label='Model 2')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall Comparison')
    plt.legend()
    plt.grid(True)
    plt.show()
  3. Adjust precision_model1, recall_model1, precision_model2, recall_model2 with your actual data.
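
As a rough sketch of that save/load step (hypothetical file names and arrays, not an existing test.py option):

    import numpy as np

    # Hypothetical: `precision` and `recall` stand in for the arrays
    # produced by one evaluation run.
    precision = np.linspace(1.0, 0.6, 100)
    recall = np.linspace(0.0, 1.0, 100)

    # Save them once per model (the file names are arbitrary placeholders) ...
    np.save('model1_precision.npy', precision)
    np.save('model1_recall.npy', recall)

    # ... and reload them later when building the comparison plot above.
    precision_model1 = np.load('model1_precision.npy')
    recall_model1 = np.load('model1_recall.npy')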

This will give you a visual comparison of how each model performs in terms of precision and recall across different confidence thresholds. Hope that helps! πŸš€

AjanthaAravind commented 6 months ago

@glenn-jocher I did the same, but I can't get the same graph as the YOLOv8 PR curve design, which runs from the top left corner to the bottom right corner (from 1.0 on the y-axis to 1.0 on the x-axis), and I used 5 classes. I am getting an image like this: [image]

But I want a comparison graph like this: [screenshot]

glenn-jocher commented 6 months ago

@AjanthaAravind hey there! 😊 It looks like you're aiming for a specific style of PR curve. To achieve the PR curve similar to YOLOv8, starting from the top left to the bottom right, you'll want to ensure your precision and recall are calculated correctly for all your classes and that you're plotting them in a cumulative way if you're looking for an overall curve.

Make sure your precision (y-axis) and recall (x-axis) values range from 0 to 1. If your graph isn't stretching all the way to 1 on both axes, you might want to verify your data.

Here's a simplified example to plot a PR curve for two classes, which you can expand for your five classes:

import matplotlib.pyplot as plt

# Sample precision and recall values for two classes
precision_class1 = [0.9, 0.85, 0.83, 0.80, 0.75]
recall_class1 = [0.1, 0.2, 0.3, 0.4, 0.5]

precision_class2 = [0.95, 0.92, 0.90, 0.87, 0.83]
recall_class2 = [0.1, 0.2, 0.3, 0.4, 0.5]

plt.figure()
plt.plot(recall_class1, precision_class1, label='Class 1')
plt.plot(recall_class2, precision_class2, label='Class 2')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('PR Curve Comparison')
plt.legend()
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.grid(True)
plt.show()

For a comprehensive comparison, consider plotting each class separately or all together depending on your requirement. Remember, for the plot to span from the top left corner to the bottom right effectively, your datasets need to be complete and correctly calculated. Keep striving for the best representation! 🌟

jahid-coder commented 4 months ago

Please explain these graphs. I can't understand from them what actually happened here.

F1_curve P_curve R_curve

glenn-jocher commented 4 months ago

Hey there! Let's break down what each graph represents:

The P_curve plots precision against the confidence threshold, the R_curve plots recall against the confidence threshold, and the F1_curve plots the F1 score (the harmonic mean of precision and recall) against the confidence threshold. Each graph shows how the model's performance varies as the threshold changes; tuning this value helps you achieve the best model behaviour for your specific needs.
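
A minimal sketch of what those three plots convey, with hypothetical metric-vs-confidence data (not values from your run):

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical metric-vs-confidence data, mimicking the exported curve plots.
    conf = np.linspace(0.0, 1.0, 100)
    precision = 0.5 + 0.5 * conf       # precision tends to rise with the threshold
    recall = 1.0 - 0.9 * conf ** 2     # recall tends to fall with the threshold
    f1 = 2 * precision * recall / (precision + recall + 1e-16)

    plt.plot(conf, precision, label='P_curve')
    plt.plot(conf, recall, label='R_curve')
    plt.plot(conf, f1, label='F1_curve')
    plt.xlabel('Confidence threshold')
    plt.ylabel('Metric value')
    plt.legend()
    plt.show()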

I hope this clears things up! 😊