Closed by glenn-jocher 3 years ago
@TheophileBlard I'm thinking that perhaps we should report P, R and mAP at separate --conf-thres values. mAP would naturally be computed near zero (i.e. 0.001), but P and R would perhaps be reported at --conf-thres 0.5. This would be similar to Google AutoML's reported results.
https://cloud.google.com/vision/automl/docs/beginners-guide?authuser=1#how_do_i_interpret_the_precision-recall_curves
@glenn-jocher Sounds great! Current P&R curves are quite misleading, as the 0.001 threshold is defined in the code.
@TheophileBlard all done in feea9c1a65c73475803847c83545b5e7ee6c528c. Thanks for raising the issue, I think this update will help everyone! Here is a before and after run of the cooc64img.data tutorial. Let me know if you see any other problems.
I may be misunderstanding the precision and recall at the training stage. The plot below is what I got during training (using my custom data, which has two classes, stop sign and yield sign, and the default setting to split data into train/val). You can see precision, recall and mAP are all very bad (I used the default conf).
However, when I run the test code for all the data together, as below:
```shell
python3 test.py --data data/stopsigns.data --cfg cfg/yolov3-spp-stopsigns.cfg \
    --weights weights/yolov3-spp-ultralytics-stopsigns.pt
```
I got results:
```
     Class   Images  Targets      P      R  mAP@0.5     F1
       all      554      543  0.979  0.947    0.991  0.963
  stopsign      554      276   0.97  0.938     0.99  0.954
 yieldsign      554      267  0.988  0.955    0.992  0.971
```
Does it mean the model overfits the dataset a lot? But when I used the model to predict some random street pictures downloaded from internet, the performance seems okay.
@rightly0716 testing on your training data is only useful as a sanity check. It serves no purpose in terms of checking for generalization, which is what the test set is for. Your P and R don't matter, as you select these yourself.
mAP is the metric that matters. If your training results are not to your liking, then it's time for you to experiment with ways to improve them.
I see. I have only ~500 labelled examples, and am wondering whether that could be the reason. I will do a deeper analysis and see.
Thanks!
@rightly0716 definitely more data would help. Also make sure you are training at an appropriate image size, and check your train.jpg and test.jpg images for correct labeling.
I uncommented the plotting code in ap_per_class in utils, as follows:

```python
# Plot
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
ax.plot(recall, precision)
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_xlim(0, 1.01)
ax.set_ylim(0, 1.01)
fig.tight_layout()
fig.savefig('PR_curve.png', dpi=300)
```
There are two classes of my data set, but there is only one class in the PR curve graph. How can I solve it?
@tinothy22 ah yes, I see what you mean. The graph is inside the for loop, so it will plot one graph per class and save it (overwriting the previous one). If you want to overlay all of your classes you must modify the plotting code a bit: create the figure before the loop, plot inside it as-is, and then save the figure after the loop.
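The modification described above could be sketched roughly like this. The per-class arrays here are made-up placeholders purely for illustration; in `ap_per_class` they come from the cumulative TP/FP counts of each class:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend so this also runs without a display
import matplotlib.pyplot as plt

# Made-up per-class (recall, precision) arrays for illustration only
curves = {
    'stopsign':  (np.linspace(0, 1, 50), np.linspace(1.0, 0.6, 50)),
    'yieldsign': (np.linspace(0, 1, 50), np.linspace(1.0, 0.5, 50)),
}

# 1. Create the figure once, before the per-class loop
fig, ax = plt.subplots(1, 1, figsize=(5, 5))

# 2. Plot inside the loop, one line per class
for name, (recall, precision) in curves.items():
    ax.plot(recall, precision, label=name)

# 3. Label and save once, after the loop, so no class overwrites another
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_xlim(0, 1.01)
ax.set_ylim(0, 1.01)
ax.legend()
fig.tight_layout()
fig.savefig('PR_curve.png', dpi=300)
```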
Thank you, I will try to change the code.
@tinothy22 we definitely want to add this to tensorboard output in the future, for now unfortunately this is the only way to do it.
That's great! Thank you for your guidance, I have got the PR curve.
Hello thank you for the clear explanation. I just want to clarify my understanding of precision and recall curve threshold, as I have been reading this over and over again.
Is the precision here computed as `precision = tpc / n_p`? https://github.com/ultralytics/yolov3/blob/82f653b0f579db97f8908800d45e8f5287f79bd3/utils/utils.py#L177
Thanks, and I would like to hear your answers!
@jas-nat the curve has no threshold, it is plotted for all thresholds.
@glenn-jocher Got it. Thanks for answering!
Hello, thank you for the clear explanation. The curves with a slider seem useful; they could help me select conf-thres. I want to know how to do this.
@risemeup you'd need to code up an interactive version of the plot above, with something like a plotly dashboard maybe. Let me know if you come up with a solution!
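One possible sketch, using matplotlib's built-in `Slider` widget rather than a full plotly dashboard. The P/R/conf arrays below are toy stand-ins for what `ap_per_class` produces; the slider moves a marker along the PR curve to the point corresponding to the chosen conf-thres:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from matplotlib.widgets import Slider

# Toy P/R/conf arrays standing in for what ap_per_class produces
conf = np.linspace(0.999, 0.001, 200)     # descending confidences
precision = np.linspace(0.99, 0.50, 200)  # P falls as conf drops
recall = np.linspace(0.01, 0.95, 200)     # R rises as conf drops

fig, ax = plt.subplots()
plt.subplots_adjust(bottom=0.25)          # leave room for the slider
ax.plot(recall, precision)
marker, = ax.plot([], [], 'ro')           # the point picked by the slider
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')

slider_ax = plt.axes([0.15, 0.1, 0.7, 0.03])
slider = Slider(slider_ax, 'conf-thres', 0.001, 0.999, valinit=0.5)

def update(val):
    # np.interp needs increasing x, so flip the descending conf array
    p = np.interp(val, conf[::-1], precision[::-1])
    r = np.interp(val, conf[::-1], recall[::-1])
    marker.set_data([r], [p])
    fig.canvas.draw_idle()

slider.on_changed(update)
update(slider.val)  # place the marker at the initial slider position
plt.show()          # a no-op on Agg; opens the interactive window otherwise
```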
Hi, I want to ask how we can find the best threshold from the curve? Is it from results.txt or somewhere else?
@jas-nat There does not seem to be such information in results.txt. The optimal threshold is near the turning point of the PR curve, which has both high precision and high recall. You can add some code to the ap_per_class function to write out every confidence along the PR curve and find the best conf-thres. So it would be very convenient if we could plot the curve with a slider.
@risemeup I am trying to implement it. Can you guide me on how to find the best conf-thres? I followed this tutorial, but it applied `precision_recall_curve` from scikit-learn, and I am a little confused about finding the corresponding variables in utils.py. Can a high F1 score indicate the best conf-thres?
@risemeup @jas-nat there is no "optimal" or "best" threshold. It is up to the user to set this however they like, depending on the compromise they desire between increasing recall and reducing FPs.
@jas-nat I visited this tutorial. It used accuracy to find the corresponding conf-thres, but there is no accuracy in object detection. It says "a certain binary classification metric" at the beginning of the article, so this method is not suitable. I don't think the best threshold is calculated by some formula; it depends on your project. For example, some projects need high recall and precision isn't very important, while other projects may require the opposite. So the threshold should be appropriate for your own project. @glenn-jocher thank you for your reply!
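Echoing the point that no threshold is universally "best": if you still want a single automatic pick, one common heuristic (an illustration, not an official recommendation of this repo) is to take the confidence that maximizes F1, since F1 balances P and R. The arrays below are made-up placeholders; in practice they would come out of `ap_per_class`:

```python
import numpy as np

# Made-up P/R/conf arrays for illustration; in the repo these come
# from ap_per_class
conf = np.linspace(0.99, 0.01, 100)       # descending confidence
precision = np.linspace(0.98, 0.40, 100)  # P falls as conf drops
recall = np.linspace(0.05, 0.95, 100)     # R rises as conf drops

f1 = 2 * precision * recall / (precision + recall + 1e-16)  # avoid 0/0
i = f1.argmax()
print(f'max F1 = {f1[i]:.3f} at conf-thres = {conf[i]:.3f}')
```

Whether the max-F1 point is actually appropriate still depends on the project's preferred trade-off between recall and false positives.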
@glenn-jocher @risemeup Thank you for the replies!
Sorry, I am still trying to understand the code.
https://github.com/ultralytics/yolov3/blob/bdf546150df5aaeacd1eb415b5dc830096079880/utils/utils.py#L188
In that line, as far as I understand, it will create a new interpolation point, referring to `conf[i]` for the x axis and `precision[:, 0]` or `recall[:, 0]` for the y axis. Am I right?
I have 2 questions:
1. `pr_score` changes in the code. Doesn't the `np.interp()` function need the new points as its first argument to draw the interpolation? If I missed something, let me know.
2. `p[ci]` only shows 1 value. Does it mean the generated interpolated value? For your information, I only train with 1 label.
@jas-nat I will try to explain the two questions from my understanding. If I make a mistake, let me know.
1. `pr_score` is set to a fixed parameter. We get a set of precision, recall and conf values when drawing the PR curve, but we only need one precision value to describe the current training status, so we select the precision where conf-thres is set to `pr_score`. https://github.com/ultralytics/yolov3/blob/8241bf67bb0cc1c11634bdb4cc76e06ac072192b/utils/utils.py#L167
2. Yes, `p[ci]` is generated by interpolation. It should be explained above.
I hope this helps you. If there is anything wrong, please point it out.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@glenn-jocher
According to what you said above, do the P, R, mAP and F1 obtained from training on my own data have no reference value? Is there no value in the P, R, mAP and F1 from the test? How should I evaluate the quality of the trained model?
Why is conf-thres set to 0.01 in test.py?
I only use one category; do I need to set --single-cls?
Thank you!
@glenn-jocher
Why is the last row of the recall variable in the following line not returned as the recall value? Why calculate r in the next line? https://github.com/ultralytics/yolov3/blob/0f80f2f9054dd06d34c51e73ea1bc5ba808fed18/utils/metrics.py#L59-L60
Why is the last row of the precision variable in the following line not returned as the precision value? Why calculate p in the next line? https://github.com/ultralytics/yolov3/blob/0f80f2f9054dd06d34c51e73ea1bc5ba808fed18/utils/metrics.py#L63-L64
@abhiagwl4262 you can put breakpoints here to see what these variables are, but recall and precision there are just very long lists of values; they need to be evaluated at a specific point, which is what the interp functions are doing.
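"Evaluated at a specific point" can be sketched with toy numbers like this. Since `conf` is sorted in decreasing order and `np.interp` requires increasing x values, both sides are negated before interpolating; the arrays here are illustrative, not taken from a real run:

```python
import numpy as np

# Toy curves: conf sorted in decreasing order (as in ap_per_class),
# with the cumulative recall/precision value at each confidence
conf = np.array([0.9, 0.7, 0.5, 0.3, 0.1])
recall_curve = np.array([0.2, 0.4, 0.6, 0.8, 0.9])
precision_curve = np.array([0.95, 0.9, 0.8, 0.6, 0.4])

pr_score = 0.1  # the single confidence we want P and R reported at

# np.interp requires increasing x, so negate the descending conf array
r = np.interp(-pr_score, -conf, recall_curve)
p = np.interp(-pr_score, -conf, precision_curve)
print(p, r)  # P and R of the model run at conf-thres = pr_score
```

So instead of returning the whole curve, a single P and a single R are read off at the chosen confidence.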
@glenn-jocher I think there is an issue. As you are sorting the predictions by object confidence score, the mAP comes out high even when there are too many false positives, because the false positives fall at the end of the PR graph, where, due to the low sampling rate, only very few false positives are considered in the AP calculation.
And what do you mean by "need to be evaluated at a specific point"?
I have created a new issue for this discussion https://github.com/ultralytics/yolov3/issues/1890
@abhiagwl4262 P and R are returned at a specific confidence.
@glenn-jocher I have the results of two models. I want to show the comparison of the PR curve graph. How to get a comparison curve for these models like a Yolo-generated graph?
@AjanthaAravind hello! For comparing the PR curves of two models directly within the same graph, you'll need to slightly modify the plotting code. Generally, after you run tests for both models and obtain their precision-recall data, you'll plot them using a library like Matplotlib.
Here's a streamlined approach:
Save the precision and recall values for each model after running the test. You can modify test.py to output these values into a file, or use them directly if you're running interactively.
Use Matplotlib to plot both curves in one graph. Here's a simple template:
```python
import matplotlib.pyplot as plt

# Assuming pr1 and pr2 are tuples/lists containing precision and recall
# values for model 1 and model 2 respectively: (precision, recall)
pr1 = (precision_model1, recall_model1)
pr2 = (precision_model2, recall_model2)

plt.figure(figsize=(10, 7))
plt.plot(pr1[1], pr1[0], label='Model 1')
plt.plot(pr2[1], pr2[0], label='Model 2')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Comparison')
plt.legend()
plt.grid(True)
plt.show()
```
Adjust `precision_model1`, `recall_model1`, `precision_model2`, and `recall_model2` with your actual data.
This will give you a visual comparison of how each model performs in terms of precision and recall across different confidence thresholds. Hope that helps!
@glenn-jocher I did the same, but I can't get the same graph as the YOLOv8 PR curve design, going from the top left corner to the bottom right corner (from 1.0 on the y axis to 1.0 on the x axis), and I used 5 classes. I am getting an image like this.
but i want the comparison graph like this
@AjanthaAravind hey there! It looks like you're aiming for a specific style of PR curve. To achieve a PR curve similar to YOLOv8's, starting from the top left to the bottom right, you'll want to ensure your precision and recall are calculated correctly for all your classes and that you're plotting them cumulatively if you're looking for an overall curve.
Make sure your precision (y-axis) and recall (x-axis) values range from 0 to 1. If your graph isn't stretching all the way to 1 on both axes, you might want to verify your data.
Here's a simplified example to plot a PR curve for two classes, which you can expand for your five classes:
```python
import matplotlib.pyplot as plt

# Sample precision and recall values for two classes
precision_class1 = [0.9, 0.85, 0.83, 0.80, 0.75]
recall_class1 = [0.1, 0.2, 0.3, 0.4, 0.5]
precision_class2 = [0.95, 0.92, 0.90, 0.87, 0.83]
recall_class2 = [0.1, 0.2, 0.3, 0.4, 0.5]

plt.figure()
plt.plot(recall_class1, precision_class1, label='Class 1')
plt.plot(recall_class2, precision_class2, label='Class 2')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('PR Curve Comparison')
plt.legend()
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.grid(True)
plt.show()
```
For a comprehensive comparison, consider plotting each class separately or all together depending on your requirement. Remember, for the plot to span from the top left corner to the bottom right effectively, your datasets need to be complete and correctly calculated. Keep striving for the best representation!
Please explain those graphs. I can't understand from them what actually happened here.
Hey there! Let's break down what each graph represents:
F1 Curve: This shows the F1 score, which is a balance between precision and recall. It combines both values to give a single score that helps evaluate the overall performance of the model. A higher F1 score indicates better performance.
P Curve: Represents the Precision curve. Precision is the ratio of correctly predicted positive observations to the total predicted positives. High precision relates to a low false positive rate.
R Curve: This is the Recall curve. Recall is the ratio of correctly predicted positive observations to all observations in the actual class (the true positive rate).
Each graph likely plots these metrics against different thresholds or parameter settings, showing how the model's performance varies. In machine learning, tuning these values can help you achieve the best model performance for your specific needs.
I hope this clears things up!
🚀 Feature
Precision Recall curves may be plotted by uncommenting code here when running test.py: https://github.com/ultralytics/yolov3/blob/1dc1761f45fe46f077694e1a70472cd7eb788e0c/utils/utils.py#L171
For yolov3-spp-ultralytics.pt on COCO, the curves for all 80 classes look like this:
For a single class 0, or person, the curve looks like this. During testing we evaluate the area under the curve as average precision, AP. The curve should ideally go from P=1, R=0 in the top left towards P=0, R=1 at the bottom right to capture the full AP (area under the curve). By varying conf-thres you can select a single point on the curve to run your model at. Depending on your application, you may prioritize precision over recall, or vice versa.
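The "area under the curve" can be sketched with toy numbers as follows. This is a simplified all-points integration scheme, similar in spirit to (but not a copy of) the repo's compute_ap(): sentinel values close the curve, precision is made monotonically decreasing, and the area is summed wherever recall changes:

```python
import numpy as np

def ap_sketch(recall, precision):
    """Area under the PR curve (simplified all-points scheme)."""
    # Append sentinel values so the curve spans R=0 (P=1) to R=1 (P=0)
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    # Make precision monotonically decreasing (the precision "envelope")
    p = np.flip(np.maximum.accumulate(np.flip(p)))
    # Sum the rectangles wherever recall changes
    i = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[i + 1] - r[i]) * p[i + 1])

# Toy curve: recall rises while precision falls, as in a real PR sweep
ap = ap_sketch(np.array([0.2, 0.5, 0.9]), np.array([1.0, 0.8, 0.5]))
print(round(ap, 3))
```

A curve that reaches all the way from P=1, R=0 to P=0, R=1 encloses more area, which is why the ideal shape described above captures the full AP.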