Closed: zhchen18 closed this issue 5 years ago
To compute macro-F1, the first step is to compute the macro-averaged precision and recall across the different classes. The macro-F1 is then computed as the harmonic mean of these two figures. You can also find how it is computed in https://medium.com/@ramit.singh.pahwa/micro-macro-precision-recall-and-f-score-44439de1a044
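A minimal sketch of the procedure described above, using hypothetical per-class precision and recall values (the variable names mirror those quoted from evaluation.py later in this thread):

```python
# Hypothetical per-class precision/recall values for pos, neg, neu.
p_pos, p_neg, p_neu = 0.80, 0.60, 0.40
r_pos, r_neg, r_neu = 0.70, 0.50, 0.30

# Step 1: macro-averaged precision and recall across the three classes.
pr_s = (p_pos + p_neg + p_neu) / 3.0
re_s = (r_pos + r_neg + r_neu) / 3.0

# Step 2: harmonic mean of the two macro averages.
f_s = 2 * pr_s * re_s / (pr_s + re_s)
```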
On Fri, 2 Aug 2019, 12:50 PM Zhuang Chen, notifications@github.com wrote:
Hi, Ruidan. Thanks for sharing the code and dataset!
I find that in evaluation.py, the calculation of sentiment F1 is as follows:
pr_s = (p_pos + p_neg + p_neu) / 3.0
re_s = (r_pos + r_neg + r_neu) / 3.0
f_s = 2 * pr_s * re_s / (pr_s + re_s)
But generally the calculation of macro-F1 should be:
f_pos = 2 * p_pos * r_pos / (p_pos + r_pos + 1e-6)
f_neg = 2 * p_neg * r_neg / (p_neg + r_neg + 1e-6)
f_neu = 2 * p_neu * r_neu / (p_neu + r_neu + 1e-6)
f_s = (f_pos + f_neg + f_neu) / 3.0
Is this a mistake?
I think macro-F1 is not calculated using "macro precision" and "macro recall". It should be the average value of F1 scores for all classes.
Please refer to the official scikit-learn documentation on calculating the F1 score.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
'macro': "Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account."
I also ran a simple test, shown below.
>>> from sklearn import metrics
>>> import numpy as np
>>> label = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
>>> pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
>>> p_list, r_list, f1_list, _ = metrics.precision_recall_fscore_support(label, pred, labels=[0, 1, 2], average=None)
>>> macro_p = np.average(p_list)
>>> macro_r = np.average(r_list)
>>> he_macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
>>> he_macro_f1  # "he" (harmonic-mean) calculation
0.3880936150874801
>>> my_macro_f1 = np.average(f1_list)
>>> my_macro_f1  # "my" (average of per-class F1) calculation
0.3861693861693862
>>> sk_macro_f1 = metrics.f1_score(label, pred, average='macro')
>>> sk_macro_f1  # sklearn's built-in macro calculation
0.3861693861693862
The macro average in sklearn is implemented in the way you described. But I found many blogs and implementations using the other way, as I implemented in the code. Now I feel sklearn is more correct after reading this: https://github.com/dice-group/gerbil/issues/87
Regarding the results in this paper, only F1-s will be affected when you apply a different procedure for calculating macro-F1, as the other metrics do not include a macro-averaging step.
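Under the sklearn-style definition, the sentiment F1 would be computed per class first and averaged afterwards. A minimal sketch, reusing the variable names from the snippet quoted above with hypothetical precision/recall values, showing that the two procedures give slightly different numbers:

```python
# Hypothetical per-class precision/recall values for pos, neg, neu.
p_pos, r_pos = 0.80, 0.70
p_neg, r_neg = 0.60, 0.50
p_neu, r_neu = 0.40, 0.30

# sklearn-style macro-F1: per-class F1 first (the 1e-6 guards against
# division by zero), then the unweighted mean.
f_pos = 2 * p_pos * r_pos / (p_pos + r_pos + 1e-6)
f_neg = 2 * p_neg * r_neg / (p_neg + r_neg + 1e-6)
f_neu = 2 * p_neu * r_neu / (p_neu + r_neu + 1e-6)
f_s = (f_pos + f_neg + f_neu) / 3.0

# Harmonic mean of macro precision and macro recall, for comparison.
pr_s = (p_pos + p_neg + p_neu) / 3.0
re_s = (r_pos + r_neg + r_neu) / 3.0
f_s_he = 2 * pr_s * re_s / (pr_s + re_s)
```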
Yes, only F1-s will be affected and other metrics are still the same.
Thanks again for your attention. : )