ruidan / IMN-E2E-ABSA

Code and dataset for the ACL 2019 paper "An Interactive Multi-Task Learning Network for End-to-End Aspect-Based Sentiment Analysis".
Apache License 2.0

calculation of sentiment F1 #1

Closed zhchen18 closed 5 years ago

zhchen18 commented 5 years ago

Hi, Ruidan. Thanks for sharing the code and dataset!

I find that in evaluation.py, the sentiment F1 is calculated as follows:

    pr_s = (p_pos + p_neg + p_neu) / 3.0
    re_s = (r_pos + r_neg + r_neu) / 3.0
    f_s = 2 * pr_s * re_s / (pr_s + re_s)

But generally macro-F1 should be calculated as:

    f_pos = 2 * p_pos * r_pos / (p_pos + r_pos + 1e-6)
    f_neg = 2 * p_neg * r_neg / (p_neg + r_neg + 1e-6)
    f_neu = 2 * p_neu * r_neu / (p_neu + r_neu + 1e-6)
    f_s = (f_pos + f_neg + f_neu) / 3.0

Is this a mistake?
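
(For concreteness, a minimal sketch of the difference between the two definitions; the per-class precision/recall values below are made up for illustration and are not taken from evaluation.py.)

    # made-up per-class precision/recall values, for illustration only
    p_pos, r_pos = 0.80, 0.40
    p_neg, r_neg = 0.50, 0.70
    p_neu, r_neu = 0.30, 0.60

    # (a) harmonic mean of the macro-averaged precision and recall
    pr_s = (p_pos + p_neg + p_neu) / 3.0
    re_s = (r_pos + r_neg + r_neu) / 3.0
    f_a = 2 * pr_s * re_s / (pr_s + re_s)

    # (b) unweighted mean of the per-class F1 scores
    f_pos = 2 * p_pos * r_pos / (p_pos + r_pos + 1e-6)
    f_neg = 2 * p_neg * r_neg / (p_neg + r_neg + 1e-6)
    f_neu = 2 * p_neu * r_neu / (p_neu + r_neu + 1e-6)
    f_b = (f_pos + f_neg + f_neu) / 3.0

    print(f_a, f_b)  # ~0.549 vs ~0.506: the two definitions generally differ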

ruidan commented 5 years ago

To compute macro-F1, the first step is to compute the macro-averaged precision and recall over the different classes. The macro F1 is then computed as the harmonic mean of these two figures. You can also find this computation described in https://medium.com/@ramit.singh.pahwa/micro-macro-precision-recall-and-f-score-44439de1a044
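
(In scikit-learn terms, the procedure described above would look roughly like the sketch below; y_true and y_pred are placeholder label arrays, not the repository's data.)

    from sklearn.metrics import precision_score, recall_score

    y_true = [0, 1, 2, 0, 1, 2]   # placeholder sentiment labels
    y_pred = [0, 1, 1, 0, 2, 2]   # placeholder predictions

    # macro-average precision and recall first, then take their harmonic mean
    macro_p = precision_score(y_true, y_pred, average='macro')
    macro_r = recall_score(y_true, y_pred, average='macro')
    f_s = 2 * macro_p * macro_r / (macro_p + macro_r)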


zhchen18 commented 5 years ago

I think macro-F1 is not calculated using "macro precision" and "macro recall". It should be the average value of F1 scores for all classes.

Please refer to the official scikit-learn documentation for the F1 score:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

"'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account."

I also ran a simple test, shown below.

>>> from sklearn import metrics
>>> import numpy as np

>>> label = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
>>> pred  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

>>> p_list, r_list, f1_list, _ = metrics.precision_recall_fscore_support(label, pred, labels=[0, 1, 2], average=None)

>>> macro_p = np.average(p_list)
>>> macro_r = np.average(r_list)
>>> he_macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
>>> he_macro_f1   # the calculation in evaluation.py: harmonic mean of macro P and macro R
0.3880936150874801

>>> my_macro_f1 = np.average(f1_list)
>>> my_macro_f1   # my calculation: unweighted mean of per-class F1
0.3861693861693862

>>> sk_macro_f1 = metrics.f1_score(label, pred, average='macro')
>>> sk_macro_f1   # scikit-learn's macro F1
0.3861693861693862

ruidan commented 5 years ago

The macro average in sklearn is implemented in the way you described, but I found many blogs and implementations that use the other way, which is what I implemented in the code. Now I feel sklearn's way is more correct after reading this: https://github.com/dice-group/gerbil/issues/87

Regarding the results in this paper, only F1-s will be affected when you apply a different procedure for calculating macro F1, as the other metrics do not involve the macro-averaging step.


zhchen18 commented 5 years ago

Yes, only F1-s will be affected; the other metrics stay the same.
Thanks again for your attention. :)