something678 / TodKat

Transformer encoder-decoder for emotion detection in dialogues
MIT License
57 stars, 12 forks

Wrong Weighted-Avg F1 in All Experiments #10

Closed: MrZilinXiao closed 2 years ago

MrZilinXiao commented 2 years ago

Hi, many thanks for your great work on TodKat; it has helped us a lot. However, I found that the weighted F1 is computed incorrectly in all experiments that call metrics.f1_score, e.g. https://github.com/something678/TodKat/blob/abf6a13b8f00246773a25c6fde352a3ef3925015/src/DialogEvaluator_meld.py#L145

According to the scikit-learn documentation, f1_score takes y_true first and y_pred second, and for the weighted average the per-class weights are the class supports computed from y_true. In your experiments, pred_list is passed first, so the weights are computed from the predictions instead of the gold labels. I downloaded your pretrained weights, and once this bug is fixed, the weighted F1 drops from 68.23 to 61.28.

Logs for reference.

Before fixing the bug:
2021-12-16 11:46:30 - Accuracy: 0.6475 (1690/2610)
2021-12-16 11:46:30 - Weighted F1-macro with neutral: 0.6823 (1690/2610)
2021-12-16 11:46:30 - F1-micro with neutral: 0.6475 (1690/2610)

After fixing the bug:
2021-12-16 11:48:32 - Accuracy: 0.6475 (1690/2610)
2021-12-16 11:48:32 - Weighted F1-macro with neutral: 0.6128 (1690/2610)
2021-12-16 11:48:32 - F1-micro with neutral: 0.6475 (1690/2610)
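
For readers who want to see why the argument order matters, here is a minimal sketch. It is a pure-Python re-implementation of support-weighted F1 written for illustration only; it is not scikit-learn's code and not the repository's evaluator. The function name weighted_f1 and the toy label lists are hypothetical. Per-class F1 is symmetric in its two inputs (precision and recall simply trade places), so swapping the arguments changes only the class weights, which are taken from whichever list is passed first.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 (illustrative sketch).

    The weight of each class is its count in the FIRST argument,
    matching the convention that y_true comes first.
    """
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # class weights come from the first argument
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += support[label] * f1
    return total / len(y_true)


# Made-up toy labels: the predictions over-use class 0.
gold_list = [0, 0, 0, 0, 1, 1, 2, 2]
pred_list = [0, 0, 0, 0, 0, 0, 2, 1]

correct = weighted_f1(gold_list, pred_list)  # weights from gold labels
swapped = weighted_f1(pred_list, gold_list)  # weights from predictions (the bug)
accuracy = sum(t == p for t, p in zip(gold_list, pred_list)) / len(gold_list)

print(f"correct order: {correct:.4f}")
print(f"swapped order: {swapped:.4f}")
print(f"accuracy:      {accuracy:.4f}")
```

In this toy case the swapped call reports a higher score because the over-predicted class also has the best per-class F1, so it receives extra weight. Accuracy does not depend on the argument order, which is consistent with the Accuracy and F1-micro lines being identical in both logs above.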

something678 commented 2 years ago

Hi, I have trained a new model for MELD, please check. Thanks for the invaluable feedback.