Issue reproducing the results of BERT model

Glovesme commented 4 years ago

Hi all,

I tried to reproduce the results of the paper, but I observe a drop of roughly 4-5 percentage points in macro-average F1 for the TESTED POSITIVE event. The paper reports a macro-average F1 of 54.9% for this event (averaged over its 10 slots), whereas with the preprocessing scripts and the code for the multitask BERT model I obtain 50.2%. I was able to collect 7149 tweets. Could the gap be due to my dataset being slightly different? Even so, 4 percentage points seems like quite a lot. Is anybody else experiencing the same issue?

Best, Xiangyu

s4zong commented 4 years ago

Hello,

Could you paste your per-slot experimental results (along with the number of slots evaluated) here? There is a copy of the experimental results under the /result folder.

Thanks,

Glovesme commented 4 years ago

Hello,

This is the copy of the experimental results for event TESTED POSITIVE: {"best_dev_threshold": {"age": 0.2, "close_contact": 0.3, "employer": 0.1, "gender_male": 0.9, "gender_female": 0.8, "name": 0.7, "recent_travel": 0.1, "relation": 0.2, "when": 0.1, "where": 0.8}, "best_dev_F1s": {"age": 0.6896551724137931, "close_contact": 0.3902439024390244, "employer": 0.4166666666666667, "gender_male": 0.7226890756302522, "gender_female": 0.6037735849056604, "name": 0.7349177330895795, "recent_travel": 0.36, "relation": 0.8, "when": 0.4615384615384615, "where": 0.5193370165745856}, "dev_t_F1_P_Rs": {"age": [[0.1, 0.6666666666666666, 0.7142857142857143, 0.625, 16.0, 10.0, 4.0, 6.0], [0.2, 0.6896551724137931, 0.7692307692307693, 0.625, 16.0, 10.0, 3.0, 6.0], [0.3, 0.4800000000000001, 0.6666666666666666, 0.375, 16.0, 6.0, 3.0, 10.0], [0.4, 0.4166666666666667, 0.625, 0.3125, 16.0, 5.0, 3.0, 11.0], [0.5, 0.43478260869565216, 0.7142857142857143, 0.3125, 16.0, 5.0, 2.0, 11.0], [0.6, 0.43478260869565216, 0.7142857142857143, 0.3125, 16.0, 5.0, 2.0, 11.0], [0.7, 0.45454545454545453, 0.8333333333333334, 0.3125, 16.0, 5.0, 1.0, 11.0], [0.8, 0.38095238095238093, 0.8, 0.25, 16.0, 4.0, 1.0, 12.0], [0.9, 0.3157894736842105, 1.0, 0.1875, 16.0, 3.0, 0.0, 13.0]], "close_contact": [[0.1, 0.29629629629629634, 0.23529411764705882, 0.4, 20.0, 8.0, 26.0, 12.0], [0.2, 0.3404255319148936, 0.2962962962962963, 0.4, 20.0, 8.0, 19.0, 12.0], [0.3, 0.3902439024390244, 0.38095238095238093, 0.4, 20.0, 8.0, 13.0, 12.0], [0.4, 0.3157894736842105, 0.3333333333333333, 0.3, 20.0, 6.0, 12.0, 14.0], [0.5, 0.2777777777777778, 0.3125, 0.25, 20.0, 5.0, 11.0, 15.0], [0.6, 0.2285714285714286, 0.26666666666666666, 0.2, 20.0, 4.0, 11.0, 16.0], [0.7, 0.2, 0.3, 0.15, 20.0, 3.0, 7.0, 17.0], [0.8, 0.20689655172413793, 0.3333333333333333, 0.15, 20.0, 3.0, 6.0, 17.0], [0.9, 0.14285714285714288, 0.25, 0.1, 20.0, 2.0, 6.0, 18.0]], "employer": [[0.1, 0.4166666666666667, 0.42857142857142855, 0.40540540540540543, 37.0, 15.0, 20.0, 
22.0], [0.2, 0.4000000000000001, 0.42424242424242425, 0.3783783783783784, 37.0, 14.0, 19.0, 23.0], [0.3, 0.38235294117647056, 0.41935483870967744, 0.35135135135135137, 37.0, 13.0, 18.0, 24.0], [0.4, 0.3880597014925374, 0.43333333333333335, 0.35135135135135137, 37.0, 13.0, 17.0, 24.0], [0.5, 0.3692307692307692, 0.42857142857142855, 0.32432432432432434, 37.0, 12.0, 16.0, 25.0], [0.6, 0.375, 0.4444444444444444, 0.32432432432432434, 37.0, 12.0, 15.0, 25.0], [0.7, 0.38095238095238093, 0.46153846153846156, 0.32432432432432434, 37.0, 12.0, 14.0, 25.0], [0.8, 0.38596491228070173, 0.55, 0.2972972972972973, 37.0, 11.0, 9.0, 26.0], [0.9, 0.36363636363636365, 0.5555555555555556, 0.2702702702702703, 37.0, 10.0, 8.0, 27.0]], "gender_male": [[0.1, 0.6293706293706294, 0.5487804878048781, 0.7377049180327869, 61.0, 45.0, 37.0, 16.0], [0.2, 0.6666666666666667, 0.6081081081081081, 0.7377049180327869, 61.0, 45.0, 29.0, 16.0], [0.3, 0.6766917293233082, 0.625, 0.7377049180327869, 61.0, 45.0, 27.0, 16.0], [0.4, 0.6870229007633588, 0.6428571428571429, 0.7377049180327869, 61.0, 45.0, 25.0, 16.0], [0.5, 0.6870229007633588, 0.6428571428571429, 0.7377049180327869, 61.0, 45.0, 25.0, 16.0], [0.6, 0.6976744186046512, 0.6617647058823529, 0.7377049180327869, 61.0, 45.0, 23.0, 16.0], [0.7, 0.703125, 0.6716417910447762, 0.7377049180327869, 61.0, 45.0, 22.0, 16.0], [0.8, 0.7154471544715446, 0.7096774193548387, 0.7213114754098361, 61.0, 44.0, 18.0, 17.0], [0.9, 0.7226890756302522, 0.7413793103448276, 0.7049180327868853, 61.0, 43.0, 15.0, 18.0]], "gender_female": [[0.1, 0.5625, 0.5454545454545454, 0.5806451612903226, 31.0, 18.0, 15.0, 13.0], [0.2, 0.5901639344262295, 0.6, 0.5806451612903226, 31.0, 18.0, 12.0, 13.0], [0.3, 0.5901639344262295, 0.6, 0.5806451612903226, 31.0, 18.0, 12.0, 13.0], [0.4, 0.6000000000000001, 0.6206896551724138, 0.5806451612903226, 31.0, 18.0, 11.0, 13.0], [0.5, 0.5762711864406779, 0.6071428571428571, 0.5483870967741935, 31.0, 17.0, 11.0, 14.0], [0.6, 0.5762711864406779, 
0.6071428571428571, 0.5483870967741935, 31.0, 17.0, 11.0, 14.0], [0.7, 0.5714285714285714, 0.64, 0.5161290322580645, 31.0, 16.0, 9.0, 15.0], [0.8, 0.6037735849056604, 0.7272727272727273, 0.5161290322580645, 31.0, 16.0, 6.0, 15.0], [0.9, 0.6037735849056604, 0.7272727272727273, 0.5161290322580645, 31.0, 16.0, 6.0, 15.0]], "name": [[0.1, 0.7219343696027634, 0.7013422818791947, 0.7437722419928826, 281.0, 209.0, 89.0, 72.0], [0.2, 0.7218309859154929, 0.7142857142857143, 0.7295373665480427, 281.0, 205.0, 82.0, 76.0], [0.3, 0.7208480565371026, 0.7157894736842105, 0.7259786476868327, 281.0, 204.0, 81.0, 77.0], [0.4, 0.725, 0.7275985663082437, 0.7224199288256228, 281.0, 203.0, 76.0, 78.0], [0.5, 0.7266187050359713, 0.7345454545454545, 0.7188612099644128, 281.0, 202.0, 73.0, 79.0], [0.6, 0.7305605786618444, 0.7426470588235294, 0.7188612099644128, 281.0, 202.0, 70.0, 79.0], [0.7, 0.7349177330895795, 0.7556390977443609, 0.7153024911032029, 281.0, 201.0, 65.0, 80.0], [0.8, 0.7333333333333334, 0.7644787644787645, 0.7046263345195729, 281.0, 198.0, 61.0, 83.0], [0.9, 0.7200000000000001, 0.7745901639344263, 0.6725978647686833, 281.0, 189.0, 55.0, 92.0]], "recent_travel": [[0.1, 0.36, 0.47368421052631576, 0.2903225806451613, 31.0, 9.0, 10.0, 22.0], [0.2, 0.32653061224489793, 0.4444444444444444, 0.25806451612903225, 31.0, 8.0, 10.0, 23.0], [0.3, 0.29166666666666663, 0.4117647058823529, 0.22580645161290322, 31.0, 7.0, 10.0, 24.0], [0.4, 0.26086956521739135, 0.4, 0.1935483870967742, 31.0, 6.0, 9.0, 25.0], [0.5, 0.26666666666666666, 0.42857142857142855, 0.1935483870967742, 31.0, 6.0, 8.0, 25.0], [0.6, 0.2857142857142857, 0.5454545454545454, 0.1935483870967742, 31.0, 6.0, 5.0, 25.0], [0.7, 0.19999999999999998, 0.4444444444444444, 0.12903225806451613, 31.0, 4.0, 5.0, 27.0], [0.8, 0.15384615384615383, 0.375, 0.0967741935483871, 31.0, 3.0, 5.0, 28.0], [0.9, 0.1081081081081081, 0.3333333333333333, 0.06451612903225806, 31.0, 2.0, 4.0, 29.0]], "relation": [[0.1, 0.6666666666666666, 0.5, 1.0, 
2.0, 2.0, 2.0, 0.0], [0.2, 0.8, 0.6666666666666666, 1.0, 2.0, 2.0, 1.0, 0.0], [0.3, 0.8, 0.6666666666666666, 1.0, 2.0, 2.0, 1.0, 0.0], [0.4, 0.8, 0.6666666666666666, 1.0, 2.0, 2.0, 1.0, 0.0], [0.5, 0.8, 0.6666666666666666, 1.0, 2.0, 2.0, 1.0, 0.0], [0.6, 0.5, 0.5, 0.5, 2.0, 1.0, 1.0, 1.0], [0.7, 0.5, 0.5, 0.5, 2.0, 1.0, 1.0, 1.0], [0.8, 0.5, 0.5, 0.5, 2.0, 1.0, 1.0, 1.0], [0.9, 0.5, 0.5, 0.5, 2.0, 1.0, 1.0, 1.0]], "when": [[0.1, 0.4615384615384615, 0.4, 0.5454545454545454, 11.0, 6.0, 9.0, 5.0], [0.2, 0.43478260869565216, 0.4166666666666667, 0.45454545454545453, 11.0, 5.0, 7.0, 6.0], [0.3, 0.380952380952381, 0.4, 0.36363636363636365, 11.0, 4.0, 6.0, 7.0], [0.4, 0.3157894736842105, 0.375, 0.2727272727272727, 11.0, 3.0, 5.0, 8.0], [0.5, 0.2222222222222222, 0.2857142857142857, 0.18181818181818182, 11.0, 2.0, 5.0, 9.0], [0.6, 0.23529411764705885, 0.3333333333333333, 0.18181818181818182, 11.0, 2.0, 4.0, 9.0], [0.7, 0.23529411764705885, 0.3333333333333333, 0.18181818181818182, 11.0, 2.0, 4.0, 9.0], [0.8, 0.23529411764705885, 0.3333333333333333, 0.18181818181818182, 11.0, 2.0, 4.0, 9.0], [0.9, 0.25000000000000006, 0.4, 0.18181818181818182, 11.0, 2.0, 3.0, 9.0]], "where": [[0.1, 0.48148148148148145, 0.416, 0.5714285714285714, 91.0, 52.0, 73.0, 39.0], [0.2, 0.4975609756097561, 0.4473684210526316, 0.5604395604395604, 91.0, 51.0, 63.0, 40.0], [0.3, 0.5025125628140703, 0.46296296296296297, 0.5494505494505495, 91.0, 50.0, 58.0, 41.0], [0.4, 0.5076142131979696, 0.4716981132075472, 0.5494505494505495, 91.0, 50.0, 56.0, 41.0], [0.5, 0.5077720207253886, 0.4803921568627451, 0.5384615384615384, 91.0, 49.0, 53.0, 42.0], [0.6, 0.5157894736842106, 0.494949494949495, 0.5384615384615384, 91.0, 49.0, 50.0, 42.0], [0.7, 0.5161290322580645, 0.5052631578947369, 0.5274725274725275, 91.0, 48.0, 47.0, 43.0], [0.8, 0.5193370165745856, 0.5222222222222223, 0.5164835164835165, 91.0, 47.0, 43.0, 44.0], [0.9, 0.5084745762711864, 0.5232558139534884, 0.4945054945054945, 91.0, 45.0, 41.0, 46.0]]}, "age": 
{"CM": [[5813, 0], [13, 2]], "Classification Report": {"0": {"precision": 0.9977686234122898, "recall": 1.0, "f1-score": 0.9988830655554601, "support": 5813}, "1": {"precision": 1.0, "recall": 0.13333333333333333, "f1-score": 0.23529411764705882, "support": 15}, "accuracy": 0.9977693891557996, "macro avg": {"precision": 0.9988843117061449, "recall": 0.5666666666666667, "f1-score": 0.6170885916012595, "support": 5828}, "weighted avg": {"precision": 0.9977743664886136, "recall": 0.9977693891557996, "f1-score": 0.9969177542619416, "support": 5828}}, "SQuAD_EM": 98.0, "SQuAD_F1": 98.0, "SQuAD_total": 600.0, "SQuAD_Pos. EM": 14.285714285714286, "SQuAD_Pos. F1": 14.285714285714286, "SQuAD_Pos. EM_F1_total": 14.0, "F1": 0.4, "P": 0.8, "R": 0.26666666666666666, "TP": 4.0, "FP": 1.0, "FN": 11.0, "N": 15.0}, "close_contact": {"CM": [[5769, 20], [27, 12]], "Classification Report": {"0": {"precision": 0.9953416149068323, "recall": 0.9965451718776991, "f1-score": 0.9959430297798878, "support": 5789}, "1": {"precision": 0.375, "recall": 0.3076923076923077, "f1-score": 0.3380281690140845, "support": 39}, "accuracy": 0.9919354838709677, "macro avg": {"precision": 0.6851708074534162, "recall": 0.6521187397850035, "f1-score": 0.6669855993969862, "support": 5828}, "weighted avg": {"precision": 0.9911903927068724, "recall": 0.9919354838709677, "f1-score": 0.9915403737109334, "support": 5828}}, "SQuAD_EM": 94.0495867768595, "SQuAD_F1": 94.0495867768595, "SQuAD_total": 605.0, "SQuAD_Pos. EM": 38.70967741935484, "SQuAD_Pos. F1": 38.70967741935484, "SQuAD_Pos. 
EM_F1_total": 31.0, "F1": 0.31707317073170727, "P": 0.3023255813953488, "R": 0.3333333333333333, "TP": 13.0, "FP": 30.0, "FN": 26.0, "N": 39.0}, "employer": {"CM": [[5727, 24], [54, 23]], "Classification Report": {"0": {"precision": 0.9906590555267255, "recall": 0.9958268127282212, "f1-score": 0.9932362122788762, "support": 5751}, "1": {"precision": 0.48936170212765956, "recall": 0.2987012987012987, "f1-score": 0.37096774193548393, "support": 77}, "accuracy": 0.9866163349347975, "macro avg": {"precision": 0.7400103788271926, "recall": 0.64726405571476, "f1-score": 0.68210197710718, "support": 5828}, "weighted avg": {"precision": 0.9840358749825031, "recall": 0.9866163349347975, "f1-score": 0.9850147517063914, "support": 5828}}, "SQuAD_EM": 90.60955518945634, "SQuAD_F1": 90.93172249679662, "SQuAD_total": 607.0, "SQuAD_Pos. EM": 38.095238095238095, "SQuAD_Pos. F1": 41.19929453262787, "SQuAD_Pos. EM_F1_total": 63.0, "F1": 0.4246575342465754, "P": 0.4492753623188406, "R": 0.4025974025974026, "TP": 31.0, "FP": 38.0, "FN": 46.0, "N": 77.0}, "gender_male": {"CM": [[5661, 41], [45, 81]], "Classification Report": {"0": {"precision": 0.9921135646687698, "recall": 0.992809540512101, "f1-score": 0.992461430575035, "support": 5702}, "1": {"precision": 0.6639344262295082, "recall": 0.6428571428571429, "f1-score": 0.6532258064516129, "support": 126}, "accuracy": 0.9852436513383666, "macro avg": {"precision": 0.828023995449139, "recall": 0.817833341684622, "f1-score": 0.8228436185133239, "support": 5828}, "weighted avg": {"precision": 0.9850184082783534, "recall": 0.9852436513383666, "f1-score": 0.9851272355442267, "support": 5828}}, "SQuAD_EM": 88.56209150326798, "SQuAD_F1": 89.35457516339872, "SQuAD_total": 612.0, "SQuAD_Pos. EM": 67.5, "SQuAD_Pos. F1": 71.54166666666667, "SQuAD_Pos. 
EM_F1_total": 120.0, "F1": 0.6486486486486486, "P": 0.75, "R": 0.5714285714285714, "TP": 72.0, "FP": 24.0, "FN": 54.0, "N": 126.0}, "gender_female": {"CM": [[5771, 10], [17, 30]], "Classification Report": {"0": {"precision": 0.9970628887353145, "recall": 0.9982701954679122, "f1-score": 0.9976661768519319, "support": 5781}, "1": {"precision": 0.75, "recall": 0.6382978723404256, "f1-score": 0.6896551724137931, "support": 47}, "accuracy": 0.9953671928620453, "macro avg": {"precision": 0.8735314443676572, "recall": 0.8182840339041688, "f1-score": 0.8436606746328625, "support": 5828}, "weighted avg": {"precision": 0.9950704460842231, "recall": 0.9953671928620453, "f1-score": 0.9951822171387211, "support": 5828}}, "SQuAD_EM": 96.69421487603306, "SQuAD_F1": 96.91263282172373, "SQuAD_total": 605.0, "SQuAD_Pos. EM": 68.18181818181819, "SQuAD_Pos. F1": 71.18506493506494, "SQuAD_Pos. EM_F1_total": 44.0, "F1": 0.6829268292682927, "P": 0.8, "R": 0.5957446808510638, "TP": 28.0, "FP": 7.0, "FN": 19.0, "N": 47.0}, "name": {"CM": [[5254, 103], [140, 331]], "Classification Report": {"0": {"precision": 0.9740452354467928, "recall": 0.9807728206085495, "f1-score": 0.9773974513998698, "support": 5357}, "1": {"precision": 0.7626728110599078, "recall": 0.70276008492569, "f1-score": 0.7314917127071824, "support": 471}, "accuracy": 0.9583047357584077, "macro avg": {"precision": 0.8683590232533502, "recall": 0.8417664527671198, "f1-score": 0.854444582053526, "support": 5828}, "weighted avg": {"precision": 0.9569628037573242, "recall": 0.9583047357584077, "f1-score": 0.9575241495940606, "support": 5828}}, "SQuAD_EM": 72.98507462686567, "SQuAD_F1": 77.11334271035766, "SQuAD_total": 670.0, "SQuAD_Pos. EM": 69.4560669456067, "SQuAD_Pos. F1": 75.24255149778163, "SQuAD_Pos. 
EM_F1_total": 478.0, "F1": 0.7318181818181818, "P": 0.7872860635696821, "R": 0.6836518046709129, "TP": 322.0, "FP": 87.0, "FN": 149.0, "N": 471.0}, "recent_travel": {"CM": [[5791, 3], [26, 8]], "Classification Report": {"0": {"precision": 0.9955303421007392, "recall": 0.9994822229892992, "f1-score": 0.9975023684437172, "support": 5794}, "1": {"precision": 0.7272727272727273, "recall": 0.23529411764705882, "f1-score": 0.3555555555555555, "support": 34}, "accuracy": 0.9950240219629375, "macro avg": {"precision": 0.8614015346867332, "recall": 0.617388170318179, "f1-score": 0.6765289619996364, "support": 5828}, "weighted avg": {"precision": 0.9939653525838978, "recall": 0.9950240219629375, "f1-score": 0.993757311539428, "support": 5828}}, "SQuAD_EM": 96.66666666666667, "SQuAD_F1": 96.66666666666667, "SQuAD_total": 600.0, "SQuAD_Pos. EM": 30.76923076923077, "SQuAD_Pos. F1": 30.76923076923077, "SQuAD_Pos. EM_F1_total": 26.0, "F1": 0.44000000000000006, "P": 0.6875, "R": 0.3235294117647059, "TP": 11.0, "FP": 5.0, "FN": 23.0, "N": 34.0}, "relation": {"CM": [[5808, 8], [6, 6]], "Classification Report": {"0": {"precision": 0.9989680082559339, "recall": 0.9986244841815681, "f1-score": 0.9987962166809975, "support": 5816}, "1": {"precision": 0.42857142857142855, "recall": 0.5, "f1-score": 0.4615384615384615, "support": 12}, "accuracy": 0.9975978037062457, "macro avg": {"precision": 0.7137697184136812, "recall": 0.7493122420907841, "f1-score": 0.7301673391097295, "support": 5828}, "weighted avg": {"precision": 0.9977935472133439, "recall": 0.9975978037062457, "f1-score": 0.9976899893196883, "support": 5828}}, "SQuAD_EM": 98.16971713810317, "SQuAD_F1": 98.16971713810317, "SQuAD_total": 601.0, "SQuAD_Pos. EM": 66.66666666666667, "SQuAD_Pos. F1": 66.66666666666667, "SQuAD_Pos. 
EM_F1_total": 9.0, "F1": 0.5, "P": 0.4, "R": 0.6666666666666666, "TP": 8.0, "FP": 12.0, "FN": 4.0, "N": 12.0}, "when": {"CM": [[5812, 3], [9, 4]], "Classification Report": {"0": {"precision": 0.9984538739048273, "recall": 0.9994840928632847, "f1-score": 0.9989687177724303, "support": 5815}, "1": {"precision": 0.5714285714285714, "recall": 0.3076923076923077, "f1-score": 0.4, "support": 13}, "accuracy": 0.9979409746053535, "macro avg": {"precision": 0.7849412226666994, "recall": 0.6535882002777962, "f1-score": 0.6994843588862152, "support": 5828}, "weighted avg": {"precision": 0.9975013466343758, "recall": 0.9979409746053535, "f1-score": 0.9976326516552303, "support": 5828}}, "SQuAD_EM": 98.00332778702163, "SQuAD_F1": 98.00332778702163, "SQuAD_total": 601.0, "SQuAD_Pos. EM": 30.76923076923077, "SQuAD_Pos. F1": 30.76923076923077, "SQuAD_Pos. EM_F1_total": 13.0, "F1": 0.3571428571428571, "P": 0.3333333333333333, "R": 0.38461538461538464, "TP": 5.0, "FP": 10.0, "FN": 8.0, "N": 13.0}, "where": {"CM": [[5594, 84], [67, 83]], "Classification Report": {"0": {"precision": 0.9881646352234588, "recall": 0.9852060584712927, "f1-score": 0.9866831290237235, "support": 5678}, "1": {"precision": 0.49700598802395207, "recall": 0.5533333333333333, "f1-score": 0.5236593059936909, "support": 150}, "accuracy": 0.9740905971173645, "macro avg": {"precision": 0.7425853116237054, "recall": 0.7692696959023131, "f1-score": 0.7551712175087072, "support": 5828}, "weighted avg": {"precision": 0.9755232836311585, "recall": 0.9740905971173645, "f1-score": 0.974765906399409, "support": 5828}}, "SQuAD_EM": 80.41074249605056, "SQuAD_F1": 81.40512550465158, "SQuAD_total": 633.0, "SQuAD_Pos. EM": 56.0, "SQuAD_Pos. F1": 60.196296296296275, "SQuAD_Pos. EM_F1_total": 150.0, "F1": 0.5228758169934641, "P": 0.5128205128205128, "R": 0.5333333333333333, "TP": 80.0, "FP": 76.0, "FN": 70.0, "N": 150.0}}

Thanks, Xiangyu

s4zong commented 4 years ago

Hmm, I am not sure where the issue is. I just downloaded the GitHub repo and reran it, and I get similar results by following the instructions I wrote.

MY REPRODUCTION

[I] number of tweets 2423 for positive

slot            chunks  F1
age             15.0    0.6666666666666666
close_contact   40.0    0.3174603174603175
employer        80.0    0.38759689922480617
gender_male     129.0   0.6612903225806451
gender_female   50.0    0.6744186046511629
name            479.0   0.7817047817047817
recent_travel   35.0    0.3529411764705882
relation        11.0    0.6666666666666667
when            14.0    0.4444444444444445
where           149.0   0.4888888888888889

YOUR REPRODUCTION

slot            chunks  F1
age             15.0    0.4
close_contact   39.0    0.31707317073170727
employer        77.0    0.4246575342465754
gender_male     126.0   0.6486486486486486
gender_female   47.0    0.6829268292682927
name            471.0   0.7318181818181818
recent_travel   34.0    0.44000000000000006
relation        12.0    0.5
when            13.0    0.3571428571428571
where           150.0   0.5228758169934641

So the differences are in age, when, recent_travel and relation, all of which have very few evaluation chunks in the test set. By the way, we may not want to use macro-F1 here, as the number of chunks varies a lot across questions. We may want to use micro-F1 instead.
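To make the comparison concrete, here is a minimal sketch that recomputes macro-average F1 for both runs from the per-slot test F1 values listed above and ranks the slots by how far apart the two runs are. The dictionary and function names (`s4zong_f1`, `glovesme_f1`, `macro_average`) are made up for this illustration and are not part of the repository.

```python
# Per-slot test F1, copied from the two listings above.
# Variable names are hypothetical; they do not come from the repo.
s4zong_f1 = {
    "age": 0.6666666666666666,
    "close_contact": 0.3174603174603175,
    "employer": 0.38759689922480617,
    "gender_male": 0.6612903225806451,
    "gender_female": 0.6744186046511629,
    "name": 0.7817047817047817,
    "recent_travel": 0.3529411764705882,
    "relation": 0.6666666666666667,
    "when": 0.4444444444444445,
    "where": 0.4888888888888889,
}
glovesme_f1 = {
    "age": 0.4,
    "close_contact": 0.31707317073170727,
    "employer": 0.4246575342465754,
    "gender_male": 0.6486486486486486,
    "gender_female": 0.6829268292682927,
    "name": 0.7318181818181818,
    "recent_travel": 0.44000000000000006,
    "relation": 0.5,
    "when": 0.3571428571428571,
    "where": 0.5228758169934641,
}


def macro_average(per_slot):
    """Unweighted mean of per-slot F1: every slot counts equally."""
    return sum(per_slot.values()) / len(per_slot)


macro_s4zong = macro_average(s4zong_f1)
macro_glovesme = macro_average(glovesme_f1)

# Slots sorted by absolute F1 gap between the two runs, largest first.
gaps = sorted(
    ((slot, s4zong_f1[slot] - glovesme_f1[slot]) for slot in s4zong_f1),
    key=lambda item: abs(item[1]),
    reverse=True,
)

print(f"macro-F1, s4zong run:   {macro_s4zong:.4f}")
print(f"macro-F1, Glovesme run: {macro_glovesme:.4f}")
for slot, gap in gaps[:4]:
    print(f"{slot:<14} {gap:+.4f}")
```

With the figures above this gives roughly 0.544 and 0.503, a gap of about 4 points, and the four largest per-slot gaps are exactly the low-count slots: age, relation, when and recent_travel.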

Glovesme commented 4 years ago

Thanks. In my case, the number of downloaded tweets for tested positive is 2400. Could this be the reason the results are slightly different?

bekou commented 4 years ago

@viczong: The issue is that, in terms of macro-average F1, there is a difference of 4 percentage points between the results produced by the code in your GitHub repository and the results presented in the paper. I see your point about using micro-F1. However, when the data are highly imbalanced, it may also be important to look at the F1 scores of the under-represented classes.

So, for the competition, will the ranking be based on micro-F1?

Best, Giannis

s4zong commented 4 years ago

So I think the differences in the dataset cause the different F1 scores for the subtasks, and these differences can be amplified by macro-averaging. For example, if the model gets 4 more correct for the age slot in the test set, macro-F1 increases by around 3 percentage points. In this sense macro-F1 might not be suitable.
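The sensitivity can be checked with a small sketch. It uses the test-set TP/FP/FN counts from the JSON pasted earlier in this thread and assumes, purely for illustration, that "4 more correct" means 4 false negatives for age become true positives while everything else stays fixed. The helper names (`f1`, `macro_f1`, `micro_f1`) are hypothetical and not taken from the repository.

```python
# (TP, FP, FN) per slot on the test set, from the JSON pasted above.
counts = {
    "age": (4, 1, 11),
    "close_contact": (13, 30, 26),
    "employer": (31, 38, 46),
    "gender_male": (72, 24, 54),
    "gender_female": (28, 7, 19),
    "name": (322, 87, 149),
    "recent_travel": (11, 5, 23),
    "relation": (8, 12, 4),
    "when": (5, 10, 8),
    "where": (80, 76, 70),
}


def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_slot):
    """Average the per-slot F1 scores, weighting every slot equally."""
    return sum(f1(*c) for c in per_slot.values()) / len(per_slot)


def micro_f1(per_slot):
    """Pool the counts over all slots first, then compute a single F1."""
    tp = sum(c[0] for c in per_slot.values())
    fp = sum(c[1] for c in per_slot.values())
    fn = sum(c[2] for c in per_slot.values())
    return f1(tp, fp, fn)


# Assumption: 4 age chunks move from false negative to true positive.
shifted = dict(counts)
tp, fp, fn = counts["age"]
shifted["age"] = (tp + 4, fp, fn - 4)

macro_delta = macro_f1(shifted) - macro_f1(counts)
micro_delta = micro_f1(shifted) - micro_f1(counts)

print(f"macro-F1: {macro_f1(counts):.4f} -> {macro_f1(shifted):.4f} ({macro_delta:+.4f})")
print(f"micro-F1: {micro_f1(counts):.4f} -> {micro_f1(shifted):.4f} ({micro_delta:+.4f})")
```

Under that assumption macro-F1 moves by about 2.7 points while micro-F1 moves by about 0.3 points, since age contributes only 15 of the pooled chunks but a full tenth of the macro average.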

Yes, I agree that these metrics each have their pros and cons when dealing with highly imbalanced data, if we want to merge several performance numbers into a single one. We are working on a draft of a detailed evaluation plan that will be released soon. We might rank in several ways, e.g., rank everyone by micro-F1 and also rank by sub-class.

bekou commented 4 years ago

Great, thanks for your response. I agree that different metrics have their pros and cons. So we will stick with micro-F1 for now, since it is the one you use as well. Could you give us some high-level information about the ranking by sub-classes? Are you going to assign weights to some of the classes or to categories of events?

s4zong commented 4 years ago

Although it is not decided yet, I guess we would just generate a separate ranking for each subtask. After we release the evaluation plan, feel free to let us know what you think would be the best way to evaluate.

s4zong / extract_COVID19_events_from_Twitter

Issue reproducing the results of BERT model #11