Closed danielhers closed 4 years ago
They use the same evaluation metric (MRP ALL-F1), but are evaluated on different datasets.
Table 1: Trained on the MRP training data, tested on the MRP test data (the CoNLL official setting). Table 2: We split the MRP training data into train/dev/test (8:1:1) to conduct a preliminary experiment. That is to say, we train on 80% of the MRP training data and test on 10% of it.
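The 8:1:1 split described above could be reproduced with something like the following sketch (the function name and fixed seed are illustrative, not the authors' actual script):

```python
import random

def split_811(examples, seed=0):
    """Shuffle a list of examples and split it into train/dev/test at 8:1:1."""
    rng = random.Random(seed)
    examples = list(examples)
    rng.shuffle(examples)
    n = len(examples)
    n_train = int(n * 0.8)  # 80% for training
    n_dev = int(n * 0.1)    # 10% for development
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:]  # remaining ~10% for testing
    return train, dev, test

train, dev, test = split_811(range(100))
# With 100 examples: 80 train, 10 dev, 10 test
```

Fixing the random seed keeps the split reproducible across runs, which matters when comparing models on the held-out 10%.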
I'm sorry that the paper didn't explain this clearly; it only mentions the "MRP split dataset" in the Table 2 caption.
Understood, thanks!
I'm trying to understand Table 2 in the CoNLL 2019 paper: since BERT (base) is used for all models in the submission (maybe except AMR? #4), shouldn't the MRP scores be the same as the scores in Table 1? Is it the same metric ("cross-framework evaluation metric", "ALL-F1")? For example, the UCCA MRP scores of 92.8 with BERT and 87.5 with GloVe are much higher than the official score of 81.67.