Closed danielhers closed 4 years ago
They use the same evaluation metric (MRP ALL-F1), but are evaluated on different datasets.
Table 1: Trained on the MRP training data, tested on the MRP test data (the CoNLL official setting). Table 2: We split the MRP training data into train/dev/test (8:1:1) to conduct a preliminary experiment. That is to say, we train on 80% of the MRP training data and test on 10% of it.
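The 8:1:1 split described above could be reproduced with something like the following sketch (the function name and fixed seed are illustrative, not the authors' actual script):

```python
import random

def split_811(examples, seed=0):
    """Shuffle a list of examples and split it into train/dev/test at 8:1:1."""
    rng = random.Random(seed)
    examples = list(examples)
    rng.shuffle(examples)
    n = len(examples)
    n_train = int(n * 0.8)  # 80% for training
    n_dev = int(n * 0.1)    # 10% for development
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:]  # remaining ~10% for testing
    return train, dev, test

train, dev, test = split_811(range(100))
# With 100 examples: 80 train, 10 dev, 10 test
```

Fixing the random seed keeps the split reproducible across runs, which matters when comparing models on the held-out 10%.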
I'm sorry that the paper didn't explain this clearly; it only mentions the "MRP split dataset" in the Table 2 caption.
Understood, thanks!
I'm trying to understand Table 2 in the CoNLL 2019 paper: since BERT (base) is used for all models in the submission (maybe except AMR? #4), shouldn't the MRP scores be the same as the scores in Table 1? Is it the same metric ("cross-framework evaluation metric", "ALL-F1")? For example, the UCCA MRP scores of 92.8 with BERT and 87.5 with GloVe are much higher than the official score of 81.67.