ncbi-nlp / BLUE_Benchmark

BLUE benchmark consists of five different biomedicine text-mining tasks with ten corpora.
https://arxiv.org/abs/1906.05474

Why is ChemProt evaluated with micro F1? #14

Closed mengzaiqiao closed 3 years ago

mengzaiqiao commented 3 years ago

Dear authors,

Happy new year! Thanks for sharing these datasets.

I am a bit confused about the ChemProt dataset, which is evaluated with micro-averaged F1. In this dataset (bert_data/ChemProt/), each entity pair (row) in a sentence carries exactly one relation label, so it is a multi-class classification task, which I would expect to be evaluated with macro-averaged or weighted-averaged F1. I also see that the original paper [1] uses macro-averaged F1 as its evaluation metric. Did you treat all relation labels of a sentence as a single instance during evaluation in this benchmark? If so, why?

[1] Chem-Prot: Peng et al. 2018. Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database: the journal of biological databases and curation, 2018.

Your prompt response will be highly appreciated.

Best, Zaiqiao

yfpeng commented 3 years ago

Hi Zaiqiao,

I think ChemProt used micro-F1. Please check their official evaluation script at http://www.biocreative.org/media/store/files/2017/evaluation-kit.zip.

In our paper, we also used micro-F1 for each fold and reported macro-F1 on 5-fold cross validation.
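
For illustration only, here is a minimal sketch of that protocol: micro-F1 computed within each fold over the positive relation classes, then an unweighted mean across folds. It is not the official evaluation script. The label names (`CPR:3`, `false`, and so on), the function names, and the reading of "macro-F1 on 5-fold cross validation" as a plain average of fold scores are assumptions.

```python
def micro_f1(gold, pred, negative_label="false"):
    """Micro-averaged F1 over every label except the negative one.

    Counts are pooled across all positive classes before computing
    precision and recall, so frequent classes weigh more than rare ones.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != negative_label:
            if g == p:
                tp += 1  # predicted a relation and it is the right one
            else:
                fp += 1  # predicted a relation that is wrong or absent
        if g != negative_label and g != p:
            fn += 1      # a gold relation was missed or mislabeled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_over_folds(fold_scores):
    """Unweighted mean of per-fold scores (assumed meaning of 'macro' here)."""
    return sum(fold_scores) / len(fold_scores)


# Toy data, not taken from ChemProt.
gold = ["CPR:3", "CPR:4", "false", "CPR:3"]
pred = ["CPR:3", "false", "CPR:4", "CPR:4"]
fold_score = micro_f1(gold, pred)
```

Because the negative class is left out of the counts, this micro-F1 is not the same as plain accuracy, even though each row has a single label.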

Best, Yifan

mengzaiqiao commented 3 years ago

Hi Yifan,

Thanks very much for your prompt reply. That helps me understand the paper much better.

I see that the code in this repository generates only one train/dev/test split. So I guess the performance on this benchmark should be reported as the micro-F1 on this single split. Am I right?

I also have another question. When I look at the ChemProt dataset in data_v0.2.zip, the numbers of instances in (train.tsv, dev.tsv, test.tsv) are (19460, 11820, 16943) respectively, which differ from the numbers shown in the table. Is that correct?

Best, Zaiqiao

yfpeng commented 3 years ago

The file includes both positive and negative instances, while the table only shows positive cases.
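
One way to check this is to count positive and negative rows separately. The sketch below assumes each TSV has a header row with a `label` column and that negative pairs are labeled `false`. Both are assumptions about the file layout, and `count_instances` is a hypothetical helper, not part of this repository.

```python
import csv


def count_instances(lines, label_column="label", negative_label="false"):
    """Return (positives, negatives) for tab-separated rows with a header.

    `lines` can be an open file object or any iterable of text lines.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    positives = negatives = 0
    for row in reader:
        if row[label_column] == negative_label:
            negatives += 1
        else:
            positives += 1
    return positives, negatives


# Toy rows standing in for a real file; the column names are assumed.
sample = [
    "index\tsentence\tlabel",
    "1\t@CHEMICAL$ inhibits @GENE$ .\tCPR:4",
    "2\t@CHEMICAL$ was given with @GENE$ .\tfalse",
    "3\t@CHEMICAL$ and @GENE$ were measured .\tfalse",
]
sample_counts = count_instances(sample)

# With a real file (the path is up to you):
# with open("train.tsv", encoding="utf-8", newline="") as f:
#     print(count_instances(f))
```

If those assumptions hold, the positive count for each split should match the table, and positives plus negatives should match the file's row count.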

mengzaiqiao commented 3 years ago

Got it. Thanks very much.