Closed: mengzaiqiao closed this issue 3 years ago.
Hi Zaiqiao,
I think ChemProt used micro-F1. Please check their official evaluation script at http://www.biocreative.org/media/store/files/2017/evaluation-kit.zip.
In our paper, we also used micro-F1 for each fold and reported macro-F1 on 5-fold cross validation.
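For what it's worth, a minimal sketch of one reading of that protocol (micro-F1 computed within each fold, then averaged over the 5 folds; the splitter and `train_and_predict` here are placeholders, not the repository's actual evaluation code):

```python
# Sketch only (assumed protocol, not the official evaluation script):
# micro-F1 is computed inside each fold, then averaged over the folds.
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

def cv_micro_f1(X, y, train_and_predict, n_splits=5, seed=0):
    # `train_and_predict(X_tr, y_tr, X_te)` is a placeholder for any classifier.
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, te in kf.split(X):
        y_pred = train_and_predict(X[tr], y[tr], X[te])
        scores.append(f1_score(y[te], y_pred, average="micro"))  # per-fold micro-F1
    return float(np.mean(scores)), scores  # the mean over folds is what gets reported
```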
Best, Yifan
Hi Yifan,
Thanks very much for your prompt reply. That helps me understand the paper much better.
I see that in the code of this repository, only one train/dev/test split is generated. So I guess the performance on this benchmark should be reported as the micro-F1 on this single split? Am I right?
I also have another question. When I look at the ChemProt dataset in data_v0.2.zip, the numbers of instances in (train.tsv, dev.tsv, test.tsv) are (19460, 11820, 16943) respectively, which differ from the numbers shown in the table. Is that correct?
Best, Zaiqiao
The file includes both positive and negative instances, while the table only shows positive cases.
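If it helps, a quick way to check that difference yourself; this assumes the TSV has a header with a `label` column and that negative pairs are marked `false`, so adjust the path and column names to the actual file layout:

```python
# Rough check (assumed file layout, not code from the repository):
# count all rows vs. positive rows in one of the ChemProt TSV files.
import pandas as pd

df = pd.read_csv("ChemProt/train.tsv", sep="\t")
print("total instances (incl. negatives):", len(df))
print("positive instances (as in the table):", (df["label"] != "false").sum())
print(df["label"].value_counts())  # breakdown per relation label
```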
Got it. Thanks very much.
Dear authors,
Happy new year! Thanks for sharing these datasets.
I am a bit confused about the ChemProt dataset, where the micro-average F1 is used for evaluation. However, in this dataset (bert_data/ChemProt/), each entity pair (row) in a sentence has only one relation label, so it is a multi-class classification task that should be evaluated with macro-average or weighted-average F1. I also see that the original paper [1] indeed uses macro-average F1 as the evaluation metric. Did you regard all relation labels of a sentence as a single instance during your evaluation in this benchmark? Why?
[1] ChemProt: Peng et al. 2018. Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database: The Journal of Biological Databases and Curation, 2018.
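To make the micro vs. macro distinction concrete, here is a toy sketch (synthetic labels, not the ChemProt data) of how the two averages diverge on an imbalanced multi-class task:

```python
# Toy illustration with synthetic labels: micro-F1 weights every instance
# equally, while macro-F1 averages per-class F1, so a single error on a
# rare class pulls macro-F1 down much more than micro-F1.
from sklearn.metrics import f1_score

y_true = ["CPR:3"] * 8 + ["CPR:4"] + ["CPR:9"]
y_pred = ["CPR:3"] * 8 + ["CPR:3"] + ["CPR:9"]  # one rare-class error

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))  # 0.90
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))  # ~0.65
```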
Your prompt response will be highly appreciated.
Best, Zaiqiao