thunlp / OpenNRE

An Open-Source Package for Neural Relation Extraction (NRE)

NYT10m Experiments -- Manual Evaluation Matters #348

suamin opened this issue 3 years ago

suamin commented 3 years ago

Hi,

Thank you for the latest contribution, "Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction"; having a manually annotated test set significantly improves our understanding of distantly supervised RE models. I have a few questions regarding the paper's experiments:

Q1: Is it possible to provide the pre-trained checkpoints for the BERT+sent/bag+AVG models?

Q2: Regarding evaluation, it is mentioned in the paper:

> Bag-level manual evaluation: We take our human-labeled test data for bag-level evaluation. Since annotated data are at the sentence-level, we construct bag-level annotations in the following way: For each bag, if one sentence in the bag has a human-labeled relation, this bag is labeled with this relation; if no sentence in the bag is annotated with any relation, this bag is labeled as N/A.

Can you elaborate on this further? Is this the same as the current eval logic in the BagRELoader code? Unfortunately, I cannot find 'anno_relation_list' in the manually created test set; does this require additional pre-processing?
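For what it's worth, my reading of the quoted protocol is the following (a minimal sketch; the field name `anno_relation` and the `'NA'` label are my placeholders, not necessarily the schema of the released annotation files):

```python
# Sketch of the bag-labeling rule quoted above. The field name
# 'anno_relation' and the 'NA' label are placeholders, not the
# actual schema of the released NYT10m annotation files.
def bag_label(bag_sentences):
    """Derive a bag-level label from sentence-level annotations.

    bag_sentences: list of dicts, each carrying a human-annotated
    'anno_relation' field ('NA' when the sentence has no relation).
    """
    for sent in bag_sentences:
        rel = sent.get('anno_relation', 'NA')
        if rel != 'NA':
            # Any sentence with a human-labeled relation gives the
            # whole bag that relation.
            return rel
    # No sentence in the bag is annotated with any relation.
    return 'NA'
```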

Q3: At evaluation (val, test) time, should the bag_size parameter be set to 0 (so that all sentences in a bag are considered, as reported in the paper -- though this case is not handled in the current BagRE framework) and entpair_as_bag set to True? A sketch of the setting I have in mind follows below.
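For concreteness, this is what I mean -- a sketch based on the repo's example training scripts, with placeholder paths and hyperparameters rather than the paper's exact configuration:

```python
import json
import opennre

# Sketch of the Q3 setting; file names and hyperparameters are
# placeholders taken from the style of the repo's example scripts,
# not the paper's exact configuration.
rel2id = json.load(open('benchmark/nyt10m/nyt10m_rel2id.json'))
sentence_encoder = opennre.encoder.BERTEncoder(
    max_length=128, pretrain_path='bert-base-uncased')
model = opennre.model.BagAverage(sentence_encoder, len(rel2id), rel2id)

framework = opennre.framework.BagRE(
    train_path='benchmark/nyt10m/nyt10m_train.txt',
    val_path='benchmark/nyt10m/nyt10m_val.txt',
    test_path='benchmark/nyt10m/nyt10m_test.txt',
    model=model,
    ckpt='ckpt/nyt10m_bert_bag_avg.pth.tar',
    batch_size=16,
    max_epoch=3,
    lr=2e-5,
    opt='adamw',
    bag_size=0,          # 0 = keep all sentences in each bag
    entpair_as_bag=True  # group test instances by entity pair
)
```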

Q4: Can you provide the scores on the NYT10m val set for the models reported in Table 4 of the paper? Do you also plan to release the P@k metrics and PR curves for those models?
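In case it helps to align on definitions, by P@k I mean precision among the top-k highest-scored non-NA predictions; a minimal sketch:

```python
import numpy as np

# Minimal sketch of P@k as commonly reported in DS-RE papers:
# precision among the k highest-scored non-NA predicted facts.
def precision_at_k(scores, correct, k):
    """scores: confidence of each predicted (bag, relation) fact;
    correct: 1 if the fact is true, else 0."""
    order = np.argsort(scores)[::-1]       # sort by score, descending
    topk = np.asarray(correct)[order[:k]]  # correctness of the top k
    return topk.mean()

# e.g. precision_at_k(scores, correct, 100) -> P@100
```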

Q5: Is BERT+sent (sentence-level) training performed with MultiLabelSentenceRE or plain SentenceRE?

Thank you in advance!

HenryPaik1 commented 3 years ago

@gaotianyu1350 Thanks for the great work. I have the same questions. @suamin Did you find answers to your questions? As for NYT10m, I trained BERT with the sentence-level framework and then tested it with the bag-level framework and the multi-label framework separately. The results show that bag-level testing (60.6, 35.32) does better than multi-label (58.39, 31.98). However, I still cannot reproduce the results in the paper. In case it is relevant, the bag-level aggregation I use is sketched below.
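Roughly, I score a bag with the sentence-level model like this (a sketch; whether the paper aggregates with a max or a mean over sentences is exactly what I am unsure about):

```python
import numpy as np

# Rough sketch of one way to score a bag with a sentence-level
# model: run the model on every sentence in the bag and take, per
# relation, the maximum sentence score. Max-pooling here is my own
# choice, not confirmed to be the paper's aggregation.
def bag_scores(sent_logits):
    """sent_logits: (num_sentences, num_relations) array of
    per-sentence scores from the sentence-level model."""
    return np.max(sent_logits, axis=0)  # shape: (num_relations,)
```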

suamin commented 3 years ago

@HenryPaik1 thanks for your input. I have not been able to find answers to the questions, and I still struggle to reproduce the paper's numbers. For BERT+sent+AVG, I get AUC=55.45, macro-F1=21.12 on val and AUC=47.49, macro-F1=11.23 on test with bag-level evaluation. The metric computation I use is sketched below.
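For comparison, this is how I compute the two metrics (with scikit-learn; whether it matches the repo's eval code exactly is part of my question):

```python
from sklearn.metrics import auc, f1_score, precision_recall_curve

# Sketch of how I compute the two numbers reported above; whether
# this matches the repo's own eval code is part of my question.
def pr_auc(y_true, y_score):
    """AUC of the precision-recall curve over all (bag, relation)
    candidate facts; y_true is 0/1, y_score the model confidence."""
    prec, rec, _ = precision_recall_curve(y_true, y_score)
    return auc(rec, prec)

def macro_f1(gold_rels, pred_rels):
    """Macro-averaged F1 over relation labels: one F1 per relation,
    then an unweighted mean."""
    return f1_score(gold_rels, pred_rels, average='macro')
```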