thunlp / FewRel

A Large-Scale Few-Shot Relation Extraction Dataset
https://thunlp.github.io/fewrel.html
MIT License

Can you elaborate more on the FewRel 2.0 DA dataset? #36

Closed by tsaoallen 4 years ago

tsaoallen commented 4 years ago

The FewRel 2.0 DA dataset looks interesting, as it can take quite a lot of effort to construct such a biomedical dataset. Can you help us understand more about how it was built?

1) Can you give an example of how you built the initial test set by aligning PubMed and UMLS (e.g., how did you determine the relations)?
2) How much annotation effort was required to validate the initial test set (e.g., 100 annotator hours)? Is biomedical knowledge a prerequisite for the annotators?
3) On which platform was the annotation job run (e.g., Amazon MTurk)?

Thanks,

Allen

gaotianyu1350 commented 4 years ago

Hi, thanks for your interest!

  1. We use simple name matching to link entities in UMLS to mentions in PubMed. Since medical entity names are usually complex, simple matching already gives a good linking result. UMLS provides relations between entities, so we can follow the standard distant supervision procedure: a sentence that mentions both entities of a UMLS relation is taken as a candidate instance of that relation.

  2. It takes roughly half a minute to one minute to annotate one sentence. Compared with a general-domain dataset, the annotation process is harder, but not much medical background is needed. Annotators do need some basic biomedical knowledge to identify the relation correctly, for example knowing what cytoplasm is.

  3. This is joint work with Tencent WeChat AI, so we used their platform.
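
For readers who want a concrete picture of point 1, here is a minimal sketch of name matching followed by distant supervision. It is not the code used to build FewRel 2.0: the toy knowledge base, the `part_of` label, and the function names `find_mentions` and `distant_label` are all hypothetical, and real UMLS data would be loaded from its release files rather than hard-coded.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: (head name, tail name) -> relation label.
# This single entry is made up for illustration and is not taken from UMLS.
KB_RELATIONS: Dict[Tuple[str, str], str] = {
    ("cytoplasm", "cell"): "part_of",
}


def find_mentions(sentence: str, names: List[str]) -> Dict[str, Tuple[int, int]]:
    """Map each name to the character span of its first whole-word,
    case-insensitive match in the sentence (simple name matching)."""
    spans: Dict[str, Tuple[int, int]] = {}
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        match = re.search(pattern, sentence, flags=re.IGNORECASE)
        if match:
            spans[name] = match.span()
    return spans


def distant_label(sentence: str, kb: Dict[Tuple[str, str], str]) -> List[dict]:
    """Distant supervision: if both entities of a KB relation are mentioned
    in the sentence, emit the sentence as a candidate instance of it."""
    names = sorted({name for pair in kb for name in pair})
    spans = find_mentions(sentence, names)
    instances = []
    for (head, tail), relation in kb.items():
        if head in spans and tail in spans:
            instances.append({
                "sentence": sentence,
                "head": head,
                "head_span": spans[head],
                "tail": tail,
                "tail_span": spans[tail],
                "relation": relation,
            })
    return instances


if __name__ == "__main__":
    example = "The cytoplasm fills the interior of the cell."
    for instance in distant_label(example, KB_RELATIONS):
        print(instance)
```

Labels produced this way are noisy, since co-occurrence of two entities does not guarantee that the sentence expresses the relation, which is why the human validation described in point 2 is still needed.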