ncbi / BioRED

How did BioRED process relations that span multiple sentences? #7

Open Meiling-Sun opened 4 months ago

Meiling-Sun commented 4 months ago

Hi, thanks for this amazing work. I have some questions. The annotation is at the abstract level, but when you use the PubMedBERT model for relation extraction, how does the tokenizer handle sentence segmentation? As far as I know, the maximum input length of BERT is 512 tokens, so how do you proceed if an abstract is longer than 512 tokens? My other question is about annotation: how did you handle coreference? Did you also annotate pronouns such as 'it' and 'this' as entities? Do they become noise for the NER task? Before the RE task, do you replace them with the original entity names, keep them as they are, or use some other strategy?

ptlai commented 4 months ago

Hi @Meiling-Sun, we don't handle abstracts longer than 512 tokens in the PubMedBERT model. If you would like to handle them, you may consider using the "stride" parameter of Hugging Face's tokenizer.
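
To illustrate what a stride does, here is a minimal sketch in plain Python, not the BioRED code and not the tokenizer's own implementation. It assumes the abstract has already been converted to a list of token IDs, and that `stride` means the number of tokens shared by consecutive windows. The function name `chunk_with_stride` and the default values are hypothetical. A real tokenizer also reserves room for special tokens such as `[CLS]` and `[SEP]`, which this sketch ignores.

```python
def chunk_with_stride(token_ids, max_length=512, stride=128):
    """Split token_ids into windows of at most max_length tokens.

    Consecutive windows overlap by `stride` tokens, so a relation whose
    entities fall near a window boundary still appears whole in one window.
    (Hypothetical helper for illustration only.)
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")

    step = max_length - stride  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        # Stop once the current window reaches the end of the sequence.
        if start + max_length >= len(token_ids):
            break
        start += step
    return chunks


if __name__ == "__main__":
    fake_ids = list(range(1000))  # stand-in for a long tokenized abstract
    for i, chunk in enumerate(chunk_with_stride(fake_ids)):
        print(i, chunk[0], chunk[-1], len(chunk))
```

Each window can then be passed to the model separately. How the per-window predictions are merged is a further design choice that this thread does not cover.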

No, our BioRED corpus doesn't contain pronoun annotations, so pronouns are not used in NER or RE. In our dataset, coreference cases are entity mentions that share the same database identifier, e.g. a MeSH or Entrez ID. For the RE task, I don't normalize the entities in the text; instead, I insert special tokens to tag those entities in the text.
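
As a rough sketch of this tagging idea, the snippet below wraps every mention of two chosen identifiers in marker tokens, so that repeated mentions of the same entity are tagged consistently. It is an illustration under assumptions, not the BioRED preprocessing code: the marker strings `[E1]`, `[/E1]`, `[E2]`, `[/E2]`, the function name `tag_entities`, the placeholder identifiers, and the `(start, end, identifier)` mention format are all hypothetical. Mentions are assumed not to overlap.

```python
def tag_entities(text, mentions, subject_id, object_id):
    """Wrap all mentions of subject_id and object_id in marker tokens.

    mentions: list of (start, end, identifier) character offsets into text.
    Mentions sharing an identifier are treated as the same entity, so each
    of them receives the same pair of markers. (Hypothetical helper.)
    """
    markers = {
        subject_id: ("[E1]", "[/E1]"),
        object_id: ("[E2]", "[/E2]"),
    }
    out = text
    # Insert from right to left so earlier character offsets stay valid.
    for start, end, ident in sorted(mentions, reverse=True):
        if ident in markers:
            open_tok, close_tok = markers[ident]
            out = out[:start] + open_tok + out[start:end] + close_tok + out[end:]
    return out


if __name__ == "__main__":
    text = "BRCA1 mutations raise breast cancer risk; BRCA1 is a tumor suppressor."
    first = text.index("BRCA1")
    second = text.index("BRCA1", first + 1)
    disease = text.index("breast cancer")
    # Placeholder identifiers; a real corpus would use MeSH or Entrez IDs.
    mentions = [
        (first, first + 5, "GENE_A"),
        (second, second + 5, "GENE_A"),
        (disease, disease + len("breast cancer"), "DISEASE_B"),
    ]
    print(tag_entities(text, mentions, "GENE_A", "DISEASE_B"))
```

In practice the marker strings would also need to be added to the tokenizer's vocabulary so that they are not split into sub-word pieces.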

Meiling-Sun commented 4 months ago

Thank you very much for the reply :)