sy-wada / blue_benchmark_with_transformers

Implementation of the BLUE benchmark with Transformers.
https://arxiv.org/abs/2005.07202
Apache License 2.0

How to choose articles "closely related to human beings" #4

Closed: usuyama closed this issue 4 years ago

usuyama commented 4 years ago

Thanks for releasing this great repo!

I'm wondering about the criteria for the fP dataset. How did you choose "closely related to human beings"?

Focused PubMed abstracts (fP): articles more related to human beings.

We investigated whether the BERT model trained via our method using PubMed articles that were closely related to human beings (focused PubMed abstracts) as Convoy and using other PubMed abstracts as Escort would achieve better performance in biomedical text-mining tasks than those of the other BERT models.

Thanks, Naoto

sy-wada commented 4 years ago

Thank you for your question!

For the paper, we extracted the articles with a simple filter based on MeSH Tree Numbers. The paper is currently under submission to a journal, and we will provide this filter, if necessary, as the peer review progresses.

The following is an overview. To give a sense of scale when selecting by MeSH ID: articles containing "Diseases [C]", for example, would cover about 11GB of the 20GB of PubMed abstracts we collected. Because PubMed contains a lot of basic research that targets non-human subjects, we used the MeSH Tree Structures to focus on articles that are likely related to human medicine. For example, articles including Technology, Industry, and Agriculture [J], Information Science [L], Plant Structures [A18], Fungal Structures [A19], Bacterial Structures [A20], ..., or Viral Structures [A21] are excluded, resulting in the focused PubMed abstracts (1.8GB).
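Since the actual filter has not been released, the following is only a minimal sketch of what an exclusion filter on MeSH Tree Numbers might look like. It assumes each article's MeSH descriptors have already been resolved to tree numbers (a descriptor can map to more than one). The names `EXCLUDED_BRANCHES`, `in_branch`, and `is_focused` are hypothetical, the exclusion list contains only the branches named above (the original list is elided with "..."), and dropping articles without any tree number is an assumption, not something stated in this thread.

```python
from typing import Iterable, Sequence

# Branches named in the overview above. The original list is elided ("..."),
# so this tuple is incomplete and for illustration only.
EXCLUDED_BRANCHES = ("J", "L", "A18", "A19", "A20", "A21")


def in_branch(tree_number: str, branch: str) -> bool:
    """Return True if tree_number lies in the given MeSH branch."""
    if len(branch) == 1:
        # Top-level category such as "J": every tree number starts with its letter.
        return tree_number.startswith(branch)
    # Sub-branch such as "A18": match the node itself or any descendant ("A18.xxx").
    return tree_number == branch or tree_number.startswith(branch + ".")


def is_focused(
    tree_numbers: Iterable[str],
    excluded: Sequence[str] = EXCLUDED_BRANCHES,
) -> bool:
    """Keep an article only if none of its tree numbers fall in an excluded branch."""
    numbers = list(tree_numbers)
    if not numbers:
        # Assumption: articles without MeSH tree numbers are dropped.
        return False
    return not any(in_branch(tn, b) for tn in numbers for b in excluded)


# Illustrative tree numbers, not taken from the paper.
print(is_focused(["C04.557.337"]))             # disease branch only
print(is_focused(["C04.557.337", "A18.024"]))  # also tagged under Plant Structures
```

Matching on `branch + "."` rather than a bare prefix avoids accidentally treating a sibling node with a longer number as a descendant. Whether the original filter also required an article to fall under specific branches is not stated here, so this sketch only implements the exclusion side.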

Please refer to MeSH Browser, too.

Thanks, Shoya

usuyama commented 4 years ago

Thanks for your prompt reply. Good luck with your journal submission!