naver / biobert-pretrained

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

BIOBERT corpus #17

Closed etetteh closed 4 years ago

etetteh commented 4 years ago

Congratulations on the BioBERT work. I am trying to train BioBERT from scratch with slight modifications, keeping the original vocab.txt. Could you please help me obtain the PubMed abstracts, the PMC full-text articles, and the original BERT corpus? I can't seem to find a way to get these files.

jhyuklee commented 4 years ago

Hi, did you check the README of our repository? There are links to the corpora. For the original BERT corpus, please check Google's repository. Thanks.

etetteh commented 4 years ago

Thank you for your reply, but the only links I'm seeing are those for the downstream tasks, not the PubMed data itself.


jhyuklee commented 4 years ago

Please check here. Thanks.

LivC193 commented 4 years ago

Hi, This is the function I use to get Abstracts as well as other information:

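import io
from Bio import Entrez, Medline

Entrez.email = 'you@example.com'  # placeholder; NCBI asks for a contact address

def fetch_abstract(pubmed_id):
    """Return the abstract for one PubMed ID, or None if it is unavailable."""
    try:
        # rettype='medline' requests the tagged MEDLINE format that
        # Medline.read parses; rettype='Abstract' returns plain text instead.
        handle = Entrez.efetch(db='pubmed',
                               retmode='text',
                               rettype='medline',
                               id=pubmed_id,
                               api_key='your api key')
        results = handle.read()
        handle.close()
    except Exception as err:
        print(f'Fetch failed for {pubmed_id}: {err}')
        return None

    medline_rec = Medline.read(io.StringIO(results))
    return medline_rec.get('AB')  # None when the record has no abstract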

AB is the MEDLINE field tag for the abstract. You need your own api_key from NCBI, and you should also read the usage restrictions.

etetteh commented 4 years ago

Thanks so much for sharing. I downloaded a file from PubMed that contains the PMIDs (PubMed IDs) and found a way to download the abstracts in chunks using rentrez. I wish I could download all of the abstracts at once, but it's all well.

I really appreciate your response.
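
For readers following the Biopython snippet earlier in the thread, the chunked download mentioned above (done there with rentrez, an R package) could be approximated in Python roughly as follows. This is only a sketch under assumptions: the function names `chunked`, `fetch_batch_biopython`, and `fetch_abstracts`, the batch size of 200, and the placeholder email address are all illustrative and do not come from the thread or the BioBERT authors.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` PMIDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_batch_biopython(pmids: List[str]) -> Dict[str, str]:
    """Fetch one batch from PubMed and map PMID -> abstract (needs network access)."""
    # Imported lazily so the helpers in this sketch load without Biopython installed.
    from Bio import Entrez, Medline

    Entrez.email = "you@example.com"  # placeholder; use your own contact address
    handle = Entrez.efetch(db="pubmed", id=",".join(pmids),
                           rettype="medline", retmode="text")
    try:
        # Records without an abstract have no 'AB' field and are skipped.
        return {rec["PMID"]: rec["AB"]
                for rec in Medline.parse(handle)
                if "PMID" in rec and "AB" in rec}
    finally:
        handle.close()


def fetch_abstracts(pmids: Sequence[str],
                    batch_size: int = 200,  # assumed value, tune as needed
                    fetch_batch: Callable[[List[str]], Dict[str, str]] = fetch_batch_biopython,
                    ) -> Dict[str, str]:
    """Collect abstracts for all PMIDs, one batch per request."""
    abstracts: Dict[str, str] = {}
    for batch in chunked(pmids, batch_size):
        abstracts.update(fetch_batch(batch))
    return abstracts
```

Passing the fetcher in as `fetch_batch` keeps the batching logic separate from the network call, so it can be tried offline with a stand-in function before pointing it at PubMed.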
