naver / biobert-pretrained

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Domain Specific Pre-training Model #4

Closed: abhinandansrivastava closed this issue 5 years ago

abhinandansrivastava commented 5 years ago

Hi,

I have run the run_pretraining.py script on my domain-specific data.

It seems that only checkpoints are saved; I got two files, 0000020.params and 0000020.states.

How can I build or load a model from the .params and .states files in the checkpoint folder so that I can use it to get contextual embeddings?

Can someone please help me with this?

jhyuklee commented 5 years ago

Hi,

The run_pretraining.py script is exactly the same as the one in https://github.com/google-research/bert, so you can get help there. We used a modified version of the script (which is not shared) to handle multi-GPU and server-specific issues when saving the models, so the result may be quite different from what you'll get with the original script.

Thank you.

abhinandansrivastava commented 5 years ago

Hi, after running the BERT model I get an embedding for each word in a sentence, but I need a sentence embedding. How can I get that?

Thanks

Sriyella commented 5 years ago

Hi,

Is there any way to load this model with TensorFlow Hub's hub.Module()? If not, how can we use the model to get the embeddings?

Please suggest the way forward.

jhyuklee commented 5 years ago

Hi @abhinandansrivastava, you can use the [CLS] token for sentence embedding or classification. Thanks.
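
For reference, a minimal sketch of what that looks like with the original google-research/bert code (TF 1.x). The config path and sequence length below are illustrative, not something prescribed by this repo:

```python
# Rough sketch: building BioBERT with modeling.py from google-research/bert
# and taking a sentence embedding from the [CLS] position.
# The config path and max_seq_length are illustrative.
import tensorflow as tf
import modeling  # modeling.py from the google-research/bert repository

bert_config = modeling.BertConfig.from_json_file("biobert_v1.0/bert_config.json")
max_seq_length = 128

input_ids = tf.placeholder(tf.int32, [None, max_seq_length])
input_mask = tf.placeholder(tf.int32, [None, max_seq_length])
segment_ids = tf.placeholder(tf.int32, [None, max_seq_length])

model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False)

# The pooled output is a dense + tanh transform of the final-layer [CLS] state.
sentence_embedding = model.get_pooled_output()            # [batch, hidden]
# Alternatively, take the raw final-layer hidden state at position 0 ([CLS]).
cls_hidden_state = model.get_sequence_output()[:, 0, :]   # [batch, hidden]
```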

jhyuklee commented 5 years ago

Hi @Sriyella, we haven't tried hub.Module(). You can just take the last layer of BERT (or BioBERT) and save it.
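
A self-contained sketch of that approach, again using the google-research/bert code rather than hub.Module(); the directory name, checkpoint prefix, and example sentence are assumptions:

```python
# Rough sketch: computing and saving last-layer BioBERT token embeddings
# with the google-research/bert code (TF 1.x). Paths, checkpoint name, and
# the example sentence are illustrative.
import numpy as np
import tensorflow as tf
import modeling      # from google-research/bert
import tokenization  # from google-research/bert

BERT_DIR = "biobert_v1.0"   # assumed directory holding the released files
MAX_SEQ_LENGTH = 128

tokenizer = tokenization.FullTokenizer(
    vocab_file=BERT_DIR + "/vocab.txt", do_lower_case=False)

tokens = ["[CLS]"] + tokenizer.tokenize("The EGFR inhibitor was well tolerated.") + ["[SEP]"]
ids = tokenizer.convert_tokens_to_ids(tokens)
mask = [1] * len(ids)
# Pad up to the fixed sequence length.
ids += [0] * (MAX_SEQ_LENGTH - len(ids))
mask += [0] * (MAX_SEQ_LENGTH - len(mask))
segment = [0] * MAX_SEQ_LENGTH

bert_config = modeling.BertConfig.from_json_file(BERT_DIR + "/bert_config.json")
input_ids = tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH])
input_mask = tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH])
segment_ids = tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH])

model = modeling.BertModel(
    config=bert_config, is_training=False, input_ids=input_ids,
    input_mask=input_mask, token_type_ids=segment_ids,
    use_one_hot_embeddings=False)
last_layer = model.get_sequence_output()   # [batch, seq_len, hidden]

saver = tf.train.Saver()
with tf.Session() as sess:
    # Assumed checkpoint prefix; point this at the released BioBERT weights.
    saver.restore(sess, BERT_DIR + "/biobert_model.ckpt")
    embeddings = sess.run(last_layer, feed_dict={
        input_ids: [ids], input_mask: [mask], segment_ids: [segment]})

np.save("token_embeddings.npy", embeddings)  # save the last-layer vectors
```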

jhyuklee commented 5 years ago

If it's not related to the pre-trained weights of BioBERT, please report BioBERT-related issues at https://github.com/dmis-lab/biobert and BERT-related issues at https://github.com/google-research/bert.

pyturn commented 5 years ago

Hi,

Is there any way to load this model with TensorFlow Hub's hub.Module()? If not, how can we use the model to get the embeddings?

Please suggest the way forward.

I am also looking for the same thing. How can I use the pre-trained weights to get the embeddings?

jhyuklee commented 5 years ago

This might help! https://github.com/google-research/bert/issues/60
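
For anyone landing here later, the linked issue concerns extract_features.py from google-research/bert. A hedged sketch of reading its JSON-lines output, assuming it was run with --layers=-1; the file names below are placeholders and the field layout may differ in your version:

```python
# Rough sketch: turning extract_features.py output (JSON lines) into
# sentence vectors. Assumes --layers=-1, so each token has one "layers" entry.
import json
import numpy as np

sentence_vectors = []
with open("output.jsonl") as f:            # illustrative output path
    for line in f:
        record = json.loads(line)
        features = record["features"]      # one entry per word-piece token
        # Token 0 is [CLS]; its last-layer vector can serve as a sentence embedding.
        cls_vector = np.array(features[0]["layers"][0]["values"])
        # Or mean-pool the last-layer vectors over all tokens instead.
        mean_vector = np.mean(
            [np.array(tok["layers"][0]["values"]) for tok in features], axis=0)
        sentence_vectors.append(cls_vector)

np.save("sentence_embeddings.npy", np.array(sentence_vectors))
```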

abhinandansrivastava commented 5 years ago

Hi @jhyuklee, thanks for the reply.

Do we need to create our own vocab.txt after pre-training a domain-specific model? The model saved by the pre-training process does not come with vocab.txt or bert_config.json files.

If yes, then how?

Thanks

jhyuklee commented 5 years ago

Hi @abhinandansrivastava,

you don't have to create your own vocab.txt if you used the same vocab.txt and bert_config.json while pre-training. See https://github.com/naver/biobert-pretrained/issues/1.

Thanks.
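
As a side note, if your pre-training produced a standard TensorFlow checkpoint, a quick way to sanity-check that the reused vocab.txt and bert_config.json still match it is something like the sketch below; all paths and the checkpoint name are illustrative:

```python
# Rough sketch: consistency check between a reused vocab.txt/bert_config.json
# and a newly pre-trained TensorFlow checkpoint. Paths are illustrative.
import json
import tensorflow as tf

with open("biobert_v1.0/bert_config.json") as f:
    config = json.load(f)

with open("biobert_v1.0/vocab.txt") as f:
    vocab_size = sum(1 for _ in f)

assert vocab_size == config["vocab_size"], "vocab.txt does not match bert_config.json"

# List variables stored in the checkpoint; the word-embedding table's first
# dimension should equal vocab_size as well.
for name, shape in tf.train.list_variables("pretraining_output/model.ckpt-20"):
    if "word_embeddings" in name:
        print(name, shape)   # expect [vocab_size, hidden_size]
```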

jhyuklee commented 5 years ago

Embedding related issues are at https://github.com/dmis-lab/biobert/issues/23. Closing this issue.

abhinandansrivastava commented 5 years ago

Hi @jhyuklee, the BioBERT vocab.txt and the BERT uncased vocab.txt are different. How did you add new tokenized words to the BioBERT vocab.txt? Its tokens differ from those in the BERT-Base Uncased vocab.txt.

jhyuklee commented 5 years ago

Hi @abhinandansrivastava, we used the BERT-Base Cased vocabulary, as uppercase often matters in biomedical text. Thanks.
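
In practice this means building the tokenizer with do_lower_case=False so it matches the cased vocabulary. A small sketch with tokenization.py from google-research/bert; the path and example text are illustrative:

```python
# Rough sketch: tokenizing with the cased vocabulary. do_lower_case=False
# preserves case, which the cased vocab.txt expects. Paths are illustrative.
import tokenization  # tokenization.py from the google-research/bert repository

tokenizer = tokenization.FullTokenizer(
    vocab_file="biobert_v1.0/vocab.txt", do_lower_case=False)

tokens = tokenizer.tokenize("EGFR mutations in NSCLC patients")
print(tokens)  # word pieces keep their original casing
token_ids = tokenizer.convert_tokens_to_ids(tokens)
```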