naver / biobert-pretrained

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Files for BioBERT tokenizer #11

Closed: anjani-dhrangadhariya closed this issue 4 years ago

anjani-dhrangadhariya commented 5 years ago

In order to use the tokenizer from BioBERT, the program requires the BioBERT tokenizer files.

tokenizer = BertTokenizer.from_pretrained('BioBERT_DIR/BioBERT_tokenizer_files')

These are the files generated when one saves a tokenizer using the following command:

tokenizer.save_pretrained('./my_saved_biobert_model_directory/')

This should save files with the following names:

  1. added_tokens.json
  2. special_tokens_map.json
  3. tokenizer_config.json
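
For reference, here is a minimal sketch of that save step with the transformers library; the model name and output directory are placeholders, not taken from this thread:

from transformers import BertTokenizer

# Load any BERT-style tokenizer (the model name here is just an example),
# then save it. save_pretrained() writes vocab.txt alongside the JSON files
# listed above (added_tokens.json only appears if extra tokens were added).
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
tokenizer.save_pretrained('./my_saved_biobert_model_directory/')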

However, I am not able to find these files in the pretrained BioBERT weights directory.

From this post, I understand that this is linked to issue #1. Does this mean one needs to use the tokenizer from BERT and not BioBERT? Which BERT tokenizer will be compatible with BioBERT?

I will be grateful for your response.

hdatteln commented 4 years ago

I would be interested in this question, too. Did you ever find out more about it?

anjani-dhrangadhariya commented 4 years ago

> I would be interested in this question, too. Did you ever find out more about it?

I had a deadline, so I used BERT, but I will look into it again.

jhyuklee commented 4 years ago

Hi, sorry for the inconvenience. The BERT tokenizer is exactly the same as the BioBERT tokenizer. The files you are mentioning seem to be from a newer version of BERT's vocabulary, which will be incompatible unless you modify the code. You can just use BioBERT's vocabulary, provided along with the pre-trained BioBERT files.
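
Concretely, that means pointing BertTokenizer at the vocab.txt that ships with the pre-trained weights. A minimal sketch, assuming the release was unpacked into a local directory called BioBERT_DIR (a placeholder); since BioBERT is built on cased BERT-base, lowercasing should stay disabled:

from transformers import BertTokenizer

# BioBERT_DIR is a placeholder for wherever the pre-trained files were unpacked.
# BioBERT uses a cased vocabulary, so keep do_lower_case=False.
tokenizer = BertTokenizer.from_pretrained('BioBERT_DIR/vocab.txt', do_lower_case=False)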

hdatteln commented 4 years ago

Thank you, @jhyuklee! Yeah, that's what I did in the end, and it seems to be working OK:

the_tokenizer = BertTokenizer.from_pretrained('biobert_f/biobert_v1.1_pubmed/vocab.txt')
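
For anyone verifying their own setup, a quick sanity check along these lines (the example sentence is arbitrary) should produce WordPiece tokens drawn from BioBERT's biomedical vocabulary:

from transformers import BertTokenizer

# The path is a placeholder for a local BioBERT v1.1 (PubMed) download.
the_tokenizer = BertTokenizer.from_pretrained('biobert_f/biobert_v1.1_pubmed/vocab.txt')
print(the_tokenizer.tokenize('The patient was administered ibuprofen.'))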