naver / biobert-pretrained

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Files for BioBERT tokenizer #11

Closed: anjani-dhrangadhariya closed this issue 4 years ago

anjani-dhrangadhariya commented 5 years ago

In order to use the tokenizer from BioBERT, the program requires the BioBERT tokenizer files.

tokenizer = BertTokenizer.from_pretrained('BioBERT_DIR/BioBERT_tokenizer_files')

These are the files generated when one saves a tokenizer using the following command:

tokenizer.save_pretrained('./my_saved_biobert_model_directory/')

This should save files with the following names:

  1. added_tokens.json
  2. special_tokens_map.json
  3. tokenizer_config.json
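
For reference, here is a minimal sketch of that save step with the transformers library; the model name and output directory are placeholders, not taken from this thread:

from transformers import BertTokenizer

# Load any BERT-style tokenizer (the model name here is just an example),
# then save it. save_pretrained() writes vocab.txt alongside the JSON files
# listed above (added_tokens.json only appears if extra tokens were added).
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
tokenizer.save_pretrained('./my_saved_biobert_model_directory/')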

However, I am not able to find these files in the pretrained BioBERT weights directory.

From this post, I understand that this is linked to issue #1. Does this mean one needs to use the tokenizer from BERT and not BioBERT? Which BERT tokenizer will be compatible with BioBERT?

I will be grateful for your response.

hdatteln commented 4 years ago

I would be interested in this question, too. Did you ever find out more about it?

anjani-dhrangadhariya commented 4 years ago

> I would be interested in this question, too. Did you ever find out more about it?

I had a deadline, so I used BERT, but I will look into it again.

jhyuklee commented 4 years ago

Hi, sorry for the inconvenience. The BERT tokenizer is exactly the same as the BioBERT tokenizer. The files you are mentioning seem to be from a newer version of BERT's vocabulary, which will be incompatible unless you modify the code. You can just use BioBERT's vocabulary, provided along with the pre-trained BioBERT files.
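
Concretely, that means pointing BertTokenizer at the vocab.txt that ships with the pre-trained weights. A minimal sketch, assuming the release was unpacked into a local directory called BioBERT_DIR (a placeholder); since BioBERT is built on cased BERT-base, lowercasing should stay disabled:

from transformers import BertTokenizer

# BioBERT_DIR is a placeholder for wherever the pre-trained files were unpacked.
# BioBERT uses a cased vocabulary, so keep do_lower_case=False.
tokenizer = BertTokenizer.from_pretrained('BioBERT_DIR/vocab.txt', do_lower_case=False)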

hdatteln commented 4 years ago

Thank you, @jhyuklee! Yeah, that's what I did in the end, and it seems to be working OK:

the_tokenizer = BertTokenizer.from_pretrained('biobert_f/biobert_v1.1_pubmed/vocab.txt')
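
For anyone verifying their own setup, a quick sanity check along these lines (the example sentence is arbitrary) should produce WordPiece tokens drawn from BioBERT's biomedical vocabulary:

from transformers import BertTokenizer

# The path is a placeholder for a local BioBERT v1.1 (PubMed) download.
the_tokenizer = BertTokenizer.from_pretrained('biobert_f/biobert_v1.1_pubmed/vocab.txt')
print(the_tokenizer.tokenize('The patient was administered ibuprofen.'))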