anjani-dhrangadhariya closed this issue 4 years ago.
I would be interested in this question, too. Did you ever find out more about it?
I had a deadline, so I used BERT, but I will delve into it again.
Hi, sorry for the inconvenience. The BERT tokenizer is exactly the same as the BioBERT tokenizer. The files you are mentioning seem to be a newer version of BERT's vocabulary, which will be incompatible unless you modify the code. You can just use BioBERT's vocabulary, provided along with the pre-trained BioBERT files.
Thank you, @jhyuklee! Yeah, that's what I did in the end, and it seems to be working OK:
the_tokenizer = BertTokenizer.from_pretrained('biobert_f/biobert_v1.1_pubmed/vocab.txt')
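For reference, a slightly fuller version of that call, as a minimal sketch: it assumes the transformers library is installed, that the pretrained files are extracted under biobert_f/biobert_v1.1_pubmed/ as in the path above, and that do_lower_case should be False because BioBERT v1.1 is built on cased BERT-Base (the example sentence is made up for illustration).

from transformers import BertTokenizer

# Load the standard BERT WordPiece tokenizer from BioBERT's own vocab file.
# do_lower_case=False since BioBERT v1.1 is based on the cased BERT-Base model.
tokenizer = BertTokenizer.from_pretrained(
    'biobert_f/biobert_v1.1_pubmed/vocab.txt',
    do_lower_case=False,
)

# Quick sanity check on a biomedical sentence.
print(tokenizer.tokenize('The patient was administered tamoxifen.'))

# Saving afterwards writes the usual tokenizer files
# (vocab.txt, special_tokens_map.json, tokenizer_config.json).
tokenizer.save_pretrained('./my_saved_biobert_model_directory/')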
In order to use the tokenizer from BioBERT, the program requires BioBERT's tokenizer files.
tokenizer = BertTokenizer.from_pretrained('BioBERT_DIR/BioBERT_tokenizer_files')
These are the files generated when one saves a trained tokenizer using the following command.
tokenizer.save_pretrained('./my_saved_biobert_model_directory/')
This should save files like vocab.txt, special_tokens_map.json, and tokenizer_config.json.
However, I am not able to find these files in the pretrained BioBERT weights directory.
From this post, I understand that this is linked to issue #1. Does this mean one needs to use the tokenizer from BERT and not from BioBERT? Which BERT tokenizer will be compatible with BioBERT?
I will be grateful for your response.