Folks, I am trying to convert the Biobert model to Pytorch. Here are the things that I did so far:
1. For the vocab: I am trying to convert the vocab using solution from #69 :
tokenizer = BartTokenizer.from_pretrained('/content/biobert_v1.1_pubmed/vocab.txt')
I get :
OSError: Model name '/content/biobert_v1.1_pubmed' was not found in tokenizers model name list (bart-large, bart-large-mnli, bart-large-cnn, bart-large-xsum). We assumed '/content/biobert_v1.1_pubmed' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
I don't have a vocab.json, so how do I convert the vocab for the tokenizer?
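For reference, the error is about a format mismatch rather than a missing file: BioBERT ships a BERT-style vocab.txt (one WordPiece token per line, line number = token id), while BartTokenizer looks for BPE files (vocab.json plus merges.txt), so it cannot use the vocab.txt at all. A minimal stdlib-only sketch of what the BERT vocab format contains (the token list here is a made-up toy vocabulary for illustration):

```python
import os
import tempfile

# A toy BERT-style vocab.txt: one WordPiece token per line,
# where the line number is the token id.
vocab_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "bio", "##bert"]
tmpdir = tempfile.mkdtemp()
vocab_path = os.path.join(tmpdir, "vocab.txt")
with open(vocab_path, "w") as f:
    f.write("\n".join(vocab_tokens))

# Load it the way BERT-family tokenizers do: a token -> id mapping.
with open(vocab_path) as f:
    vocab = {tok.rstrip("\n"): i for i, tok in enumerate(f)}

# BART's vocab.json is instead a JSON dict of BPE tokens, paired with a
# merges.txt of merge rules; the two formats are not interchangeable,
# which is why BartTokenizer rejects this file.
print(vocab["##bert"])
```

Since BioBERT is a BERT checkpoint, pointing a BERT-family tokenizer (e.g. BertTokenizer) at the vocab.txt is presumably the intended route, rather than converting the vocab to BART's format.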
2. For the model: As the out-of-the-box pytorch_pretrained_bert.convert_tf_checkpoint_to_pytorch did not work, I customized it per #2 by adding:
excluded = ['BERTAdam', '_power', 'global_step']
# drop optimizer-only variables from the TF checkpoint before conversion
init_vars = [v for v in init_vars if all(e not in v[0] for e in excluded)]
With this the model 'seems' to convert fine. But when I load it using:
model = BartForConditionalGeneration.from_pretrained('path/to/model/biobert_v1.1_pubmed_pytorch.model')
I still get
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
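For reference, this UnicodeDecodeError typically means a binary file is being read where text is expected: from_pretrained wants a directory containing a config.json and a pytorch_model.bin (named exactly so), so passing the raw .model file path makes it try to parse pickled bytes as UTF-8. The failing byte 0x80 is the first byte of a pickle stream, which is what torch.save writes. A minimal stdlib-only reproduction of that exact error:

```python
# torch.save produces a pickle stream; pickle protocol 2 begins with the
# PROTO opcode, byte 0x80 -- the same byte named in the traceback.
data = b"\x80\x02"

try:
    data.decode("utf-8")
    failed_byte = None
except UnicodeDecodeError as e:
    # e.start is the offset of the undecodable byte (position 0 here)
    failed_byte = data[e.start]

print(hex(failed_byte))
```

Note also that BioBERT is a BERT checkpoint, so loading it into a BART class like BartForConditionalGeneration is unlikely to be the right target even once the file layout is fixed; a BERT-family class is presumably what's needed.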
Can you please help me understand what is going on here?