huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Converting model to pytorch #4712

Closed rlpatrao closed 4 years ago

rlpatrao commented 4 years ago

šŸ› Bug

Folks, I am trying to convert the BioBERT model to PyTorch. Here is what I have done so far:

1. For the vocab: I am trying to convert the vocab using the solution from #69 : tokenizer = BartTokenizer.from_pretrained('/content/biobert_v1.1_pubmed/vocab.txt')

I get : OSError: Model name '/content/biobert_v1.1_pubmed' was not found in tokenizers model name list (bart-large, bart-large-mnli, bart-large-cnn, bart-large-xsum). We assumed '/content/biobert_v1.1_pubmed' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

I don't have a vocab.json, so how do I convert the vocab for the tokenizer?
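The error above is consistent with a tokenizer-family mismatch: BartTokenizer expects a GPT-2-style BPE vocabulary (vocab.json plus merges.txt), whereas BioBERT ships a BERT-style WordPiece vocab.txt. A small stand-alone sketch of the file-layout difference (the helper name `guess_tokenizer_family` is hypothetical, purely for illustration):

```python
import os
import tempfile

def guess_tokenizer_family(model_dir):
    """Hypothetical helper: infer which tokenizer family the files in a
    checkpoint directory belong to, based on the vocab file layout."""
    files = set(os.listdir(model_dir))
    if {'vocab.json', 'merges.txt'} <= files:
        return 'bpe'        # GPT-2/BART-style byte-pair encoding
    if 'vocab.txt' in files:
        return 'wordpiece'  # BERT/BioBERT-style WordPiece
    return 'unknown'

# A BioBERT checkpoint directory only contains vocab.txt, so a
# BART (BPE) tokenizer cannot find the vocab.json/merges.txt it needs.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, 'vocab.txt'), 'w').close()
    print(guess_tokenizer_family(d))  # wordpiece
```

Since the checkpoint is WordPiece-based, a BERT-family tokenizer (rather than a BART one) is the matching loader for that vocab.txt.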

2. For the model: As the out-of-the-box pytorch_pretrained_bert.convert_tf_checkpoint_to_pytorch did not work, I customized it per #2 by adding:

excluded = ['BERTAdam', '_power', 'global_step']
# keep only variables whose names contain none of the excluded substrings
init_vars = [v for v in init_vars if not any(e in v[0] for e in excluded)]
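The filter above drops TensorFlow optimizer-state variables (Adam accumulators, power terms, the global step) that have no counterpart in the PyTorch model. Its behavior can be checked in isolation on a toy variable listing in the (name, shape) format that tf.train.list_variables returns; the names below are made up for illustration:

```python
excluded = ['BERTAdam', '_power', 'global_step']

def drop_excluded(init_vars, excluded=excluded):
    # Keep only (name, shape) pairs whose name contains none of the
    # excluded substrings -- equivalent to the filter shown above.
    return [v for v in init_vars if not any(e in v[0] for e in excluded)]

# Toy listing; the variable names are illustrative only.
toy = [
    ('bert/embeddings/word_embeddings', [28996, 768]),
    ('bert/embeddings/word_embeddings/adam_m_power', [1]),  # hypothetical
    ('global_step', []),
]
print([name for name, _ in drop_excluded(toy)])
# ['bert/embeddings/word_embeddings']
```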

With this, the model seems to convert fine. But when I load it using:

model = BartForConditionalGeneration.from_pretrained('path/to/model/biobert_v1.1_pubmed_pytorch.model')

I still get

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
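That UnicodeDecodeError is what appears when a binary, pickle-based PyTorch checkpoint is read as UTF-8 text: from_pretrained expects a directory containing a text config.json alongside the weights, so when handed a single binary file it ends up trying to parse the weights as text. Byte 0x80 in position 0 is the pickle protocol-2 magic (the legacy torch.save format is built on pickle), which stdlib pickle reproduces:

```python
import pickle

# A pickle-protocol-2 payload begins with the opcode byte 0x80.
payload = pickle.dumps({'weight': [0.1, 0.2]}, protocol=2)
assert payload[0] == 0x80

# Decoding that binary payload as UTF-8 raises exactly the error above,
# because 0x80 is not a valid UTF-8 start byte.
try:
    payload.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0x80 in position 0 ...
```

Separately, since BioBERT is a BERT checkpoint rather than a BART one, the BERT model and tokenizer classes are presumably the ones that match the converted weights.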

Can you please help me understand what is going on here?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.