huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Converting model to pytorch #4712

Closed rlpatrao closed 4 years ago

rlpatrao commented 4 years ago

šŸ› Bug

Folks, I am trying to convert the BioBERT model to PyTorch. Here is what I have done so far:

1. For the vocab: I am trying to convert the vocab using the solution from #69 : tokenizer = BartTokenizer.from_pretrained('/content/biobert_v1.1_pubmed/vocab.txt')

I get : OSError: Model name '/content/biobert_v1.1_pubmed' was not found in tokenizers model name list (bart-large, bart-large-mnli, bart-large-cnn, bart-large-xsum). We assumed '/content/biobert_v1.1_pubmed' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

I don't have a vocab.json, so how do I convert the vocab for the tokenizer?
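The error above is consistent with a tokenizer-family mismatch: BartTokenizer expects a GPT-2-style BPE vocabulary (vocab.json plus merges.txt), whereas BioBERT ships a BERT-style WordPiece vocab.txt. A small stand-alone sketch of the file-layout difference (the helper name `guess_tokenizer_family` is hypothetical, purely for illustration):

```python
import os
import tempfile

def guess_tokenizer_family(model_dir):
    """Hypothetical helper: infer which tokenizer family the files in a
    checkpoint directory belong to, based on the vocab file layout."""
    files = set(os.listdir(model_dir))
    if {'vocab.json', 'merges.txt'} <= files:
        return 'bpe'        # GPT-2/BART-style byte-pair encoding
    if 'vocab.txt' in files:
        return 'wordpiece'  # BERT/BioBERT-style WordPiece
    return 'unknown'

# A BioBERT checkpoint directory only contains vocab.txt, so a
# BART (BPE) tokenizer cannot find the vocab.json/merges.txt it needs.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, 'vocab.txt'), 'w').close()
    print(guess_tokenizer_family(d))  # wordpiece
```

Since the checkpoint is WordPiece-based, a BERT-family tokenizer (rather than a BART one) is the matching loader for that vocab.txt.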

2. For the model: As the out-of-the-box pytorch_pretrained_bert.convert_tf_checkpoint_to_pytorch did not work, I customized it per #2 by adding:

excluded = ['BERTAdam', '_power', 'global_step']
# keep only variables whose names contain none of the excluded substrings
init_vars = [v for v in init_vars if not any(e in v[0] for e in excluded)]
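The filter above drops TensorFlow optimizer-state variables (Adam accumulators, power terms, the global step) that have no counterpart in the PyTorch model. Its behavior can be checked in isolation on a toy variable listing in the (name, shape) format that tf.train.list_variables returns; the names below are made up for illustration:

```python
excluded = ['BERTAdam', '_power', 'global_step']

def drop_excluded(init_vars, excluded=excluded):
    # Keep only (name, shape) pairs whose name contains none of the
    # excluded substrings -- equivalent to the filter shown above.
    return [v for v in init_vars if not any(e in v[0] for e in excluded)]

# Toy listing; the variable names are illustrative only.
toy = [
    ('bert/embeddings/word_embeddings', [28996, 768]),
    ('bert/embeddings/word_embeddings/adam_m_power', [1]),  # hypothetical
    ('global_step', []),
]
print([name for name, _ in drop_excluded(toy)])
# ['bert/embeddings/word_embeddings']
```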

With this, the model seems to convert fine. But when I load it using:

model = BartForConditionalGeneration.from_pretrained('path/to/model/biobert_v1.1_pubmed_pytorch.model')

I still get

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
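That UnicodeDecodeError is what appears when a binary, pickle-based PyTorch checkpoint is read as UTF-8 text: from_pretrained expects a directory containing a text config.json alongside the weights, so when handed a single binary file it ends up trying to parse the weights as text. Byte 0x80 in position 0 is the pickle protocol-2 magic (the legacy torch.save format is built on pickle), which stdlib pickle reproduces:

```python
import pickle

# A pickle-protocol-2 payload begins with the opcode byte 0x80.
payload = pickle.dumps({'weight': [0.1, 0.2]}, protocol=2)
assert payload[0] == 0x80

# Decoding that binary payload as UTF-8 raises exactly the error above,
# because 0x80 is not a valid UTF-8 start byte.
try:
    payload.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0x80 in position 0 ...
```

Separately, since BioBERT is a BERT checkpoint rather than a BART one, the BERT model and tokenizer classes are presumably the ones that match the converted weights.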

Can you please help me understand what is going on here?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.