How to generate or convert vocab.json, merges.txt, and config.json to match huggingface/transformers requirements?

lopuhin / transformer-lm

Transformer language model (GPT-2) with sentencepiece tokenizer

How to generate or convert vocab.json, merges.txt, and config.json to match huggingface/transformers requirements? #29

Open ycat3 opened 4 years ago

ycat3 commented 4 years ago

Hi, great work! I am migrating to the Japanese GPT-2 version for PyTorch. The pretrained Japanese GPT-2 is working fine, thanks a lot. However, I am missing the vocab.json, merges.txt and config.json files in the run-root directory. Could you suggest how to generate them?

Best regards.

lopuhin commented 4 years ago

Thanks @ycat3 !

I miss vocab.json, merges.txt and config.json files in the run-root directory.

Sorry, I am not familiar with these files. To clarify, could you please tell me which repo you are migrating from and which repo you are migrating to?

ycat3 commented 4 years ago

Thank you for the quick reply. There is a German GPT-2 repo in huggingface/transformers: https://huggingface.co/anonymous-german-nlp/german-gpt2 This repo might be a fork of yours.

lopuhin commented 4 years ago

Thanks for the pointers - that would essentially mean converting the model to huggingface/transformers format, right? That would be a great feature to add and looks feasible to do, but it's not currently supported.
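As a rough illustration of the config.json part of such a conversion, here is a minimal sketch that maps a dictionary of hyperparameters to GPT-2 style config keys. The source key names (`n_vocab`, `n_ctx`, `n_embed`, `n_head`, `n_layer`) and the idea that they live under an `hparams` entry of a `params.json` file in run-root are assumptions that should be checked against the actual files; the helper names `build_config` and `write_config` are hypothetical. It does not convert the weights or the tokenizer, which would be separate steps.

```python
import json
from pathlib import Path

# Assumed mapping from transformer-lm hyperparameter names (left) to
# GPT-2 style config.json keys (right). Verify both sides before relying on it.
HPARAM_TO_CONFIG = {
    "n_vocab": "vocab_size",
    "n_ctx": "n_positions",
    "n_embed": "n_embd",
    "n_head": "n_head",
    "n_layer": "n_layer",
}


def build_config(hparams: dict) -> dict:
    """Translate a hyperparameter dict into a GPT-2 style config dict."""
    config = {"model_type": "gpt2"}
    for src, dst in HPARAM_TO_CONFIG.items():
        if src not in hparams:
            raise KeyError(f"missing hyperparameter: {src}")
        config[dst] = hparams[src]
    return config


def write_config(params_path: Path, out_dir: Path) -> Path:
    """Read params.json (assumed layout) and write config.json into out_dir."""
    params = json.loads(params_path.read_text(encoding="utf8"))
    # Assumption: hyperparameters sit under "hparams"; fall back to a flat file.
    hparams = params.get("hparams", params)
    out_path = out_dir / "config.json"
    out_path.write_text(json.dumps(build_config(hparams), indent=2), encoding="utf8")
    return out_path
```

Keys that have no counterpart in the mapping are simply ignored, and a missing key raises an error rather than silently producing an incomplete config.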

ycat3 commented 4 years ago

I know huggingface/tokenizers generates merges.txt and vocab.json. https://github.com/huggingface/tokenizers
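For reference, a minimal sketch of how huggingface/tokenizers can produce these two files, assuming the `tokenizers` package is installed. The helper name `train_byte_level_bpe`, the corpus paths, and the parameter values are placeholders. Note that this trains a new byte-level BPE vocabulary from raw text; it does not convert an existing sentencepiece model, so a model already trained with a sentencepiece vocabulary would not match the resulting token ids.

```python
from pathlib import Path


def train_byte_level_bpe(corpus_files, out_dir, vocab_size=50000):
    """Train a byte-level BPE tokenizer and write vocab.json and merges.txt."""
    # Imported here so the optional dependency is only needed when training.
    from tokenizers import ByteLevelBPETokenizer

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=[str(f) for f in corpus_files],
        vocab_size=vocab_size,
        min_frequency=2,
        special_tokens=["<|endoftext|>"],
    )
    # save_model writes vocab.json and merges.txt into out_dir
    # and returns the paths of the files it created.
    return tokenizer.save_model(str(out_dir))
```

A call such as `train_byte_level_bpe(["corpus.txt"], "run-root")` would then leave both files in `run-root`, where `corpus.txt` stands for whatever plain-text training data is available.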

However, I doubt this is the solution. Anyway, your repo looks active, so I will open a pull request if I need to.

Best regards.