huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

How to create Tokenizer.json? #1410

Closed · kenaii closed 5 months ago

kenaii commented 7 months ago

I have this tokenizer (vocab.json, merges.txt, added_tokens.json, and normalizer.json) and I want to convert it to the tokenizer.json format.

Is it possible to do this by taking an original tokenizer.json and replacing its fields with my tokenizer's data, like below?

import json

# Start from an existing tokenizer.json and overwrite its fields with this tokenizer's data
with open('hf/tokenizer.json') as f:
    data = json.load(f)

# merges.txt: skip the version header line and strip trailing newlines
with open('medium-tokenizer/merges.txt') as f:
    merges = [line.rstrip('\n') for line in f.readlines()[1:]]

with open('medium-tokenizer/vocab.json') as f:
    vocab = json.load(f)
with open('medium-tokenizer/added_tokens.json') as f:
    added_tokens = json.load(f)
with open('medium-tokenizer/normalizer.json') as f:
    normalizer = json.load(f)

data['added_tokens'] = added_tokens
data['normalizer'] = normalizer
data['model']['vocab'] = vocab
data['model']['merges'] = merges

with open("tokenizer.json", "w") as outfile:
    json.dump(data, outfile)
ArthurZucker commented 6 months ago

Hey! You should use the transformers library to load the slow tokenizer into an XXTokenizerFast, which will automatically do the conversion if it is supported.
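
For example, assuming the slow tokenizer files in medium-tokenizer/ follow the GPT-2-style BPE layout (vocab.json + merges.txt), a minimal sketch of that route could look like the following; GPT2TokenizerFast is only a guess at the right fast class, so use the one that matches your model:

from transformers import GPT2TokenizerFast  # assumption: the slow tokenizer is GPT-2-style BPE

# Loading the slow files through the *Fast class performs the conversion automatically
tok = GPT2TokenizerFast.from_pretrained("medium-tokenizer")

# save_pretrained on a fast tokenizer writes tokenizer.json alongside the other files
tok.save_pretrained("medium-tokenizer-fast")

# or write only the tokenizers-library file
tok.backend_tokenizer.save("tokenizer.json")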

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.