huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

How to create Tokenizer.json? #1410

Closed · kenaii closed 5 months ago

kenaii commented 7 months ago

I have this tokenizer (vocab.json, merges.txt, added_tokens.json, and normalizer.json) and I want to convert it to the tokenizer.json format.

Is it possible to do this by taking an original tokenizer.json and replacing its fields with my tokenizer's data, like below?

import json

# Start from an existing tokenizer.json and overwrite its fields with this tokenizer's data
with open('hf/tokenizer.json') as f:
    data = json.load(f)

# merges.txt: skip the version header line and strip trailing newlines
with open('medium-tokenizer/merges.txt') as f:
    merges = [line.rstrip('\n') for line in f.readlines()[1:]]

with open('medium-tokenizer/vocab.json') as f:
    vocab = json.load(f)
with open('medium-tokenizer/added_tokens.json') as f:
    added_tokens = json.load(f)
with open('medium-tokenizer/normalizer.json') as f:
    normalizer = json.load(f)

data['added_tokens'] = added_tokens
data['normalizer'] = normalizer
data['model']['vocab'] = vocab
data['model']['merges'] = merges

with open("tokenizer.json", "w") as outfile:
    json.dump(data, outfile)
ArthurZucker commented 6 months ago

Hey! You should use the transformers library to load the slow tokenizer into an XXTokenizerFast, which will automatically do the conversion if it is supported.
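
For example, assuming the slow tokenizer files in medium-tokenizer/ follow the GPT-2-style BPE layout (vocab.json + merges.txt), a minimal sketch of that route could look like the following; GPT2TokenizerFast is only a guess at the right fast class, so use the one that matches your model:

from transformers import GPT2TokenizerFast  # assumption: the slow tokenizer is GPT-2-style BPE

# Loading the slow files through the *Fast class performs the conversion automatically
tok = GPT2TokenizerFast.from_pretrained("medium-tokenizer")

# save_pretrained on a fast tokenizer writes tokenizer.json alongside the other files
tok.save_pretrained("medium-tokenizer-fast")

# or write only the tokenizers-library file
tok.backend_tokenizer.save("tokenizer.json")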

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.