mlc-ai / tokenizers-cpp

Universal cross-platform tokenizers binding to HF and sentencepiece
Apache License 2.0
211 stars 47 forks

If the model doesn't have a tokenizer.json file, what should I do? #13

Closed wolf-li closed 3 months ago

wolf-li commented 10 months ago

Not every model on the Hugging Face Hub has a tokenizer.json file. Marian models, for example, ship a set of other files instead: 'tokenizer_config.json', 'special_tokens_map.json', 'vocab.json', 'source.spm', 'target.spm', and 'added_tokens.json'. That is a lot of files. What should I do?

FFengIll commented 9 months ago

vocab.json can be loaded and parsed to obtain the tokenizer's vocabulary information.
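
A minimal sketch of that idea in Python, assuming vocab.json is a flat JSON object that maps token strings to integer ids. The helper names `parse_vocab` and `load_vocab` are hypothetical and not part of tokenizers-cpp. Note that this only recovers the token/id table; as far as I understand, for Marian the splitting of text into pieces is still done by the .spm files.

```python
import json
from pathlib import Path


def parse_vocab(text):
    """Parse vocab.json content, assumed to be a flat {token: id} JSON object."""
    token_to_id = json.loads(text)
    # Build the reverse table so ids can be mapped back to tokens when decoding.
    id_to_token = {idx: token for token, idx in token_to_id.items()}
    return token_to_id, id_to_token


def load_vocab(path):
    """Read a vocab.json file from disk and parse it."""
    return parse_vocab(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Toy vocabulary for illustration only; a real vocab.json is much larger.
    token_to_id, id_to_token = parse_vocab('{"<unk>": 0, "hello": 1, "world": 2}')
    # Unknown tokens fall back to the id of "<unk>" (assumed to be present).
    ids = [token_to_id.get(t, token_to_id["<unk>"]) for t in ["hello", "there"]]
    print(ids)
    print([id_to_token[i] for i in ids])
```

For a real model directory, `load_vocab("vocab.json")` would return the same two tables read from disk.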

tqchen commented 3 months ago

It seems that one common approach so far is to convert the other tokenizer format into HF's tokenizer.json format.