Implementation of LlamaTokenizer (without sentencepiece)

@karpathy

Thanks for the great lecture and implementation! As always, it was a pleasure.

I have tried to implement LlamaTokenizer (without the sentencepiece backend) while staying as close to the minbpe implementation as possible. Essentially, it involves doing BPE on Unicode code points, falling back to UTF-8 bytes for characters outside the vocabulary, and using character coverage to handle rare characters during training. The implementation is available here. I haven't made a pull request because it is still not EXACTLY the same as LlamaTokenizer, but I am hoping people can use it as a starting point.
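To make the three ingredients above concrete, here is a minimal sketch. It is not the code from the linked implementation, nor sentencepiece's algorithm. All function names (`build_alphabet`, `to_symbols`, `most_frequent_pair`, `merge`, `train`) are hypothetical, and two simplifications are assumptions of this sketch: fallback bytes are represented as plain integers, and they are never merged.

```python
from collections import Counter

def build_alphabet(text, coverage):
    """Keep the most frequent characters until they cover `coverage` of the text."""
    counts = Counter(text)
    total = sum(counts.values())
    kept, covered = set(), 0
    for ch, n in counts.most_common():
        if covered / total >= coverage:
            break  # remaining (rare) characters are left out of the alphabet
        kept.add(ch)
        covered += n
    return kept

def to_symbols(text, alphabet):
    """Covered characters stay as str; uncovered ones become UTF-8 byte values (int)."""
    symbols = []
    for ch in text:
        if ch in alphabet:
            symbols.append(ch)
        else:
            symbols.extend(ch.encode("utf-8"))  # byte fallback
    return symbols

def most_frequent_pair(symbols):
    """Count adjacent pairs, skipping any pair that involves a fallback byte."""
    pairs = Counter(
        (a, b) for a, b in zip(symbols, symbols[1:])
        if isinstance(a, str) and isinstance(b, str)
    )
    return pairs.most_common(1)[0][0] if pairs else None

def merge(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with the joined token."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train(text, num_merges, coverage):
    """Run BPE over Unicode characters with byte fallback for uncovered ones."""
    alphabet = build_alphabet(text, coverage)
    symbols = to_symbols(text, alphabet)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(symbols)
        if pair is None:
            break
        symbols = merge(symbols, pair)
        merges.append(pair)
    return merges, symbols

merges, symbols = train("ababab€", num_merges=1, coverage=0.8)
print(merges)   # [('a', 'b')]
print(symbols)  # ['ab', 'ab', 'ab', 226, 130, 172]
```

In this toy input, `a` and `b` already cover 6 of 7 characters, so `€` falls outside the alphabet and is emitted as its three UTF-8 bytes (0xE2, 0x82, 0xAC) instead of an unknown token.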

Please refer to the README.md (point 6) for details on the new functionality and the caveats/TODOs.

Best, Haris

karpathy / minbpe

Implementation of LlamaTokenizer (without sentencepiece) #60