@karpathy
Thanks for the great lecture and implementation! As always, it was a pleasure.
I have tried to implement LlamaTokenizer (without using the sentencepiece backend) while staying as close to the minbpe implementation as possible. Essentially, it involves doing BPE on Unicode code points, adding a UTF-8 byte fallback, and using character coverage to handle rare characters during training. The implementation is available here. I haven't made a pull request because it's still not EXACTLY the same as LlamaTokenizer, but I'm hoping people can use it as a starting point.
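To illustrate the two ideas mentioned above, here is a minimal sketch of character coverage and UTF-8 byte fallback. The function names and the coverage threshold are my own for illustration, not from the repo; the byte-token spelling `<0xNN>` follows sentencepiece's convention.

```python
from collections import Counter

def coverage_charset(text, coverage=0.9995):
    # Keep the most frequent characters until they account for `coverage`
    # of all character occurrences; the rest are left to byte fallback.
    counts = Counter(text)
    total = sum(counts.values())
    kept, acc = set(), 0
    for ch, c in counts.most_common():
        if total and acc / total >= coverage:
            break
        kept.add(ch)
        acc += c
    return kept

def byte_fallback(ch):
    # Encode an out-of-coverage character as sentencepiece-style
    # byte tokens, one per UTF-8 byte.
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]
```

For example, `byte_fallback("é")` yields `["<0xC3>", "<0xA9>"]`, the two UTF-8 bytes of `é`.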
Please refer to the README.md (point 6) for details on the new functionality and the caveats/TODOs.
Best,
Haris