karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License

Implementation of LlamaTokenizer (without sentencepiece) #60

Open MaveriQ opened 6 months ago

MaveriQ commented 6 months ago

@karpathy

Thanks for the great lecture and implementation! As always, it was a pleasure.

I have tried to implement LlamaTokenizer without the sentencepiece backend, staying as close to the minbpe implementation as possible. Essentially, it involves running BPE on Unicode, falling back to UTF-8 bytes, and using character coverage to handle rare tokens during training. The implementation is available here. I haven't opened a pull request because it is still not exactly the same as LlamaTokenizer, but I hope people can use it as a starting point.
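To make the character coverage and byte fallback ideas concrete, here is a minimal sketch of how the initial symbols could be built before any BPE merges. It is not the code from the linked implementation: the function names `covered_chars` and `to_symbols`, the `coverage` parameter, and the exact cutoff rule are assumptions chosen for illustration. The `<0xXX>` spelling follows the byte-piece convention used by sentencepiece.

```python
from collections import Counter


def covered_chars(text, coverage=0.9995):
    """Hypothetical helper: return the most frequent characters whose
    counts together reach `coverage` of all character occurrences."""
    counts = Counter(text)
    total = sum(counts.values())
    kept, running = set(), 0
    for ch, n in counts.most_common():
        if running >= coverage * total:
            break  # the remaining, rarer characters are left uncovered
        kept.add(ch)
        running += n
    return kept


def to_symbols(text, kept):
    """Hypothetical helper: covered characters stay as single symbols;
    uncovered ones fall back to one <0xXX> symbol per UTF-8 byte."""
    symbols = []
    for ch in text:
        if ch in kept:
            symbols.append(ch)
        else:
            symbols.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return symbols


if __name__ == "__main__":
    corpus = "aaaaaaaa" + "bb" + "€"
    kept = covered_chars(corpus, coverage=0.9)
    print(sorted(kept))
    print(to_symbols("ab€", kept))
```

With this toy corpus and `coverage=0.9`, the rare character `€` is left out of the covered set, so it is emitted as its three UTF-8 bytes instead of as a single symbol. BPE merges would then run over these symbols in the usual minbpe fashion.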

Please refer to the README.md (point 6) for details on the new functionality and the caveats/TODOs.

Best, Haris