karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License
9.19k stars 866 forks

BPE in Haskell #79

Open BobMcDear opened 5 months ago

BobMcDear commented 5 months ago

Hello,

I have written a Haskell port of minbpe, minbpe-hs, that provides the same functionality as minbpe except for GPT4Tokenizer. Because BPE is inherently recursive (find the most frequent adjacent pair, merge it, repeat on the result), it can be expressed quite naturally in a functional language, and I hope those who are struggling to understand how the algorithm works can benefit from studying a Haskell implementation.

The Wikipedia example can be reproduced using minbpe-hs as follows.

{-# LANGUAGE OverloadedStrings #-}

import BPE.Base
import BPE.Basic

main :: IO ()
main = do
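    -- Vocabulary size: the 256 byte tokens plus 3 merges.
    let (merges, vocab) = trainTokenizer (256 + 3) "aaabdaaabac"
    -- Encode the training text with the learned merges.
    print $ encode merges "aaabdaaabac"
    -- Decode a list of token IDs back into text.
    print $ decode vocab [258, 100, 258, 97, 99]
    -- Save the merges and vocabulary under the name "toy".
    saveMergesAndVocab "toy" merges vocab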
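
For anyone who wants to see the recursion itself before opening the repository, here is a minimal, standalone sketch of BPE training. It is not the code from minbpe-hs: the names `toTokens`, `pairCounts`, `mergePair`, and `train` are hypothetical, it depends only on `base`, it treats the input as ASCII rather than UTF-8 bytes, and the way ties between equally frequent pairs are broken is an arbitrary choice of this sketch.

```haskell
import Data.List (group, maximumBy, sort)
import Data.Ord (comparing)

type Token = Int
type Pair = (Token, Token)

-- Assumption: ASCII-only input, so each character maps to one byte value.
toTokens :: String -> [Token]
toTokens = map fromEnum

-- Count every adjacent pair of tokens (overlapping pairs included).
pairCounts :: [Token] -> [(Pair, Int)]
pairCounts ts = [(p, length g) | g@(p : _) <- group (sort (zip ts (drop 1 ts)))]

-- Replace each non-overlapping occurrence of the pair, left to right,
-- with the new token.
mergePair :: Pair -> Token -> [Token] -> [Token]
mergePair (a, b) new (x : y : rest)
  | x == a && y == b = new : mergePair (a, b) new rest
mergePair p new (x : rest) = x : mergePair p new rest
mergePair _ _ [] = []

-- Perform up to the given number of merges, assigning new token IDs
-- from 256 upward. Returns the merges in the order they were learned
-- and the fully merged token sequence.
train :: Int -> [Token] -> ([(Pair, Token)], [Token])
train = go 256
  where
    go _ 0 ts = ([], ts)
    go next n ts = case pairCounts ts of
      [] -> ([], ts) -- fewer than two tokens left, nothing to merge
      counts ->
        let (best, _) = maximumBy (comparing snd) counts
            (merges, final) = go (next + 1) (n - 1) (mergePair best next ts)
         in ((best, next) : merges, final)
```

Loading this in GHCi and evaluating `snd (train 3 (toTokens "aaabdaaabac"))` yields `[258,100,258,97,99]`, the same token IDs that are passed to `decode` in the example above.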

Would it be all right if I submit a PR to add this to the list of community extensions?

Thank you, Borna