I have created a Haskell port of minbpe, minbpe-hs, that provides the same functionality as minbpe except for GPT4Tokenizer. Thanks to the inherently recursive structure of BPE, the algorithm can be expressed quite naturally in a functional language, and I hope that those who are struggling to understand how it works can benefit from studying the Haskell implementation.
The Wikipedia example can be reproduced using minbpe-hs as follows:

    {-# LANGUAGE OverloadedStrings #-}

    import BPE.Base
    import BPE.Basic

    main :: IO ()
    main = do
      -- Target vocabulary size: 256 byte tokens plus 3 merges.
      let (merges, vocab) = trainTokenizer (256 + 3) "aaabdaaabac"
      print $ encode merges "aaabdaaabac"
      print $ decode vocab [258, 100, 258, 97, 99]
      -- Persist the trained merges and vocabulary under the name "toy".
      saveMergesAndVocab "toy" merges vocab

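To illustrate what I mean by the recursive structure, here is a minimal standalone sketch of byte-level BPE training. It is not the actual minbpe-hs code: the names `pairCounts`, `mergePair`, `train`, and `expandTokens` are made up for this example, tokens are plain `Int`s, and ties between equally frequent pairs are broken by whatever `maximumBy` happens to return, which is an assumption of the sketch rather than a statement about minbpe.

```haskell
import Data.List (maximumBy)
import qualified Data.Map.Strict as Map
import Data.Ord (comparing)

type Token = Int
type Pair = (Token, Token)

-- Count how often each adjacent pair of tokens occurs.
pairCounts :: [Token] -> Map.Map Pair Int
pairCounts ts = Map.fromListWith (+) [(p, 1) | p <- zip ts (drop 1 ts)]

-- Replace every non-overlapping occurrence of the pair, left to right,
-- with the new token.
mergePair :: Pair -> Token -> [Token] -> [Token]
mergePair (a, b) new = go
  where
    go (x : y : rest)
      | x == a && y == b = new : go rest
      | otherwise = x : go (y : rest)
    go xs = xs

-- Train until the vocabulary reaches vocabSize or no pairs remain.
-- Returns the learned merges (in order) and the fully merged token list.
-- Each recursive call performs one merge and hands the shorter list on.
train :: Int -> [Token] -> ([(Pair, Token)], [Token])
train vocabSize = go 256
  where
    go next ts
      | next >= vocabSize || Map.null counts = ([], ts)
      | otherwise =
          let best = fst (maximumBy (comparing snd) (Map.toList counts))
              (merges, final) = go (next + 1) (mergePair best next ts)
           in ((best, next) : merges, final)
      where
        counts = pairCounts ts

-- Undo the merges: recursively expand each token back into base tokens.
expandTokens :: [(Pair, Token)] -> [Token] -> [Token]
expandTokens merges = concatMap expand
  where
    table = Map.fromList [(t, p) | (p, t) <- merges]
    expand t = case Map.lookup t table of
      Just (a, b) -> expand a ++ expand b
      Nothing -> [t]
```

In GHCi, `snd (train 259 (map fromEnum "aaabdaaabac"))` gives `[258,100,258,97,99]`, the same token IDs used in the `decode` call above, and `expandTokens` applied to the learned merges and that list recovers the original byte values.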
Would it be all right if I submit a PR to add this to the list of community extensions?
Thank you, Borna