How can I use the SentencePieceTokenizer?

keyvank / femtoGPT

Pure Rust implementation of a minimal Generative Pretrained Transformer

https://discord.gg/wTJFaDVn45

MIT License

Open Allakazan opened 1 month ago

Allakazan commented 1 month ago

Could you guys provide a description of how to use this tokenizer? I tried by myself but I couldn't figure out how to make it work.

Thanks :)

keyvank commented 1 month ago

Feed your dataset.txt to the spm_train command. (Docs here: https://github.com/google/sentencepiece)

This will generate a "vocab file". Use it to initialize the SentencePieceTokenizer:

SentencePieceTokenizer::load("model.vocab")
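
For reference, spm_train writes two files named after its --model_prefix flag: <prefix>.model and <prefix>.vocab. The .vocab file is plain text with one piece per line, followed by a tab and a score, listed in id order. The sketch below only illustrates that format in Rust. parse_vocab is a hypothetical helper written for this example and is not femtoGPT's implementation of SentencePieceTokenizer::load; the file names and vocabulary size in the comment are assumptions too.

```rust
use std::collections::HashMap;

// Assumed training command (file names and vocab size are placeholders):
//   spm_train --input=dataset.txt --model_prefix=model --vocab_size=8000
// This would produce model.model and model.vocab.

/// Hypothetical helper (not part of femtoGPT): parse the text of a
/// SentencePiece `.vocab` file into a piece -> id map.
/// Each line has the form `<piece>\t<score>`; the id of a piece is its
/// zero-based line number.
fn parse_vocab(contents: &str) -> HashMap<String, usize> {
    contents
        .lines()
        .filter(|line| !line.is_empty())
        .enumerate()
        .map(|(id, line)| {
            // Keep only the piece; the score after the tab is ignored here.
            let piece = line.split('\t').next().unwrap_or(line);
            (piece.to_string(), id)
        })
        .collect()
}
```

With a real file, the contents could be read with std::fs::read_to_string("model.vocab") and passed to parse_vocab.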

Allakazan commented 1 month ago

Thanks, it works :)

Now I will try to implement some sort of EOS_TOKEN in the tokenizer, to build a question/answer model.
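
A rough sketch of what that could look like, assuming the samples are already tokenized into ids. By default SentencePiece reserves the piece </s> as its end-of-sentence token with id 2, so that id could serve as the EOS marker. Both functions below are hypothetical helpers for illustration only and do not exist in femtoGPT.

```rust
/// Hypothetical helper: join already-tokenized samples into one training
/// stream, appending `eos_id` after every sample so the model can learn
/// where an answer ends.
fn join_with_eos(samples: &[Vec<usize>], eos_id: usize) -> Vec<usize> {
    let mut stream = Vec::new();
    for sample in samples {
        stream.extend_from_slice(sample);
        stream.push(eos_id);
    }
    stream
}

/// Hypothetical helper: cut generated ids at the first `eos_id`, if any,
/// so decoding stops at the end of the answer.
fn truncate_at_eos(generated: &[usize], eos_id: usize) -> &[usize] {
    match generated.iter().position(|&id| id == eos_id) {
        Some(end) => &generated[..end],
        None => generated,
    }
}
```

The first helper would be applied when building the training data, the second on the ids produced at inference time before decoding them back to text.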

keyvank commented 1 month ago

Sounds good! Let me know if you get good results!