Allakazan opened 1 month ago
Feed your dataset.txt to the spm_train command (docs here: https://github.com/google/sentencepiece). This will generate a "vocab file"; use it to initialize the SentencePieceTokenizer:
SentencePieceTokenizer::load("model.vocab")
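If it helps, a minimal training run might look like this (the `m` prefix and vocab size are just illustrative values; see the sentencepiece README for the full option list):

```shell
# Train a SentencePiece model on the raw text corpus.
# This writes m.model and m.vocab; the .vocab file is the
# one passed to SentencePieceTokenizer::load() above.
spm_train --input=dataset.txt --model_prefix=m --vocab_size=8000
```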
Thanks, it works :)
Now I will try to implement some sort of EOS token in the tokenizer, to build a question/answer model.
Sounds good! Let me know if you got good results!
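One note on the EOS idea: if I remember right, spm_train already reserves an EOS piece (`</s>`) by default, and the special-token ids can be pinned explicitly at training time. A sketch, with the values being sentencepiece's defaults:

```shell
# Pin the special-token ids so the downstream model can rely on them
# (unk=0, bos=1, eos=2 are the sentencepiece defaults).
spm_train --input=dataset.txt --model_prefix=m --vocab_size=8000 \
  --unk_id=0 --bos_id=1 --eos_id=2
```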
Could you guys provide a description of how to use this tokenizer? I tried by myself but I couldn't figure out how to make it work.
Thanks :)