Extract & modify the merge rules from the .model file of a SentencePiece BPE model

google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

Apache License 2.0

Hi @taku910

As I understand it, a SentencePiece Unigram model tokenizes using the vocabulary and the per-piece scores, while a SentencePiece BPE model tokenizes using both the vocabulary and merge rules.

Obtaining and modifying the vocabulary of a SentencePiece BPE model is well documented online, but I could not find anything comparable for the merge rules.

I have read #444, so to avoid confusion: I am specifically asking whether there is a way to extract and modify the merge rules from the .model file of a SentencePiece BPE model.
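To make the question concrete, here is a rough sketch of the kind of extraction I have in mind. It rests on assumptions I have not confirmed: that the .model file stores only pieces and scores (read here through `sentencepiece_model_pb2.ModelProto`), that a higher score means an earlier merge, and that a merge pair can be inferred whenever both halves of a piece are themselves in the vocabulary. The function names and the `tokenizer.model` path are hypothetical, and the result is a heuristic reconstruction, not necessarily the merge list used in training.

```python
from typing import Dict, List, Tuple


def load_scores(model_path: str) -> Dict[str, float]:
    """Read piece -> score from a .model file (requires the sentencepiece package)."""
    # Imported lazily so the rest of the sketch runs without sentencepiece installed.
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2

    proto = sp_pb2.ModelProto()
    with open(model_path, "rb") as f:
        proto.ParseFromString(f.read())
    return {p.piece: p.score for p in proto.pieces}


def reconstruct_merges(scores: Dict[str, float]) -> List[Tuple[str, str]]:
    """Heuristically infer merge pairs from piece scores.

    Assumption: higher score = earlier merge. Every split of a piece whose
    two halves are both in the vocabulary is reported as a candidate merge.
    """
    merges: List[Tuple[str, str]] = []
    for piece in sorted(scores, key=lambda p: -scores[p]):
        for i in range(1, len(piece)):
            left, right = piece[:i], piece[i:]
            if left in scores and right in scores:
                merges.append((left, right))
    return merges


# Toy vocabulary standing in for load_scores("tokenizer.model") (hypothetical path).
toy_scores = {"ab": 0.0, "abc": -1.0, "a": -2.0, "b": -3.0, "c": -4.0}
merges = reconstruct_merges(toy_scores)
print(merges)  # [('a', 'b'), ('ab', 'c')]
```

If this is roughly how it works, I would expect that modifying the merge rules means editing the pieces and scores in the proto and writing it back out, but I would appreciate confirmation.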

Thank you!

Extract & modify the merge rules from the .model file of a SentencePiece BPE model #958