google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Extract & modify the merge rules from the .model file of a SentencePiece BPE model #958

Closed by kitkhai 10 months ago

kitkhai commented 10 months ago

Hi @taku910

To my understanding, a SentencePiece Unigram model uses the vocabulary and the per-piece scores to tokenize, while a SentencePiece BPE model uses both the vocabulary and merge rules to tokenize.

Obtaining and modifying the vocabulary of a SentencePiece BPE model is well documented online, but I could not find the same for the merge rules.

I have read #444, so to avoid confusion, I want to clarify that I am asking specifically whether there is a way to extract and modify the merge rules from the .model file of a SentencePiece BPE model.

Thank you!

taku910 commented 10 months ago

The SentencePiece library does not directly or officially support expanding or merging vocabularies (no such API or documentation is provided). However, since model files are protobufs and the .proto file contains enough comments on each message, we can edit the model file manually to add vocabulary, at our own risk.
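
For anyone landing here, a rough sketch of what manual inspection could look like in Python. It assumes the `sentencepiece` pip package, which ships the generated `sentencepiece_model_pb2` module, and it assumes (as far as I can tell from the .proto) that a BPE .model file holds no explicit merge list, only pieces with scores. The `candidate_merges` helper is hypothetical and purely a heuristic: it treats every split of a piece into two in-vocabulary pieces as a possible merge and orders them by the piece's score, which may not match what the encoder actually does.

```python
def load_pieces(model_path):
    """Read (piece, score) pairs from a SentencePiece .model file."""
    # Imported lazily so the helper below also works without sentencepiece.
    from sentencepiece import sentencepiece_model_pb2 as model_pb2

    proto = model_pb2.ModelProto()
    with open(model_path, "rb") as f:
        proto.ParseFromString(f.read())
    return [(p.piece, p.score) for p in proto.pieces]


def candidate_merges(pieces):
    """Heuristically list (left, right) pairs that could form each piece.

    Assumption: a higher score means the piece was merged earlier.
    """
    scores = dict(pieces)
    merges = []
    for piece, score in pieces:
        # Try every split point; keep splits whose halves are both pieces.
        for i in range(1, len(piece)):
            left, right = piece[:i], piece[i:]
            if left in scores and right in scores:
                merges.append((score, left, right))
    # Highest score first; the stable sort keeps vocabulary order on ties.
    merges.sort(key=lambda m: -m[0])
    return [(left, right) for _, left, right in merges]


if __name__ == "__main__":
    # Toy vocabulary standing in for load_pieces("your.model").
    toy = [("a", -10.0), ("b", -11.0), ("c", -12.0), ("ab", 0.0), ("abc", -1.0)]
    print(candidate_merges(toy))
```

To change anything, one would edit `proto.pieces` (piece text or score) and write `proto.SerializeToString()` back to a new file. This is unsupported, so keep a copy of the original model and check the tokenization output afterwards.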