Closed (kitkhai closed 10 months ago)
The SentencePiece library does not directly or officially support expanding or merging vocabularies (no such API or documentation is provided). However, since model files are protobufs and the .proto file contains enough comments on each message, we can edit the model file manually to add vocabulary, on an at-your-own-risk basis.
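For reference, the manual edit described in that quote could look roughly like the sketch below. It assumes the `sentencepiece` Python package is installed and relies on the `sentencepiece_model_pb2` module it ships; the helper names (`load_model_proto`, `append_piece`, `save_model_proto`) and the example paths are made up for illustration and are not part of any official API.

```python
def load_model_proto(path):
    """Parse a .model file into a ModelProto message."""
    # Imported lazily so the other helpers work without sentencepiece installed.
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2

    proto = sp_pb2.ModelProto()
    with open(path, "rb") as f:
        proto.ParseFromString(f.read())
    return proto


def append_piece(proto, piece, score):
    """Append one (piece, score) entry unless the piece already exists.

    Returns True if the piece was added, False if it was already present.
    """
    if any(p.piece == piece for p in proto.pieces):
        return False
    new_piece = proto.pieces.add()  # standard protobuf repeated-field API
    new_piece.piece = piece
    new_piece.score = score
    return True


def save_model_proto(proto, path):
    """Serialize the (possibly edited) ModelProto back to disk."""
    with open(path, "wb") as f:
        f.write(proto.SerializeToString())


# Hypothetical usage (paths and the new piece are placeholders):
# proto = load_model_proto("spm.model")
# append_piece(proto, "▁tokenizer", -1000.0)
# save_model_proto(proto, "spm_extended.model")
```

Writing to a new file rather than overwriting the original keeps the unmodified model around, which seems sensible given that this kind of edit is unsupported.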
Hi @taku910
To my understanding, a SentencePiece Unigram model tokenizes using the vocabulary and the piece scores, while a SentencePiece BPE model tokenizes using both the vocabulary and the merge rules.
Obtaining and modifying the vocabulary of a SentencePiece BPE model is well documented online, but I have not found the same for the merge rules.
I have read #444, so to avoid confusion I want to clarify that my question is specifically about the merge rules: is there a way to extract and modify them from the .model file of a SentencePiece BPE model?
Thank you!
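A possible starting point, offered as a sketch rather than an official answer: as far as I understand, the .model file of a SentencePiece BPE model has no separate list of merge rules. It stores only pieces and scores, and at encoding time the adjacent pair whose concatenation has the highest score is merged first. Under that assumption, an approximate merge list can be reconstructed by splitting every piece into two parts that are both in the vocabulary and ordering the results by score. The function names `pieces_from_model` and `reconstruct_merges` are hypothetical, and the code does not filter out control, unknown, or byte pieces.

```python
def pieces_from_model(path):
    """Read (piece, score) pairs from a .model file (requires sentencepiece)."""
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2

    proto = sp_pb2.ModelProto()
    with open(path, "rb") as f:
        proto.ParseFromString(f.read())
    return [(p.piece, p.score) for p in proto.pieces]


def reconstruct_merges(pieces):
    """Derive an ordered list of (left, right) merges from (piece, score) pairs.

    Assumption: a higher score means the merged piece has higher priority.
    A piece that can be split in several valid ways yields several candidates.
    """
    score = dict(pieces)
    candidates = []
    for merged, s in score.items():
        # Try every split point; keep splits whose halves are both known pieces.
        for i in range(1, len(merged)):
            left, right = merged[:i], merged[i:]
            if left in score and right in score:
                candidates.append((s, left, right))
    # Highest score first; ties keep their original order (stable sort).
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(left, right) for _, left, right in candidates]


# Toy vocabulary, not taken from a real model:
toy = [("a", -3.0), ("b", -4.0), ("c", -5.0), ("ab", 0.0), ("abc", -1.0)]
merges = reconstruct_merges(toy)
print(merges)  # [('a', 'b'), ('ab', 'c')]
```

If this reading of the format is right, "modifying the merge rules" would amount to changing piece scores (to reorder merges) or adding and removing pieces (to add and remove merges) in the protobuf, with the same at-your-own-risk caveat as any other manual edit of the model file.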