google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Any API for setting user-defined symbols? #991

Closed. zhangyuhanjc closed this issue 7 months ago

zhangyuhanjc commented 7 months ago

Is there any API that achieves the same effect as --user_defined_symbols=''? I can only find the function SetEncodeExtraOptions, which cannot set user_defined_symbols.

taku910 commented 7 months ago

Adding user_defined_symbols to a pre-trained model is not officially supported, but it is possible at your own risk, because the model is stored as a protobuf.

https://github.com/google/sentencepiece/issues/121
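
For reference, here is a minimal Python sketch of what editing the protobuf might look like. It is not an official API. It assumes that the installed `sentencepiece` wheel ships the `sentencepiece_model_pb2` module and that `protobuf` is available. The helper names `new_symbols` and `add_user_defined_symbols` are made up for illustration. Appended pieces receive new ids at the end of the vocabulary, so anything trained against the old vocabulary size would likely need adjusting.

```python
def new_symbols(existing_pieces, symbols):
    """Return the symbols not already in the vocabulary, in order, de-duplicated."""
    seen = set(existing_pieces)
    result = []
    for sym in symbols:
        if sym and sym not in seen:
            seen.add(sym)
            result.append(sym)
    return result


def add_user_defined_symbols(model_bytes, symbols):
    """Append `symbols` to a serialized model as USER_DEFINED pieces.

    Hypothetical helper: unsupported, use at your own risk.
    """
    # Imported lazily; assumes the wheel bundles the generated protobuf module.
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2

    proto = sp_pb2.ModelProto()
    proto.ParseFromString(model_bytes)

    existing = [p.piece for p in proto.pieces]
    for sym in new_symbols(existing, symbols):
        piece = proto.pieces.add()  # new piece gets the next free id
        piece.piece = sym
        piece.score = 0.0
        piece.type = sp_pb2.ModelProto.SentencePiece.USER_DEFINED

    return proto.SerializeToString()


if __name__ == "__main__":
    import sys

    # Usage: python add_symbols.py old.model new.model "<sep>" "<cls>"
    if len(sys.argv) >= 4:
        with open(sys.argv[1], "rb") as f:
            patched = add_user_defined_symbols(f.read(), sys.argv[3:])
        with open(sys.argv[2], "wb") as f:
            f.write(patched)
```

The patched file can then be loaded like any other model, for example with `sentencepiece.SentencePieceProcessor(model_file="new.model")`. It is worth encoding a few sample strings afterwards to confirm the new symbols come out as single pieces.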