Hyperparameter arguments in Python wrapper

google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

Apache License 2.0

10.27k stars, 1.18k forks

Thank you for the feedback.

I think SampleEncode() is part of the C++ API and is an alias of SampleEncodeAsPieces.

Here's the difference:

EncodeAsPieces: Splits the sentence into tokens (pieces). The 1-best path (Viterbi path) is returned.
EncodeAsIds: Splits the sentence into an id sequence (the 1-best path, as ids).
SampleEncodeAsPieces: Given multiple segmentation candidates, samples one segmentation with the unigram language model.
SampleEncodeAsIds: Same as SampleEncodeAsPieces, but the output is an id sequence.
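The four calls can be compared side by side. This is a minimal sketch, assuming `sp` is an already loaded `sentencepiece.SentencePieceProcessor` from the Python wrapper and that the sampling methods take `(text, nbest_size, alpha)` positionally. The helper name `compare_encodings` and its default values are hypothetical and only serve the illustration.

```python
def compare_encodings(sp, text, nbest_size=-1, alpha=0.1):
    """Return the 1-best and sampled segmentations of `text`.

    Assumption: `sp` is a loaded sentencepiece.SentencePieceProcessor
    (or any object exposing the same four methods).
    """
    return {
        # Deterministic: the Viterbi (1-best) path, as pieces and as ids.
        "best_pieces": sp.EncodeAsPieces(text),
        "best_ids": sp.EncodeAsIds(text),
        # Stochastic: each call draws its own sample, so the sampled
        # pieces and the sampled ids need not describe the same split.
        "sampled_pieces": sp.SampleEncodeAsPieces(text, nbest_size, alpha),
        "sampled_ids": sp.SampleEncodeAsIds(text, nbest_size, alpha),
    }
```

With the real library, `sp` would typically be created with `sp = sentencepiece.SentencePieceProcessor()` followed by `sp.Load("m.model")`, where `m.model` is a placeholder path to a trained model.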

For sampling, we have to set the hyperparameters l and alpha, which control the smoothness of the sampling from the unigram language model. For more details, please take a look at https://arxiv.org/abs/1804.10959
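To make the roles of l and alpha concrete, here is a toy sketch of the sampling rule as I understand it from the linked paper: keep the l best segmentations and draw one with probability proportional to P(x)^alpha. This is not the library's implementation; the function name `sample_segmentation`, the candidate list, and the log-probabilities are all made up for illustration.

```python
import math
import random

def sample_segmentation(candidates, alpha, l=None, rng=random):
    """Sample one segmentation from scored candidates.

    candidates: list of (pieces, log_prob) pairs under a unigram LM.
    l: keep only the l highest-scoring candidates (None keeps all).
    alpha: smoothing exponent; sampling weights are P(x)**alpha.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if l is not None:
        ranked = ranked[:l]
    # Shift by the best log-prob for numerical stability; the shift
    # cancels out when the weights are normalized.
    top = ranked[0][1]
    weights = [math.exp(alpha * (log_p - top)) for _, log_p in ranked]
    pieces = [p for p, _ in ranked]
    return rng.choices(pieces, weights=weights, k=1)[0]

# Hypothetical candidates for the word "hello" (scores are invented).
toy = [(["hello"], -2.0), (["hel", "lo"], -3.5), (["he", "llo"], -4.0)]
print(sample_segmentation(toy, alpha=0.1))
```

In this sketch, l=1 always returns the Viterbi path, alpha=0 samples uniformly among the kept candidates, and a larger alpha concentrates the samples on the best segmentation.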

Hyperparameter arguments in Python wrapper #211