google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Hyperparameter arguments in Python wrapper #211

Closed: desh2608 closed this issue 6 years ago

desh2608 commented 6 years ago

This is regarding the pip package.

After training the unigram model using sentencepiece.SentencePieceTrainer.Train(train_args), suppose I want to sample a subword segmentation for a sentence. I am confused about the following functions: SampleEncode(), EncodeAsPieces(), and SampleEncodeAsPieces(). What exactly does each of them do?

Also, can I pass arguments to specify the hyperparameters l (the number of best segmentations to use) and alpha (the smoothing parameter)?

taku910 commented 6 years ago

Thank you for the feedback.

I think SampleEncode() is part of the C++ API and is an alias of SampleEncodeAsPieces.

Here are the differences:

EncodeAsPieces: Splits the sentence into tokens (pieces). The 1-best path (Viterbi path) is returned.

EncodeAsIds: Splits the sentence into an id sequence.

SampleEncodeAsPieces: Given multiple segmentation candidates, samples one segmentation according to the unigram language model.

SampleEncodeAsIds: Same as SampleEncodeAsPieces, but the output is an id sequence.
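To make the comparison concrete, here is a minimal sketch of calling the four methods side by side from Python. It assumes `sp` is an already loaded `sentencepiece.SentencePieceProcessor` and that the sampling methods take the text followed by `nbest_size` (l) and `alpha` as positional arguments. The helper name `compare_encodings` and its default values are only illustrative and not part of the library.

```python
def compare_encodings(sp, text, nbest_size=-1, alpha=0.1, num_samples=3):
    """Collect deterministic and sampled segmentations of `text`.

    `sp` is assumed to be a loaded sentencepiece.SentencePieceProcessor.
    `compare_encodings` itself is a hypothetical helper for illustration.
    """
    return {
        # 1-best (Viterbi) segmentation: same result on every call.
        "best_pieces": sp.EncodeAsPieces(text),
        "best_ids": sp.EncodeAsIds(text),
        # Sampled segmentations: may differ from call to call.
        "sampled_pieces": [
            sp.SampleEncodeAsPieces(text, nbest_size, alpha)
            for _ in range(num_samples)
        ],
        "sampled_ids": [
            sp.SampleEncodeAsIds(text, nbest_size, alpha)
            for _ in range(num_samples)
        ],
    }
```

With a trained unigram model this could be used roughly as `sp = sentencepiece.SentencePieceProcessor()`, `sp.Load("m.model")`, then `compare_encodings(sp, "hello world")`, where `m.model` is a placeholder path. As far as I know, a negative `nbest_size` samples from all segmentation hypotheses instead of a fixed n-best list.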

For sampling, we have to set the hyperparameters l (the number of best segmentations to sample from) and alpha (which controls the smoothness of the distribution derived from the unigram language model). For more details, please take a look at https://arxiv.org/abs/1804.10959
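As a rough illustration of what l and alpha do, the paper samples one of the l-best segmentations with probability proportional to P(x)^alpha. The sketch below is plain Python and does not use sentencepiece. The candidate segmentations and their probabilities are made-up numbers, and `sample_segmentation` is a hypothetical helper, not the library's implementation.

```python
import random

# Made-up l-best list (l = 3) for the word "hello": (pieces, model probability).
candidates = [
    (["▁hello"], 0.6),
    (["▁hel", "lo"], 0.3),
    (["▁h", "ello"], 0.1),
]

def sample_segmentation(candidates, alpha, rng=random):
    """Sample one segmentation with probability P(x)^alpha / sum_j P(x_j)^alpha."""
    weights = [prob ** alpha for _, prob in candidates]
    total = sum(weights)
    probs = [w / total for w in weights]
    segmentations = [pieces for pieces, _ in candidates]
    chosen = rng.choices(segmentations, weights=probs, k=1)[0]
    return chosen, probs

# alpha = 0 flattens the distribution to uniform over the l candidates;
# a larger alpha concentrates the mass on the most probable segmentation.
_, uniform_probs = sample_segmentation(candidates, alpha=0.0)
_, peaked_probs = sample_segmentation(candidates, alpha=5.0)
```

So a small alpha gives more varied segmentations, while a large alpha behaves almost like the 1-best (Viterbi) output.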