desh2608 closed this issue 6 years ago
Thank you for the feedback.
I think `SampleEncode()` is part of the C++ API, and is an alias of `SampleEncodeAsPieces`.
Here's the difference:
`EncodeAsPieces`: Splits the sentence into tokens (pieces). The 1-best path (Viterbi path) is returned.
`EncodeAsIds`: Splits the sentence into an id sequence.
`SampleEncodeAsPieces`: Given multiple segmentation candidates, samples one segmentation with the unigram language model.
`SampleEncodeAsIds`: Same as `SampleEncodeAsPieces`, but the output is an id sequence.
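As a rough sketch, the four calls above can be compared side by side from the Python wrapper. The helper name `compare_segmentations` is made up for illustration, and the defaults `nbest_size=-1` and `alpha=0.1` are only example values, not recommendations. It assumes `sp` is an already loaded `SentencePieceProcessor`.

```python
def compare_segmentations(sp, text, nbest_size=-1, alpha=0.1):
    """Run the deterministic and sampling encoders on `text`.

    `sp` is assumed to be a loaded sentencepiece.SentencePieceProcessor.
    `nbest_size` and `alpha` are the sampling hyperparameters; the default
    values here are illustrative only.
    """
    return {
        # Deterministic: 1-best (Viterbi) segmentation.
        "pieces": sp.EncodeAsPieces(text),
        "ids": sp.EncodeAsIds(text),
        # Stochastic: a different segmentation may come back on each call.
        "sampled_pieces": sp.SampleEncodeAsPieces(text, nbest_size, alpha),
        "sampled_ids": sp.SampleEncodeAsIds(text, nbest_size, alpha),
    }


# Possible usage, assuming a trained model file (the path is a placeholder):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor()
#   sp.Load("m.model")
#   print(compare_segmentations(sp, "This is a test"))
```

Passing the processor in as an argument keeps the helper independent of where the model file lives.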
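For sampling, we have to set the hyperparameters l and alpha: l is the number of best segmentation candidates to sample from, and alpha controls the smoothness of the unigram language model's distribution over them. For more details, please take a look at https://arxiv.org/abs/1804.10959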
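To make the role of the two hyperparameters concrete, here is a minimal pure-Python sketch of the sampling step as I understand it from the paper: each of the l best candidates is weighted by its probability raised to the power alpha, and one candidate is drawn from the normalized weights. The function names and the candidate probabilities below are made up for illustration; this is not the library's internal implementation.

```python
import random


def smoothed_distribution(candidates, alpha):
    """Return sampling probabilities proportional to P(x)^alpha.

    `candidates` is a list of (segmentation, probability) pairs, i.e. the
    l best segmentations. alpha = 0 gives a uniform distribution; larger
    alpha pushes the mass toward the most probable segmentation.
    """
    weights = [prob ** alpha for _, prob in candidates]
    total = sum(weights)
    return [w / total for w in weights]


def sample_segmentation(candidates, alpha, rng=random):
    """Draw one segmentation from the smoothed distribution."""
    segmentations = [seg for seg, _ in candidates]
    probs = smoothed_distribution(candidates, alpha)
    return rng.choices(segmentations, weights=probs, k=1)[0]


# Hypothetical 3-best list for the word "hello" (probabilities are invented).
nbest = [
    (["hello"], 0.6),
    (["hel", "lo"], 0.3),
    (["h", "ello"], 0.1),
]
print(smoothed_distribution(nbest, alpha=0.0))
print(sample_segmentation(nbest, alpha=0.5))
```

Here l corresponds to the length of the candidate list, so a longer list means more varied segmentations can be drawn.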
This is regarding the pip package.
After training the unigram model using `sentencepiece.SentencePieceTrainer.Train(train_args)`, suppose I want to sample a subword segmentation for a sentence. I am confused regarding the use of the following functions: `SampleEncode()`, `EncodeAsPieces()`, and `SampleEncodeAsPieces()`. What exactly does each of these do? Also, can I add arguments to specify the hyperparameters `l` (# of best segmentations to use) and `alpha` (smoothing parameter)?