The sampling score relies on the per-token scores (log probabilities) stored in the vocab file.
Given one possible segmentation W = w1, w2, ..., wn, the generation probability of W is computed as P(W) = exp(\sum_k logprob(w_k)). We sample a sequence W with probability proportional to P(W).
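As a concrete illustration, here is a minimal sketch using the Python API. The model path `unigram.model` is a placeholder (any model trained with `--model_type=unigram` works); for unigram models, the score stored per piece in the vocab is that piece's log probability.

```python
import math
import sentencepiece as spm

# "unigram.model" is a placeholder path; use any unigram SentencePiece model.
sp = spm.SentencePieceProcessor(model_file="unigram.model")

def segmentation_logprob(pieces):
    # For unigram models the score stored in the vocab is the token's logprob,
    # so log P(W) = sum_k logprob(w_k).
    return sum(sp.get_score(sp.piece_to_id(p)) for p in pieces)

pieces = sp.encode("Hello world", out_type=str)  # one possible segmentation W
log_p = segmentation_logprob(pieces)
print(pieces, log_p, math.exp(log_p))            # P(W) = exp(sum_k logprob(w_k))
```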
There are several sampling modes (e.g., nbest sampling, include-best, sampling without replacement), but all of them build on the forward-filtering-and-backward-sampling (FFBS) algorithm.
This article is useful for understanding the FFBS algorithm: https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15
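For readers who want the mechanics without the article, here is a self-contained sketch of FFBS for a unigram model. Everything in it (the `logprob` callback, the `max_len` cap, the toy vocabulary) is illustrative, not SentencePiece's actual implementation.

```python
import math
import random

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ffbs_sample(text, logprob, max_len=16):
    """Sample one segmentation of `text` with probability proportional to
    P(W) = exp(sum_k logprob(w_k)).

    Forward filtering: alpha[i] = log of the total probability of all
    segmentations of text[:i]. Backward sampling: starting from the end,
    pick each final piece text[j:i] with probability
    exp(alpha[j] + logprob(text[j:i]) - alpha[i]).
    """
    n = len(text)
    alpha = [-math.inf] * (n + 1)
    alpha[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            lp = logprob(text[j:i])
            if lp > -math.inf and alpha[j] > -math.inf:
                alpha[i] = logaddexp(alpha[i], alpha[j] + lp)
    if alpha[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, i = [], n
    while i > 0:
        starts, weights = [], []
        for j in range(max(0, i - max_len), i):
            lp = logprob(text[j:i])
            if lp > -math.inf and alpha[j] > -math.inf:
                starts.append(j)
                weights.append(math.exp(alpha[j] + lp - alpha[i]))
        j = random.choices(starts, weights=weights)[0]
        pieces.append(text[j:i])
        i = j
    return list(reversed(pieces))

# Toy vocabulary: "ab" has probability 0.3, the split "a"+"b" has 0.4*0.3 = 0.12,
# so samples come out ["ab"] about 71% of the time and ["a", "b"] about 29%.
vocab = {"a": math.log(0.4), "b": math.log(0.3), "ab": math.log(0.3)}
print(ffbs_sample("ab", lambda s: vocab.get(s, -math.inf)))
```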
By the way, the part where "a token is removed from the corpus" belongs to the algorithm that trains SentencePiece; we don't use it at inference time.
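To make that training-time score concrete: the paper's pruning criterion ranks each piece by how much the corpus likelihood drops when the piece is removed and its occurrences are re-segmented with the remaining vocabulary. The toy sketch below approximates this with Viterbi (best-path) likelihoods rather than the marginal likelihood the real trainer uses; it is an illustration, not SentencePiece's trainer code.

```python
import math

def viterbi_logprob(text, vocab, max_len=16):
    """Log-probability of the single best segmentation under a unigram vocab."""
    n = len(text)
    best = [-math.inf] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            lp = vocab.get(text[j:i], -math.inf)
            if best[j] + lp > best[i]:
                best[i] = best[j] + lp
    return best[n]

def pruning_losses(corpus, vocab):
    """loss(piece) ~ drop in total best-path log-likelihood if `piece` is removed."""
    base = sum(viterbi_logprob(s, vocab) for s in corpus)
    losses = {}
    for piece in vocab:
        if len(piece) == 1:
            continue  # single characters are kept to guarantee full coverage
        reduced = {k: v for k, v in vocab.items() if k != piece}
        losses[piece] = base - sum(viterbi_logprob(s, reduced) for s in corpus)
    return losses

corpus = ["abab", "ab"]
vocab = {"a": -1.0, "b": -1.0, "ab": -1.0}
print(pruning_losses(corpus, vocab))  # {'ab': 3.0}: removing "ab" costs 3 nats
```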
How does subword regularization calculate the sampling score returned by the function `sample_encode_and_score`? Does the sentence sampling score rely on the token scores recorded in the vocabulary file? If the score of a token is its log probability under the unigram model, how does the model compute the sentence sampling score? I also saw in the original paper that tokens are sorted according to the loss in likelihood incurred when a token is removed from the corpus. I thought this loss was a different score. Where can I see it?
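For completeness, a minimal call to the function in question. The parameter names follow the Python binding's `SampleEncodeAndScore` as I understand it (`num_samples`, `alpha`, `wor` = without replacement, `include_best`), but defaults and the exact return type vary between versions, so check `help(sp.sample_encode_and_score)` on your install.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="unigram.model")  # placeholder path

# Draw 5 segmentations without replacement, forcing the best one to be included.
# Each sample comes back with its score, i.e. the log P(W) discussed above.
samples = sp.sample_encode_and_score(
    "Hello world", out_type=str, num_samples=5, alpha=0.5,
    wor=True, include_best=True)
for pieces, score in samples:
    print(score, pieces)
```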