google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

How does subword regularization calculate the sampling score returned by the function "sample_encode_and_score"? #884

Closed: lsy641 closed this issue 1 year ago

lsy641 commented 1 year ago

How does subword regularization compute the sampling score returned by the function "sample_encode_and_score"? Does the sentence sampling score rely on the token scores recorded in the vocabulary file? If a token's score is its log probability under the unigram model, how does the model compute the sentence-level sampling score? Also, in the original paper the tokens are sorted by the loss in likelihood incurred when a token is removed from the corpus. I thought that loss was a different kind of score. Where can I see it?

taku910 commented 1 year ago

The sampling score relies on the scores (log probabilities) stored in the vocab file.

Given one possible segmentation W = w1, w2, ..., wn, the generation probability of W is computed as P(W) = exp(\sum_k logprob(w_k)). We sample segmentations W in proportion to P(W).
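As a minimal sketch of that formula (the log probabilities below are made up, standing in for the scores stored in a unigram model's .vocab file):

```python
import math

# Made-up per-token log probabilities, as would be stored in the vocab file.
logprob = {"▁Hel": -4.1, "lo": -3.2, "▁Hello": -5.0}

def segmentation_score(pieces):
    # log P(W) = sum_k logprob(w_k); the sampling score of a segmentation
    # is this sum, i.e. the score is reported in log space.
    return sum(logprob[p] for p in pieces)

print(segmentation_score(["▁Hel", "lo"]))        # -7.3
print(segmentation_score(["▁Hello"]))            # -5.0
print(math.exp(segmentation_score(["▁Hello"])))  # P(W) itself
```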

There are several sampling modes (e.g., n-best sampling, include-best, sampling without replacement), but we use the forward-filtering-and-backward-sampling (FFBS) algorithm as the basic algorithm; see the usage sketch below.
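For reference, a usage sketch of the Python binding. The exact keyword names (`num_samples`, `alpha`, `wor`, `include_best`, `out_type`) are my recollection and may differ across versions, and `m.model` is a placeholder path:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")  # placeholder path

# Draw several segmentations together with their scores.
for pieces, score in sp.sample_encode_and_score(
        "Hello world",
        num_samples=5,       # number of samples to draw
        alpha=0.1,           # smoothing / inverse-temperature parameter
        wor=True,            # sample without replacement
        include_best=True,   # also include the best segmentation
        out_type=str):
    print(score, pieces)
```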

This article is useful for understanding the FFBS algorithm: https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15
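To make the algorithm concrete, here is a small illustrative FFBS sketch over a toy vocabulary, not the library's C++ implementation; `theta` plays the role of the inverse-temperature (the `alpha` parameter in the library):

```python
import math
import random

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ffbs_sample(text, logprob, theta=1.0, rng=random):
    """Sample one segmentation of `text` in proportion to P(W)**theta."""
    n = len(text)
    NEG_INF = float("-inf")
    # Forward filtering: alpha[i] is the log of the total (tempered)
    # probability mass of all segmentations of text[:i].
    alpha = [NEG_INF] * (n + 1)
    alpha[0] = 0.0
    for i in range(1, n + 1):
        cands = [alpha[j] + theta * logprob[text[j:i]]
                 for j in range(i)
                 if text[j:i] in logprob and alpha[j] > NEG_INF]
        if cands:
            alpha[i] = logsumexp(cands)
    if alpha[n] == NEG_INF:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Backward sampling: walk from the end of the string, choosing each
    # boundary j with probability exp(alpha[j] + theta*logprob - alpha[i]).
    pieces, i = [], n
    while i > 0:
        starts = [j for j in range(i)
                  if text[j:i] in logprob and alpha[j] > NEG_INF]
        logits = [alpha[j] + theta * logprob[text[j:i]] for j in starts]
        z = logsumexp(logits)
        r, acc, pick = rng.random(), 0.0, starts[-1]
        for j, lg in zip(starts, logits):
            acc += math.exp(lg - z)
            if r <= acc:
                pick = j
                break
        pieces.append(text[pick:i])
        i = pick
    return pieces[::-1]

# Toy vocabulary with made-up log probabilities.
vocab = {"a": -2.0, "b": -2.5, "ab": -3.0, "abc": -4.5, "c": -2.2, "bc": -3.3}
print(ffbs_sample("abc", vocab))  # e.g. ['abc'], ['ab', 'c'], ['a', 'bc'], ...
```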

By the way, the part about "a token is removed from the corpus" describes the algorithm used to train SentencePiece. We don't use it at inference time.
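For completeness, a rough sketch of that training-time pruning criterion. This uses a Viterbi (best-segmentation) approximation rather than the paper's EM-based marginal likelihood, and `viterbi_logprob` / `pruning_loss` are illustrative helpers, not library APIs:

```python
def viterbi_logprob(text, logprob):
    # Log-probability of the single best segmentation of `text`.
    NEG_INF = float("-inf")
    best = [NEG_INF] * (len(text) + 1)
    best[0] = 0.0
    for i in range(1, len(text) + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in logprob and best[j] > NEG_INF:
                best[i] = max(best[i], best[j] + logprob[piece])
    return best[-1]

def pruning_loss(piece, corpus, logprob):
    # Drop in (approximate) corpus log-likelihood if `piece` is removed:
    # the key used to rank candidate pieces during vocabulary pruning.
    # If some sentence becomes unsegmentable, the loss is infinite and
    # the piece is never pruned.
    without = {k: v for k, v in logprob.items() if k != piece}
    return sum(viterbi_logprob(s, logprob) - viterbi_logprob(s, without)
               for s in corpus)
```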