google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

How does subword regularization calculate the sampling score returned by the function "sample_encode_and_score"? #884

Closed: lsy641 closed this issue 1 year ago

lsy641 commented 1 year ago

How does subword regularization compute the sampling score returned by the function "sample_encode_and_score"? Does the sentence sampling score rely on the token scores recorded in the vocabulary file? If a token's score is its log probability under the unigram model, how does the model compute the sentence-level sampling score? Also, in the original paper the tokens are sorted by the loss in likelihood incurred when a token is removed from the corpus. I thought that loss was a different kind of score. Where can I see it?

taku910 commented 1 year ago

The sampling score relies on the scores (log probabilities) stored in the vocab file.

Given one possible segmentation W = w1, w2, ..., wn, the generation probability of W is computed as P(W) = exp(\sum_k logprob(w_k)). We sample segmentations W in proportion to P(W).
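As a minimal sketch of that formula (the log probabilities below are made up, standing in for the scores stored in a unigram model's .vocab file):

```python
import math

# Made-up per-token log probabilities, as would be stored in the vocab file.
logprob = {"▁Hel": -4.1, "lo": -3.2, "▁Hello": -5.0}

def segmentation_score(pieces):
    # log P(W) = sum_k logprob(w_k); the sampling score of a segmentation
    # is this sum, i.e. the score is reported in log space.
    return sum(logprob[p] for p in pieces)

print(segmentation_score(["▁Hel", "lo"]))        # -7.3
print(segmentation_score(["▁Hello"]))            # -5.0
print(math.exp(segmentation_score(["▁Hello"])))  # P(W) itself
```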

There are several sampling modes (e.g., n-best sampling, include-best, sampling without replacement), but we use the forward-filtering-and-backward-sampling (FFBS) algorithm as the basic algorithm; see the usage sketch below.
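For reference, a usage sketch of the Python binding. The exact keyword names (`num_samples`, `alpha`, `wor`, `include_best`, `out_type`) are my recollection and may differ across versions, and `m.model` is a placeholder path:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")  # placeholder path

# Draw several segmentations together with their scores.
for pieces, score in sp.sample_encode_and_score(
        "Hello world",
        num_samples=5,       # number of samples to draw
        alpha=0.1,           # smoothing / inverse-temperature parameter
        wor=True,            # sample without replacement
        include_best=True,   # also include the best segmentation
        out_type=str):
    print(score, pieces)
```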

This article is useful for understanding the FFBS algorithm: https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15
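To make the algorithm concrete, here is a small illustrative FFBS sketch over a toy vocabulary, not the library's C++ implementation; `theta` plays the role of the inverse-temperature (the `alpha` parameter in the library):

```python
import math
import random

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ffbs_sample(text, logprob, theta=1.0, rng=random):
    """Sample one segmentation of `text` in proportion to P(W)**theta."""
    n = len(text)
    NEG_INF = float("-inf")
    # Forward filtering: alpha[i] is the log of the total (tempered)
    # probability mass of all segmentations of text[:i].
    alpha = [NEG_INF] * (n + 1)
    alpha[0] = 0.0
    for i in range(1, n + 1):
        cands = [alpha[j] + theta * logprob[text[j:i]]
                 for j in range(i)
                 if text[j:i] in logprob and alpha[j] > NEG_INF]
        if cands:
            alpha[i] = logsumexp(cands)
    if alpha[n] == NEG_INF:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Backward sampling: walk from the end of the string, choosing each
    # boundary j with probability exp(alpha[j] + theta*logprob - alpha[i]).
    pieces, i = [], n
    while i > 0:
        starts = [j for j in range(i)
                  if text[j:i] in logprob and alpha[j] > NEG_INF]
        logits = [alpha[j] + theta * logprob[text[j:i]] for j in starts]
        z = logsumexp(logits)
        r, acc, pick = rng.random(), 0.0, starts[-1]
        for j, lg in zip(starts, logits):
            acc += math.exp(lg - z)
            if r <= acc:
                pick = j
                break
        pieces.append(text[pick:i])
        i = pick
    return pieces[::-1]

# Toy vocabulary with made-up log probabilities.
vocab = {"a": -2.0, "b": -2.5, "ab": -3.0, "abc": -4.5, "c": -2.2, "bc": -3.3}
print(ffbs_sample("abc", vocab))  # e.g. ['abc'], ['ab', 'c'], ['a', 'bc'], ...
```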

By the way, the part about "a token is removed from the corpus" describes the algorithm used to train SentencePiece. We don't use it at inference time.
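For completeness, a rough sketch of that training-time pruning criterion. This uses a Viterbi (best-segmentation) approximation rather than the paper's EM-based marginal likelihood, and `viterbi_logprob` / `pruning_loss` are illustrative helpers, not library APIs:

```python
def viterbi_logprob(text, logprob):
    # Log-probability of the single best segmentation of `text`.
    NEG_INF = float("-inf")
    best = [NEG_INF] * (len(text) + 1)
    best[0] = 0.0
    for i in range(1, len(text) + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in logprob and best[j] > NEG_INF:
                best[i] = max(best[i], best[j] + logprob[piece])
    return best[-1]

def pruning_loss(piece, corpus, logprob):
    # Drop in (approximate) corpus log-likelihood if `piece` is removed:
    # the key used to rank candidate pieces during vocabulary pruning.
    # If some sentence becomes unsegmentable, the loss is infinite and
    # the piece is never pruned.
    without = {k: v for k, v in logprob.items() if k != piece}
    return sum(viterbi_logprob(s, logprob) - viterbi_logprob(s, without)
               for s in corpus)
```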