google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

logprobs in the vocabulary file do not match the values computed from the tokenized training document #1050

Closed: pnugues closed this issue 1 month ago

pnugues commented 2 months ago

I trained a unigram model on botchan.txt following the documentation examples. I then reapplied this model to the training text and estimated new logprobs by counting the tokens.

These logprobs do not exactly match the ones in the vocabulary file, and the token ranking is not the same. I cannot explain why.

I used this command to create the model: spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=1000 --eos_id=-1 --bos_id=-1')

This produced a model vocabulary with these values:

<unk>   0
,   -3.40684
.   -3.54053
▁the    -3.54218
▁   -3.61926
s   -3.65378
▁I  -3.88789
▁to -4.0266
t   -4.09847
...

I tokenized botchan.txt with sp.encode(corpus_raw, out_type=str) and computed the token logprobs from the tokenized text:

',': -3.421150474912426,
'.': -3.5544435623622155,
'▁the': -3.572524455004845,
's': -3.7199046282648927,
'▁I': -3.9171822762124973,
'▁': -3.9276478082521584,
'▁to': -4.038808303148048,
'ed': -4.096767853187577,
...

The values are close but not identical (see '▁', for example), and the order is different.
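The counting itself is simple; a minimal sketch of this count-based estimate (assuming botchan.txt is read as one string, as above):

import math
from collections import Counter

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')
corpus_raw = open('botchan.txt', encoding='utf-8').read()

# One-best (Viterbi) segmentation, then relative frequencies.
pieces = sp.encode(corpus_raw, out_type=str)
counts = Counter(pieces)
total = sum(counts.values())
logprobs = {p: math.log(c / total) for p, c in counts.items()}

for p, lp in sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)[:8]:
    print(p, lp)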

Does anyone have an explanation?

taku910 commented 1 month ago

If you just counted the tokens from the output of the encode method, the probabilities would be different.

In ULM training, the EM algorithm computes the marginal probabilities over all possible tokenizations. sp.encode() performs only Viterbi (one-best) decoding, which does not consider all possible tokenizations.
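To illustrate, the marginal (expected) piece counts can be approximated outside the trainer with a forward-backward pass over the segmentation lattice. A sketch, assuming a trained m.model and a single sentence of input; the crude ' ' -> '▁' replacement below only approximates the internal normalizer:

import math
from collections import defaultdict

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')
NEG_INF = float('-inf')

def logaddexp(a, b):
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def expected_piece_counts(text, max_len=20):
    s = '▁' + text.replace(' ', '▁')  # crude stand-in for the normalizer
    n = len(s)
    # All lattice arcs (start, end, piece id, score) whose substring is in the vocab.
    arcs = []
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            pid = sp.piece_to_id(s[i:j])
            if pid != sp.unk_id():  # this sketch skips OOV pieces
                arcs.append((i, j, pid, sp.get_score(pid)))
    # Forward: alpha[i] = log total weight of all segmentations of s[:i].
    alpha = [NEG_INF] * (n + 1)
    alpha[0] = 0.0
    for i, j, pid, sc in sorted(arcs):
        alpha[j] = logaddexp(alpha[j], alpha[i] + sc)
    # Backward: beta[i] = log total weight of all segmentations of s[i:].
    beta = [NEG_INF] * (n + 1)
    beta[n] = 0.0
    for i, j, pid, sc in sorted(arcs, reverse=True):
        beta[i] = logaddexp(beta[i], sc + beta[j])
    counts = defaultdict(float)
    if alpha[n] == NEG_INF:  # text not segmentable with in-vocab pieces
        return counts
    # Posterior expected count of each arc, accumulated per piece.
    for i, j, pid, sc in arcs:
        counts[sp.id_to_piece(pid)] += math.exp(alpha[i] + sc + beta[j] - alpha[n])
    return counts

Normalizing these expected counts gives the marginal probabilities used during training; the trainer alternates this E-step with an M-step and pruning, so the scores in m.vocab reflect marginal counts rather than one-best counts.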

pnugues commented 1 month ago

Dear Mr. Kudo,

First, thank you for your answer and your time.

I read your paper again, as well as your answer, and I tried to modify my implementation to reproduce the log-probabilities in the m.vocab file. I still cannot reproduce them. I am writing this follow-up to understand where I am wrong.

Here are the results of my three experiments:

1/ I first trained a model with the options '--input=botchan.txt --model_prefix=m --vocab_size=1000 --eos_id=-1 --bos_id=-1'. The log-probabilities of the first nine subwords (after <unk>) in the m.vocab file are: {'<unk>': 0.0, ',': -3.40684, '.': -3.54053, '▁the': -3.54218, '▁': -3.61926, 's': -3.65378, '▁I': -3.88789, '▁to': -4.0266, 't': -4.09847, '▁a': -4.10513, ...

2/ I then segmented the original botchan.txt file with this model and sp.encode, and computed the log-probabilities: {',': -3.42115, '.': -3.55444, '▁the': -3.57252, 's': -3.71990, '▁I': -3.9172, '▁': -3.92765, '▁to': -4.03881, 'ed': -4.09677, '▁a': -4.11463, ... As you wrote, sp.encode uses one-best decoding, and this explains why the figures do not match those in the model.

3/ Following your message, I segmented the original botchan.txt file again with sp.nbest_encode_as_pieces(corpus_raw, 10). I could not generate all possible segmentations, but I believed that an n-best list of 10 would bring the log-probabilities closer to those of your model. I obtained the figures below: {',': -3.42116, '.': -3.55445, '▁the': -3.57254, 's': -3.72037, '▁I': -3.91719, '▁': -3.92766, '▁to': -4.03882, 'ed': -4.09612, '▁a': -4.11598, ...
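In code, this kind of n-best counting looks roughly as follows; weighting each hypothesis by the softmax of its total piece score is one option (counting all hypotheses uniformly is another), and the score of a unigram segmentation is the sum of its piece log-probs:

import math
from collections import defaultdict

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

def nbest_weighted_counts(text, n=10):
    # Up to n best segmentations (applying this per sentence is more
    # meaningful than over the whole file, but both run).
    hyps = sp.nbest_encode_as_pieces(text, n)
    scores = [sum(sp.get_score(sp.piece_to_id(p)) for p in h) for h in hyps]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    counts = defaultdict(float)
    for h, w in zip(hyps, weights):
        for p in h:
            counts[p] += w / z
    return counts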

The figures of the third experiment are very close to those of the second, and the ranking is the same. I also tried to use powers and the digamma function. This changes the values somewhat, but it does not modify the ranking.
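For reference, if I read the trainer source correctly, the digamma variant corresponds to its Bayesianified M-step, where the plain log is replaced by the digamma function; a sketch assuming scipy is available:

from scipy.special import digamma

def bayesian_m_step(expected_counts):
    # Replaces log with digamma: logprob_i = digamma(c_i) - digamma(sum_c).
    total = sum(expected_counts.values())
    return {p: float(digamma(c) - digamma(total))
            for p, c in expected_counts.items()}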

So my question is: is there a way to reproduce the model log-probabilities at the decoding stage, and if so, how?

Thank you very much again.

Kindest regards, Pierre Nugues
