google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Bug with BPE dropout? #569

Closed steremma closed 3 years ago

steremma commented 3 years ago

I am attempting to use the BPE-dropout feature from either the command line or the Python API. I show the Python example because, judging by spm_encode --help, spm_encode doesn't support BPE-dropout at all.

I'm using version 0.1.93.

I start by making my BPE model. There was no mention of dropout in spm_train --help so I assume we don't have to specify anything here.

spm_train --input=spm.corpus --model_prefix=spm --vocab_size=32000 --model_type=bpe --byte_fallback

We will now use this model to tokenise stochastically from Python. Looking at help(s.encode) I read "alpha: Soothing parameter for unigram sampling, and merge probability for BPE-dropout." The parameter seems to have no effect unless we also set enable_sampling=True. Without it, every value of alpha gives the same output, for example:

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> s.encode('New York', out_type=str, alpha=1.0)
['▁New', '▁York']
>>> s.encode('New York', out_type=str, alpha=0.0)
['▁New', '▁York']
>>> s.encode('New York', out_type=str, alpha=0.5)
['▁New', '▁York']

When we do set it, I expected that alpha=1.0 would yield character segmentation (always drop-out) while alpha=0.0 would yield deterministic BPE. (Actually the documentation mentions merge probability not dropout probability so I would expect the opposite but let's assume there was a misspelling in the doc).

In any case, while alpha = 1.0 works, alpha = 0.0 doesn't. Judging by the error message, the function seems to be unaware that my model is BPE and not Kudo's unigram LM:

>>> # Good
>>> s.encode('New York', out_type=str, enable_sampling=True, alpha=1.0)
['▁', 'N', 'e', 'w', '▁', 'Y', 'o', 'r', 'k']
>>> # Here I expected deterministic BPE
>>> s.encode('New York', out_type=str, enable_sampling=True, alpha=0.0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/work/home/estergiadis/tf2.0/lib/python3.6/site-packages/", line 268, in Encode
    'When enable_sampling is True, We must specify "nbest_size > 1" or "nbest_size = -1", '
RuntimeError: When enable_sampling is True, We must specify "nbest_size > 1" or "nbest_size = -1", and "0.0 < alpha < 1.0". "nbest_size = -1" is enabled only on unigram mode and samples from all candidates on the lattice instead of nbest segmentations. 

The error message is obviously false: we were able to use enable_sampling=True without specifying nbest_size and with alpha == 1.0 in the call right above.

So all in all, how should one do BPE-dropout?

taku910 commented 3 years ago

Yes, this part is really confusing, as the effect of the parameter "alpha" is opposite in BPE and unigram.



We will update the comment and the expected range of alpha. In the meantime, as long as you set 0 < alpha < 1, BPE-dropout will work as expected.
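For readers on 0.1.93, here is a minimal sketch of a wrapper that stays inside the range that works there. The helper name sample_bpe_dropout and its n_samples argument are hypothetical; the only library call it makes is encode(..., out_type=str, enable_sampling=True, alpha=...), exactly as used earlier in this thread. The strict range check is a conservative assumption based on the advice above, not something the library requires from v0.1.94 onward.

```python
def sample_bpe_dropout(processor, text, alpha, n_samples=5):
    """Draw several stochastic segmentations of `text`.

    `processor` is assumed to be a sentencepiece.SentencePieceProcessor
    loaded from a BPE model. This helper is illustrative only and is not
    part of the sentencepiece API.
    """
    # Conservative guard for 0.1.93: stay strictly inside (0, 1).
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1, got %r" % (alpha,))
    # Each call samples independently, so repeated calls may differ.
    return [
        processor.encode(text, out_type=str, enable_sampling=True, alpha=alpha)
        for _ in range(n_samples)
    ]


# Usage (requires the sentencepiece package and the spm.model trained above):
#   import sentencepiece as spm
#   s = spm.SentencePieceProcessor(model_file='spm.model')
#   for pieces in sample_bpe_dropout(s, 'New York', alpha=0.1):
#       print(pieces)
```

Taking the processor as an argument keeps the sketch independent of how the model was loaded.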

steremma commented 3 years ago

Thanks for the quick response, I think this makes sense. One tiny detail: should it perhaps be called dropout prob instead of merge prob? When alpha = 0 we have normal BPE meaning nothing is dropped and every merge does happen, so merge prob would be 1.0.
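To make the "dropout prob" reading concrete, here is a toy sketch of BPE-dropout in pure Python. It is not sentencepiece's implementation; the function bpe_dropout and the merge table MERGES are made up for illustration. Each candidate merge is skipped with probability dropout, so dropout=0.0 gives ordinary deterministic BPE and dropout=1.0 leaves the word as single characters, which matches the outputs reported above.

```python
import random

# Hypothetical merge table, highest priority first.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]


def bpe_dropout(word, merges, dropout, rng=None):
    """Segment `word` with BPE, dropping each candidate merge with prob `dropout`."""
    if rng is None:
        rng = random.Random()
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Adjacent pairs that are in the merge table and survive dropout.
        # rng.random() lies in [0, 1), so dropout=0.0 keeps every candidate
        # and dropout=1.0 keeps none.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        # Apply the highest-priority surviving merge (leftmost on ties).
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


print(bpe_dropout("lower", MERGES, dropout=0.0))  # ['lower']
print(bpe_dropout("lower", MERGES, dropout=1.0))  # ['l', 'o', 'w', 'e', 'r']
print(bpe_dropout("lower", MERGES, dropout=0.5))  # varies from run to run
```

Under this reading the probability that a given merge is applied is 1 - dropout, which is why "merge probability" describes the complement of alpha in BPE mode.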

taku910 commented 3 years ago

Updated the document and behavior in v0.1.94. Now alpha=0 or 1.0 is accepted in BPE mode.