google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Bug with BPE dropout? #569

Closed steremma closed 3 years ago

steremma commented 3 years ago

I am attempting to use the BPE-dropout feature from either the command line or the Python API. I show the Python example because spm_encode doesn't support BPE-dropout at all, based on spm_encode --help.

I'm using version 0.1.93.

I start by making my BPE model. There was no mention of dropout in spm_train --help, so I assume we don't have to specify anything here.

spm_train --input=spm.corpus --model_prefix=spm --vocab_size=32000 --model_type=bpe --byte_fallback
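For reference, roughly the same training call can be made from Python. This is only a sketch, assuming the keyword-argument form of SentencePieceTrainer.train (the exact accepted arguments may differ slightly between versions):

import sentencepiece as spm

# Sketch: train a BPE model equivalent to the spm_train command above.
# Note there is no dropout-related flag at training time; dropout is
# applied only at encoding time via enable_sampling/alpha.
spm.SentencePieceTrainer.train(
    input='spm.corpus',
    model_prefix='spm',
    vocab_size=32000,
    model_type='bpe',
    byte_fallback=True,
)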

We will now use this model to tokenise stochastically from Python. Looking at help(s.encode), I read "alpha: Soothing parameter for unigram sampling, and merge probability for BPE-dropout". The parameter seems to have no effect unless we set enable_sampling=True, for example:

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> s.encode('New York', out_type=str, alpha=1.0)
['▁New', '▁York']
>>> s.encode('New York', out_type=str, alpha=0.0)
['▁New', '▁York']
>>> s.encode('New York', out_type=str, alpha=0.5)
['▁New', '▁York']

When we do set it, I expected that alpha=1.0 would yield character segmentation (always drop out) while alpha=0.0 would yield deterministic BPE. (Actually the documentation mentions merge probability, not dropout probability, so I would expect the opposite, but let's assume there was a misspelling in the doc.)

In any case, while alpha = 1.0 works, alpha = 0.0 doesn't. Based on the error message, the function seems to be unaware that my model is BPE and not Kudo's unigram LM:

>>> # Good
>>> s.encode('New York', out_type=str, enable_sampling=True, alpha=1.0)
['▁', 'N', 'e', 'w', '▁', 'Y', 'o', 'r', 'k']
>>> # Here I expected deterministic BPE
>>> s.encode('New York', out_type=str, enable_sampling=True, alpha=0.0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/work/home/estergiadis/tf2.0/lib/python3.6/site-packages/sentencepiece.py", line 268, in Encode
    'When enable_sampling is True, We must specify "nbest_size > 1" or "nbest_size = -1", '
RuntimeError: When enable_sampling is True, We must specify "nbest_size > 1" or "nbest_size = -1", and "0.0 < alpha < 1.0". "nbest_size = -1" is enabled only on unigram mode and samples from all candidates on the lattice instead of nbest segmentations. 

The error message is clearly wrong (we were able to use enable_sampling=True without specifying nbest_size, and with alpha=1.0, in the call right above).

So all in all, how should one do BPE-dropout?

taku910 commented 3 years ago

Yes, this part is really confusing, as the effect of the parameter "alpha" is opposite in BPE and unigram.

Unigram: alpha is the smoothing (inverse temperature) parameter for sampling, so a larger alpha concentrates the sampling on the best (Viterbi) segmentation, while a smaller alpha makes the sampling more random.

BPE: alpha is the dropout probability, i.e. the probability that each merge operation is skipped, so alpha=0.0 reproduces deterministic BPE, while alpha=1.0 drops every merge and yields character segmentation.

We will update the comment and the expected range of alpha. Anyway, as long as you set 0 < alpha < 1, BPE-dropout will work as expected.
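For example, something like the following should work (a minimal sketch reusing the model trained above; the sampled segmentations will differ between runs):

import sentencepiece as spm

s = spm.SentencePieceProcessor(model_file='spm.model')
# With 0 < alpha < 1, each merge is applied or dropped stochastically,
# so repeated calls can return different segmentations of the same text.
for _ in range(5):
    print(s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1))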

steremma commented 3 years ago

Thanks for the quick response, I think this makes sense. One tiny detail: should it perhaps be called dropout probability instead of merge probability? When alpha = 0 we have normal BPE, meaning nothing is dropped and every merge does happen, so the merge probability would be 1.0.

taku910 commented 3 years ago

Updated the documentation and behavior in v0.1.94. alpha=0 or alpha=1.0 is now accepted in BPE mode.
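For example (a sketch, assuming v0.1.94 or later is installed and the same spm.model as above):

import sentencepiece as spm

s = spm.SentencePieceProcessor(model_file='spm.model')
# alpha=0.0 now behaves like deterministic BPE (no merge is dropped),
# while alpha=1.0 drops every merge and yields character-level pieces.
print(s.encode('New York', out_type=str, enable_sampling=True, alpha=0.0))
print(s.encode('New York', out_type=str, enable_sampling=True, alpha=1.0))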