google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

How to use BPE-Dropout? #535

Closed samin9796 closed 2 years ago

samin9796 commented 4 years ago

How can I use BPE-Dropout? I don't see any change in the output when I try different alpha values with a BPE model.

taku910 commented 4 years ago

Please describe what you tried, e.g., the command-line flags and/or the Python code (if you are using the Python module).

xiefangqi commented 2 years ago

The example is in README.md:

Subword regularization and BPE-dropout

Subword regularization [Kudo.] and BPE-dropout [Provilkov et al.] are simple regularization methods that virtually augment training data with on-the-fly subword sampling, which helps to improve the accuracy as well as robustness of NMT models.
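To make "on-the-fly subword sampling" concrete for the BPE case, here is a minimal pure-Python sketch of the BPE-dropout idea: at every step, each applicable merge is skipped with probability `dropout`, and the best-ranked surviving merge is applied. It is a toy illustration, not SentencePiece's implementation; the function `bpe_dropout_encode` and the merge table `TOY_MERGES` are hypothetical names made up for this example.

```python
import random

# Hypothetical toy merge table, highest priority first (not from a real model).
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]


def bpe_dropout_encode(word, merges, dropout=0.1, rng=None):
    """Segment `word` with a BPE merge table, dropping each candidate merge
    with probability `dropout` at every step (toy BPE-dropout)."""
    if rng is None:
        rng = random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}
    pieces = list(word)  # start from single characters
    while True:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(pieces, pieces[1:]))
            if (a, b) in rank and rng.random() >= dropout
        ]
        if not candidates:
            return pieces
        # Apply the highest-priority (lowest-rank) surviving merge.
        _, i = min(candidates)
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]


print(bpe_dropout_encode("lower", TOY_MERGES, dropout=0.0))  # ['lower']
print(bpe_dropout_encode("lower", TOY_MERGES, dropout=1.0))  # ['l', 'o', 'w', 'e', 'r']
# With 0 < dropout < 1 the segmentation varies from call to call.
print(bpe_dropout_encode("lower", TOY_MERGES, dropout=0.5))
```

With `dropout=0.0` this reduces to ordinary deterministic BPE, and with `dropout=1.0` the word stays split into characters. As far as I understand, `alpha` plays the role of this dropout probability when the SentencePiece model type is BPE.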

To enable subword regularization, integrate the SentencePiece library (C++/Python) into the NMT system so that one segmentation is sampled for each parameter update, which differs from standard offline data preparation. Here is an example using the Python library. 'New York' is segmented differently on each SampleEncode (C++) call or each encode call with enable_sampling=True (Python). The sampling parameters are documented in sentencepiece_processor.h.

    >>> import sentencepiece as spm
    >>> s = spm.SentencePieceProcessor(model_file='spm.model')
    >>> for n in range(5):
    ...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
    ...
    ['▁', 'N', 'e', 'w', '▁York']
    ['▁', 'New', '▁York']
    ['▁', 'New', '▁Y', 'o', 'r', 'k']
    ['▁', 'New', '▁York']
    ['▁', 'New', '▁York']
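If different alpha values seem to have no effect, one quick check is to sample the same input many times and count the distinct segmentations. The helper below is a small sketch for that; `count_segmentations` is a hypothetical name, and it only assumes an `encode` callable that accepts the same keyword arguments as the README example above.

```python
from collections import Counter


def count_segmentations(encode, text, alpha, n=100):
    """Call `encode` n times with sampling enabled and tally each distinct
    segmentation. `encode` is assumed to take the keyword arguments used in
    the README example (out_type, enable_sampling, alpha, nbest_size)."""
    counts = Counter()
    for _ in range(n):
        pieces = encode(
            text,
            out_type=str,
            enable_sampling=True,  # without this, alpha is not used
            alpha=alpha,
            nbest_size=-1,
        )
        counts[tuple(pieces)] += 1
    return counts
```

With a loaded processor `s` as in the example above, `count_segmentations(s.encode, 'New York', alpha=0.1)` should return more than one key if sampling is active. A single key for every alpha would suggest that sampling is not actually enabled in the call being tested, or that alpha is very close to 0.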

taku910 commented 2 years ago

Let me close this issue, since there seems to be no further discussion.