marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

[Question] Is the sentencepiece alpha in Marian CLI the one used for BPE dropout? #658

Closed: alvations closed this issue 4 years ago

alvations commented 4 years ago

Is --sentencepiece-alphas in the Marian CLI the same as the alpha in https://github.com/google/sentencepiece/blob/master/src/bpe_model.h#L43 that supports BPE dropout, when it is passed through at https://github.com/marian-nmt/marian-dev/blob/master/src/data/sentencepiece_vocab.cpp#L214 ?

If so, why is there more than 1 alpha?

emjotde commented 4 years ago

We haven't updated to a newer SentencePiece version for a while. If that concerns BPE dropout, then likely not.

alvations commented 4 years ago

Thanks for the clarification! I'll play around with --sentencepiece-alphas in Marian and see what it does during training then =)

emjotde commented 4 years ago

Didn't do anything for me. Hence my skepticism towards BPE dropout as well.

alvations commented 4 years ago

Thanks again for checking! Seems like it didn't do much with my .spm models either... Maybe it has to be used with some other mechanisms.

Lol, maybe it's better to just create the .spm with the latest dev branch of sentencepiece =)

emjotde commented 4 years ago

Don't think the algorithms actually changed.

alvations commented 4 years ago

I think the SampleEncode overloads did get a new alpha argument in SentencePiece.
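Roughly, the call I mean looks like this in the standalone SentencePiece C++ API (just a sketch on my side; the model path, input word, and alpha value are placeholders, and whether Marian's bundled copy forwards --sentencepiece-alphas to it is exactly the open question here):

```cpp
#include <sentencepiece_processor.h>

#include <iostream>
#include <string>
#include <vector>

int main() {
  sentencepiece::SentencePieceProcessor sp;
  if (!sp.Load("model.spm").ok()) return 1;  // placeholder model path

  std::vector<std::string> pieces;
  // nbest_size = -1 samples from all segmentation hypotheses;
  // alpha is the sampling smoothing / dropout parameter (value is arbitrary).
  sp.SampleEncode("unbelievable", /*nbest_size=*/-1, /*alpha=*/0.1f, &pieces);

  for (const auto& p : pieces) std::cout << p << ' ';
  std::cout << std::endl;
  return 0;
}
```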

For context, I just need to figure out how to use BPE dropout there to "replicate" someone else's experiment that uses https://github.com/VKCOM/YouTokenToMe (SentencePiece's unigram algorithm with BPE dropout).

alvations commented 4 years ago

Going to close this issue. Thanks for the explanations! I'll just check back when the new sentencepiece is updated in Marian =)

emjotde commented 4 years ago

A remark for clarity: the unigram method isn't BPE, hence there is no BPE dropout there. I guess you just mean the segmentation sampling?

alvations commented 4 years ago

yes the "BPE dropout" here refers to the segmentation sampling ("Multiple Subword Candidates"), i.e. for the same word, segment it into different ways. https://github.com/google/sentencepiece#subword-regularization-and-bpe-dropout =)

Maybe it should be called "subword dropout" or "subword sampling" or something, lol...