We haven't updated to a newer SentencePiece for a while, so if that concerns BPE dropout, then likely not.
Thanks for the clarification! I'll play around with --sentencepiece-alphas in Marian and see what it does during training then =)
Didn't do anything for me. Hence my skepticism towards BPE dropout as well.
Thanks again for checking! Seems like it didn't do much for my .spm models either... Maybe it has to be used together with some other mechanisms.
Lol, maybe it's better to just create the .spm with the latest sentencepiece dev branch =)
Don't think the algorithms actually changed.
I think the SampleEncode overloaded functions did get a new alpha argument in SentencePiece.
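Something like this is what I had in mind, a rough sketch using the Python bindings (the model path and alpha value here are just placeholders on my side):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("model.spm")  # placeholder path to an .spm model

# SampleEncodeAsPieces takes an nbest_size and the alpha smoothing parameter;
# nbest_size=-1 samples over all segmentation candidates (for unigram models).
pieces = sp.SampleEncodeAsPieces("New York", -1, 0.1)
print(pieces)
```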
For context, I just need to figure out how to use the BPE dropout there to "replicate" someone else's experiment that uses https://github.com/VKCOM/YouTokenToMe (sentencepiece's unigram entropy algorithm with BPE dropout).
Going to close this issue. Thanks for the explanations! I'll just check back when Marian is updated to the newer sentencepiece =)
A remark for clarity: the unigram method isn't BPE, hence there's no BPE dropout there. I guess you just mean the segmentation sampling?
Yes, the "BPE dropout" here refers to the segmentation sampling ("multiple subword candidates"), i.e. for the same word, segmenting it in different ways: https://github.com/google/sentencepiece#subword-regularization-and-bpe-dropout =)
Maybe it should be called "subword dropout" or "subword sampling" or something, lol...
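To make the sampling concrete, here's roughly what it looks like from the Python side (a sketch; model path and parameter values are just placeholders):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("model.spm")  # placeholder path to a unigram .spm model

for _ in range(5):
    # enable_sampling=True turns on subword regularization;
    # nbest_size=-1 samples over all candidates, alpha controls the smoothing.
    # The same input should come out segmented differently across calls.
    print(sp.encode("the quick brown fox", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```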
Is the --sentencepiece-alphas in the Marian CLI the same as the alpha on https://github.com/google/sentencepiece/blob/master/src/bpe_model.h#L43 to support BPE dropout when called at https://github.com/marian-nmt/marian-dev/blob/master/src/data/sentencepiece_vocab.cpp#L214 ? If so, why is there more than one alpha?