marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

marian_decoder starting and ending logic #332

Status: Open · sshleifer opened this issue 4 years ago

sshleifer commented 4 years ago

I was inspecting intermediate values of the output tensors in transformer.h while running marian_decoder, and noticed that on the first step through the decoder some token is passed whose word embedding is all zeros. Q1) What token is used as the prefix? Is there some trick that makes its embedding 0?

Q2) How does the decoder know to terminate a translation? In my Python port of the opus-nmt models, the decoder never predicts </s>.

Additional Clues

My Python port of the opus-nmt models works nicely when English is the source language: it just generates a dummy token once it is done translating. For fr-en, though, it generates nonsense at the beginning of the output, whereas marian-decoder generates no nonsense at all :)

```python
sample_text = "Donnez moi le micro ."
my_result = [", uh... give me the microphone ."]  # after constraining max_length
marian_decoder = "Give me the microphone!"        # after sentencepiece detokenization
```

Thanks in advance!

frankseide commented 4 years ago

Q1: The embedding of the sentence-start (BOS or <s>) context is hard-coded to be 0. It is not copied from the embedding matrix. I always felt that's a bug, but anecdotally, it makes no accuracy difference.
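For anyone mirroring this in a port, here is a minimal sketch of what that means in practice. This is hypothetical PyTorch-style code, not Marian's actual C++; the helper name and signature are mine. At the first decoding step the input embedding is an all-zero vector rather than a lookup of the <s> row:

```python
import torch
import torch.nn as nn

def decoder_step_embedding(embed: nn.Embedding,
                           prev_tokens: torch.Tensor,
                           step: int) -> torch.Tensor:
    # Hypothetical helper mimicking Marian's start-of-sentence handling:
    # at step 0 the "previous token" embedding is hard-coded to zeros
    # instead of being read from the embedding matrix.
    if step == 0:
        return torch.zeros(prev_tokens.size(0), embed.embedding_dim)
    return embed(prev_tokens)
```

This matches the observation in the question: the first decoder step sees a token whose embedding is exactly 0, regardless of which id nominally sits in the BOS slot of the vocabulary.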

Q2: Each beam hypothesis that ends in EOS (or </s>) will cease to be expanded. Once all hyps for a sentence end in EOS, sentence translation is complete.
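In a Python re-implementation the stopping logic could be sketched roughly as below. For brevity this is batched greedy decoding rather than full beam search, and eos_id, step_fn, and the finished mask are my names, not Marian's:

```python
import torch

def greedy_decode(step_fn, batch_size: int, eos_id: int, max_length: int = 256):
    # step_fn(tokens) -> logits over the vocabulary for the next token,
    # shape (batch_size, vocab_size).
    tokens = torch.full((batch_size, 0), 0, dtype=torch.long)
    finished = torch.zeros(batch_size, dtype=torch.bool)
    for _ in range(max_length):
        logits = step_fn(tokens)
        next_tok = logits.argmax(dim=-1)
        # A hypothesis that has already produced EOS is frozen: keep
        # padding it with EOS instead of expanding it further.
        next_tok = torch.where(finished,
                               torch.full_like(next_tok, eos_id),
                               next_tok)
        tokens = torch.cat([tokens, next_tok.unsqueeze(1)], dim=1)
        finished |= next_tok.eq(eos_id)
        if finished.all():  # every hypothesis ended in EOS -> done
            break
    return tokens
```

The same idea applies per beam hypothesis: once a hyp emits EOS it stops being expanded, and the sentence is complete when all of its hyps are finished (or max_length is reached).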