marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

marian-decoder stops on line without words #667

Closed jelmervdl closed 4 years ago

jelmervdl commented 4 years ago

Bug description

When a line that starts with too many encoded apostrophes (i.e. &apos;) is passed as input, marian-decoder stops on it and ignores the rest of the input. For example, giving it marian-not-ok.txt as input results in only AAAA as output.

If there are slightly fewer &apos; in the input, as in marian-ok.txt, it does continue. This input produces AAAA\n<garbage>\nBBBBB as expected.

How to reproduce

This was tested using the Estonian-English model from http://statmt.org/bergamot/models/ with the config.yml provided, which does not use any of the optimisations that are unavailable in the marian-dev master branch.

Bug-inducing input:

$ cat marian-not-ok.txt | ~/src/marian-dev/build/marian-decoder -c $MODEL/config.yml --quiet
AAAAAAAA

Similar but okay input:

$ cat marian-ok.txt | ~/src/marian-dev/build/marian-decoder -c $MODEL/config.yml --quiet
AAAAAAAA
&a-Assy; & &&.;a; and theater, theater-the-funds of theater's &a-the-plus, theater-size theater's &a, the theater's &a, theater-sphere, theater-sected, theater-sand; and the theater-sporation, the theater-fund of the thefts of the theater-funds, the thely-and-poor, the theft of the theft of the &a, the theasserables of the theathesscence of the thefts of the thea-swolfuses the-sover, the theftillties of the thea-smatures-sections, that of a, "secity and and's, and and."
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Context

kpu commented 4 years ago

Adding that these are WNGT-style models with SentencePiece.

snukky commented 4 years ago

Please note that those models have not been trained on text with escaped characters such as those produced by the Moses tokenizer. They are trained on data with normalized quotes and whitespace only, i.e. generally unprocessed text; subword segmentation is handled internally in Marian.
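Since the models expect plain text, escaped input could be unescaped before it reaches the decoder. The following is a minimal sketch, not something taken from Marian or from this thread: `unescape_moses` is a hypothetical helper name, and the entity list is an assumption based on the characters the Moses tokenizer usually escapes. The thread does not establish whether unescaping alone would avoid the problem reported here.

```shell
# Hypothetical helper: turn Moses-style entities back into plain characters.
# The entity list is an assumption about what the Moses tokenizer escapes.
unescape_moses() {
  # &amp; is handled last so that "&amp;apos;" becomes "&apos;" and not "'".
  sed -e 's/&#124;/|/g' \
      -e 's/&lt;/</g' \
      -e 's/&gt;/>/g' \
      -e 's/&apos;/'"'"'/g' \
      -e 's/&quot;/"/g' \
      -e 's/&#91;/[/g' \
      -e 's/&#93;/]/g' \
      -e 's/&amp;/\&/g'
}

# Possible usage with the command from the report (needs marian-decoder and a model):
#   unescape_moses < marian-not-ok.txt | ~/src/marian-dev/build/marian-decoder -c $MODEL/config.yml --quiet
```

Each `&apos;` shrinks from six characters to one, so unescaping also shortens the line before SentencePiece segments it.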

I guess the line simply exceeds the default input length limit of 1000 tokens once SentencePiece has tokenized it internally. Adding --max-length-crop should prevent the decoder from stopping when it encounters the first line longer than --max-length.
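Applied to the command from the report, the suggestion amounts to appending one flag. The sketch below wraps it in a function; `decode_cropped` and the `MARIAN` variable are hypothetical names introduced for illustration, and the binary and model paths are taken from the report, so adjust them to your setup.

```shell
# Hypothetical wrapper around the command from the report.
# MARIAN defaults to the build path used in the report; MODEL must point at the model directory.
MARIAN="${MARIAN:-$HOME/src/marian-dev/build/marian-decoder}"

decode_cropped() {
  # --max-length-crop crops lines longer than --max-length instead of
  # letting the decoder stop at the first such line.
  "$MARIAN" -c "$MODEL/config.yml" --quiet --max-length-crop
}

# Usage (needs a built marian-decoder and a downloaded model):
#   decode_cropped < marian-not-ok.txt
```

Cropping means the tail of an over-long line is not translated, so the output for that line covers only the part that fits within the limit.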

jelmervdl commented 4 years ago

I think @snukky might be right here. With --max-length-crop added, the issue no longer appears.

More of a related user question: is it intended behaviour that translation stops without visible error message when a (too) long input sentence is encountered?

snukky commented 4 years ago

It has been discussed here: https://github.com/marian-nmt/marian-dev/issues/365

jelmervdl commented 4 years ago

Thank you! I'll close this issue since it's not a bug. Sorry about that!