Adding that these are WNGT-style models with SentencePiece.
Please note that those models have not been trained on text with escaped characters like those produced by the Moses tokenizer; they were trained on data with only quotes and whitespace normalized (essentially raw text, since subword segmentation is handled internally in Marian).
I guess you are just exceeding the default input length limit of 1000 after SentencePiece tokenizes the input internally. Adding --max-length-crop should prevent the decoder from stopping after encountering the first line longer than --max-length.
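For illustration, a rough sketch of applying the workaround on the command line (not taken from this thread; the config and input file names are placeholders, while --max-length and --max-length-crop are the existing marian-decoder options discussed above):

    # Crop over-long sentences to 1000 tokens instead of dropping them
    ./marian-decoder -c config.yml --max-length 1000 --max-length-crop < input.txt > output.txt

The same effect should be achievable by adding max-length-crop: true to config.yml, since Marian's command-line options map directly to config keys.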
I think @snukky might be right here: adding --max-length-crop makes the issue no longer appear.
More of a related user question: is it intended behaviour that translation stops without a visible error message when a (too) long input sentence is encountered?
It has been discussed here: https://github.com/marian-nmt/marian-dev/issues/365
Thank you! I'll close this issue since it's not a bug. Sorry about that!
Bug description
When a line that starts with too many encoded apostrophes (i.e. &apos;) is passed as input, marian-decoder stops on it, ignoring the rest of the input. For example, giving it marian-not-ok.txt as input will only result in AAAA as output.
If there are slightly fewer &apos; in the input, as in marian-ok.txt, it does continue. That input produces AAAA\n<garbage>\nBBBBB as expected.
How to reproduce
This was tested using the Estonian-English model from http://statmt.org/bergamot/models/ (with the provided config.yml, which does not use any optimisations that are unavailable in the marian-dev master branch).
Bug-inducing input:
Similar but okay input:
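For reference, a minimal sketch of what a reproduction run might look like, assuming the model archive has been unpacked into the working directory (the exact paths are an assumption; marian-not-ok.txt and marian-ok.txt refer to the attachments above):

    # Bug-inducing input: only the first translation (AAAA) is emitted, then decoding stops
    ./marian-decoder -c config.yml < marian-not-ok.txt

    # Similar but okay input: all lines are translated (AAAA, <garbage>, BBBBB)
    ./marian-decoder -c config.yml < marian-ok.txt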
Context
Marian version: v1.9.25; 80232e61 2020-06-24 14:06:50 -0700
CMake command:
Log file: marian.log