marian-decoder stops on line without words

jelmervdl commented 4 years ago

Bug description

When a line that starts with too many encoded apostrophes (i.e. &apos;) is passed as input, marian-decoder stops at that line and ignores the rest of the input. For example, giving it marian-not-ok.txt as input produces only AAAA as output.

If the input contains slightly fewer &apos;, as in marian-ok.txt, decoding does continue. That input produces AAAA\n<garbage>\nBBBBB as expected.

How to reproduce

This was tested using the Estonian-English model from http://statmt.org/bergamot/models/ (with the config.yml provided, which uses only optimisations that are available in the marian-dev master branch).

Bug-inducing input:

$ cat marian-not-ok.txt | ~/src/marian-dev/build/marian-decoder -c $MODEL/config.yml --quiet
AAAAAAAA

Similar but okay input:

$ cat marian-ok.txt | ~/src/marian-dev/build/marian-decoder -c $MODEL/config.yml --quiet
AAAAAAAA
&a-Assy; & &&.;a; and theater, theater-the-funds of theater's &a-the-plus, theater-size theater's &a, the theater's &a, theater-sphere, theater-sected, theater-sand; and the theater-sporation, the theater-fund of the thefts of the theater-funds, the thely-and-poor, the theft of the theft of the &a, the theasserables of the theathesscence of the thefts of the thea-swolfuses the-sover, the theftillties of the thea-smatures-sections, that of a, "secity and and's, and and."
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Context

Marian version: v1.9.25; 80232e61 2020-06-24 14:06:50 -0700

CMake command:

AVX2_FOUND=true
AVX512_FOUND=true
AVX_FOUND=true
BUILD_ARCH=native
CMAKE_ADDR2LINE=/usr/bin/addr2line
CMAKE_AR=/usr/bin/ar
CMAKE_BUILD_TYPE=Release
CMAKE_COLOR_MAKEFILE=ON
CMAKE_CXX_COMPILER=/usr/local/software/archive/linux-scientific7-x86_64/gcc-9/gcc-8.4.0-cmt6vj7mvn5mqvomwknaujmkbzggomki/bin/g++
CMAKE_CXX_COMPILER_AR=/usr/local/software/master/gcc/8/bin/gcc-ar
CMAKE_CXX_COMPILER_RANLIB=/usr/local/software/master/gcc/8/bin/gcc-ranlib
CMAKE_CXX_FLAGS=-std=c++11 -pthread -Wl,--no-as-needed -fPIC -Wno-unused-result -Wno-unknown-warning-option  -march=native  -msse2 -msse3 -msse4.1 -msse4.2 -mavx -mavx2 -mavx512f -DUSE_SENTENCEPIECE -DMKL_ILP64 -m64
CMAKE_CXX_FLAGS_DEBUG=-O0 -g -rdynamic
CMAKE_CXX_FLAGS_MINSIZEREL=-Os -DNDEBUG
CMAKE_CXX_FLAGS_RELEASE=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_CXX_FLAGS_RELWITHDEBINFO=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_C_COMPILER=/usr/local/software/archive/linux-scientific7-x86_64/gcc-9/gcc-8.4.0-cmt6vj7mvn5mqvomwknaujmkbzggomki/bin/gcc
CMAKE_C_COMPILER_AR=/usr/local/software/master/gcc/8/bin/gcc-ar
CMAKE_C_COMPILER_RANLIB=/usr/local/software/master/gcc/8/bin/gcc-ranlib
CMAKE_C_FLAGS=-pthread -Wl,--no-as-needed -fPIC -Wno-unused-result -Wno-unknown-warning-option  -march=native  -msse2 -msse3 -msse4.1 -msse4.2 -mavx -mavx2 -mavx512f -DMKL_ILP64 -m64
CMAKE_C_FLAGS_DEBUG=-O0 -g -rdynamic
CMAKE_C_FLAGS_MINSIZEREL=-Os -DNDEBUG
CMAKE_C_FLAGS_RELEASE=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_C_FLAGS_RELWITHDEBINFO=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_DLLTOOL=CMAKE_DLLTOOL-NOTFOUND
CMAKE_INSTALL_PREFIX=/usr/local
CMAKE_LINKER=/usr/bin/ld
CMAKE_MAKE_PROGRAM=/usr/bin/gmake
CMAKE_NM=/usr/bin/nm
CMAKE_OBJCOPY=/usr/bin/objcopy
CMAKE_OBJDUMP=/usr/bin/objdump
CMAKE_RANLIB=/usr/bin/ranlib
CMAKE_READELF=/usr/bin/readelf
CMAKE_SKIP_INSTALL_RPATH=NO
CMAKE_SKIP_RPATH=NO
CMAKE_STRIP=/usr/bin/strip
CMAKE_VERBOSE_MAKEFILE=FALSE
COMPILE_CPU=on
COMPILE_CUDA=off
COMPILE_CUDA_SM35=ON
COMPILE_CUDA_SM50=ON
COMPILE_CUDA_SM60=ON
COMPILE_CUDA_SM70=ON
COMPILE_EXAMPLES=OFF
COMPILE_SERVER=OFF
COMPILE_TESTS=OFF
GIT_EXECUTABLE=/usr/local/software/global/bin/git
INTEL_ROOT=/opt/intel
MKL_CORE_LIBRARY=/usr/local/Cluster-Apps/intel/2019.3/compilers_and_libraries_2019.3.192/linux/mkl/lib/intel64/libmkl_core.a
MKL_INCLUDE_DIR=/usr/local/Cluster-Apps/intel/2019.3/compilers_and_libraries_2019.3.192/linux/mkl/include
MKL_INCLUDE_DIRS=/usr/local/Cluster-Apps/intel/2019.3/compilers_and_libraries_2019.3.192/linux/mkl/include
MKL_INTERFACE_LIBRARY=/usr/local/Cluster-Apps/intel/2019.3/compilers_and_libraries_2019.3.192/linux/mkl/lib/intel64/libmkl_intel_ilp64.a
MKL_LIBRARIES=-Wl,--start-group;/usr/local/Cluster-Apps/intel/2019.3/compilers_and_libraries_2019.3.192/linux/mkl/lib/intel64/libmkl_intel_ilp64.a;/usr/local/Cluster-Apps/intel/2019.3/compilers_and_libraries_2019.3.192/linux/mkl/lib/intel64/libmkl_sequential.a;/usr/local/Cluster-Apps/intel/2019.3/compilers_and_libraries_2019.3.192/linux/mkl/lib/intel64/libmkl_core.a;-Wl,--end-group
MKL_ROOT=/usr/local/Cluster-Apps/intel/2019.3/compilers_and_libraries_2019.3.192/linux/mkl
MKL_SEQUENTIAL_LAYER_LIBRARY=/usr/local/Cluster-Apps/intel/2019.3/compilers_and_libraries_2019.3.192/linux/mkl/lib/intel64/libmkl_sequential.a
PROTOBUF_INCLUDE_DIR=/rds/project/t2_vol4/rds-t2-cs119/jelmervdl/protobuf-3.12.3/include
PROTOBUF_LIBRARY=/rds/project/t2_vol4/rds-t2-cs119/jelmervdl/protobuf-3.12.3/lib/libprotobuf.so
PROTOBUF_PROTOC_EXECUTABLE=/rds/project/t2_vol4/rds-t2-cs119/jelmervdl/protobuf-3.12.3/bin/protoc
SSE2_FOUND=true
SSE3_FOUND=true
SSE4_1_FOUND=true
SSE4_2_FOUND=true
SSSE3_FOUND=true
Tcmalloc_INCLUDE_DIR=Tcmalloc_INCLUDE_DIR-NOTFOUND
Tcmalloc_LIBRARY=Tcmalloc_LIBRARY-NOTFOUND
USE_CCACHE=OFF
USE_CUDNN=OFF
USE_DOXYGEN=ON
USE_FBGEMM=OFF
USE_MKL=ON
USE_MPI=OFF
USE_NCCL=ON
USE_SENTENCEPIECE=on
USE_STATIC_LIBS=on

Log file: marian.log

kpu commented 4 years ago

Adding that these are WNGT-style models with SentencePiece.

snukky commented 4 years ago

Please note that those models have not been trained on text with escaped characters such as those produced by the Moses tokenizer. They are trained on data with only quotes and whitespace normalized, i.e. generally unprocessed text; subword segmentation is handled internally in Marian.

I guess you are just exceeding the default input length limit of 1000, after SentencePiece tokenizes the input internally. Adding --max-length-crop should prevent the decoder from stopping after encountering the first line longer than --max-length.
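A minimal sketch of that suggestion, applied to the reproduction command from the report. The decoder path and $MODEL come from the report; the MARIAN_DECODER variable and the decode_with_crop wrapper are hypothetical names introduced only for illustration.

```shell
# Assumption: MARIAN_DECODER points at a marian-decoder build and
# MODEL at the model directory (both are hypothetical variable names).
MARIAN_DECODER="${MARIAN_DECODER:-$HOME/src/marian-dev/build/marian-decoder}"

# Read source text on stdin, write translations to stdout.
# --max-length-crop crops over-long lines to --max-length
# instead of omitting them.
decode_with_crop() {
  "$MARIAN_DECODER" -c "$MODEL/config.yml" --quiet --max-length-crop
}

# Usage: decode_with_crop < marian-not-ok.txt
```

Keep in mind that a cropped line is only translated up to the length limit, so its output may be incomplete.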

jelmervdl commented 4 years ago

I think @snukky might be right here. With --max-length-crop added, the issue no longer appears.

More of a related user question: is it intended behaviour that translation stops without a visible error message when a (too) long input sentence is encountered?
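Since nothing is printed when this happens, one way to notice it is to compare line counts after decoding. This is only a sketch: check_line_counts is a hypothetical helper, and it assumes the decoder writes exactly one output line per input line (for example, no n-best output).

```shell
# Compare the number of lines in the source file ($1) with the number
# of lines in the translation file ($2). Returns non-zero on mismatch.
check_line_counts() {
  # Arithmetic expansion strips the padding some wc versions add.
  in_lines=$(( $(wc -l < "$1") ))
  out_lines=$(( $(wc -l < "$2") ))
  if [ "$in_lines" -ne "$out_lines" ]; then
    echo "line count mismatch: $in_lines in, $out_lines out" >&2
    return 1
  fi
}

# Usage (file names are placeholders):
#   marian-decoder -c "$MODEL/config.yml" --quiet < input.txt > output.txt
#   check_line_counts input.txt output.txt
```

A mismatch only says that lines are missing, not which one caused it; the first input line without a corresponding output line is the place to look.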

snukky commented 4 years ago

It has been discussed here: https://github.com/marian-nmt/marian-dev/issues/365

jelmervdl commented 4 years ago

Thank you! I'll close this issue since it's not a bug. Sorry about that!
