marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

Sockeye is training much faster than Marian #396

Open tomsbergmanis opened 1 year ago

tomsbergmanis commented 1 year ago

Bug description

Sockeye is training much faster than Marian. I ran a one-epoch training on a small data set of 4.7M training examples with each framework. To the best of my knowledge I used comparable training parameters for both frameworks, but the results were 21 min vs 36 min in Sockeye's favor, i.e. Marian was roughly 1.7x slower. What I do not know is whether this is a problem with my old setup, Ubuntu 18.04.6 and everything that follows from that (e.g. an old compiler), or whether it is something in Marian itself.

How to reproduce

A typical way of training Sockeye systems is to run a data preparation step before training:

```
sockeye-prepare-data --source train.bpe.en --target train.bpe.lv --output . \
    --max-seq-len 128 --shared-vocab --num-words 25000
```

Data preparation time was not included in the training time. To measure Sockeye's training time I used the timestamps of two marker files created immediately before and after the training, which worked out to about 21 min:

```
touch sockeye.start
torchrun --no_python --nproc_per_node 2 sockeye-train --prepared-data . --output models \
    --validation-source dev.bpe.en --validation-target dev.bpe.lv \
    --max-num-epochs 1 --shared-vocab --dist --amp \
    --update-interval 12 --batch-size 18000 --max-seq-len 128 > training.log 2>&1
touch sockeye.end
```

For Marian I built the shared vocabulary with `/marian-vocab --max-size 25000` and trained with:

```
marian --devices 0 1 --type transformer \
    --model /tmp/toms/sockeye-test/model.npz \
    --train-sets /tmp/toms/sockeye-test/train.bpe.en /tmp/toms/sockeye-test/train.bpe.lv \
    --vocabs en-lv-shared-vocab.yml en-lv-shared-vocab.yml \
    --max-length 128 --max-length-factor 1.5 \
    --mini-batch-fit --workspace 18000 --maxi-batch 2000 \
    --early-stopping 10 --valid-freq 1000000 --save-freq 2000000 --disp-freq 100 \
    --keep-best --overwrite \
    --valid-metrics cross-entropy translation \
    --valid-sets /tmp/toms/sockeye-test/dev.bpe.en /tmp/toms/sockeye-test/dev.bpe.lv \
    --valid-script-path /tmp/toms/sockeye-test/validate.sh \
    --log /tmp/toms/sockeye-test/train.log --valid-log /tmp/toms/sockeye-test/valid.log \
    --seed 347155 --exponential-smoothing --normalize 0.6 --beam-size 6 \
    --quiet-translation --valid-translation-output /tmp/toms/sockeye-test/valid.output.txt \
    --valid-mini-batch 16 \
    --enc-depth 6 --dec-depth 6 --transformer-heads 8 \
    --transformer-preprocess d --transformer-postprocess-emb d --transformer-postprocess dan \
    --optimizer-delay 12 --learn-rate 0.0005 \
    --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
    --clip-norm 5 --tied-embeddings-all --sync-sgd \
    --transformer-dropout 0.1 --transformer-dropout-attention 0.1 --transformer-dropout-ffn 0.1 \
    --optimizer adam --optimizer-params 0.9 0.98 1e-09 \
    --sqlite /tmp/en-lv-W69bwc2f6meuT-combined.db -e 1 --fp16
```

To measure Marian's training time I used the timestamps of its "Training started" and "Training finished" log lines, which worked out to around 36 min. This was with Marian version v1.10.24 (4dd30b50 2021-09-08 14:02:21 +0100). I also tried Marian v1.11.0 (f00d0621 2022-02-08 08:39:24 -0800), but it did even worse: 43 min.
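For completeness, a minimal sketch (my own helper, not part of either toolkit) of how the elapsed minutes can be read back from the two marker files:

```bash
#!/usr/bin/env bash
# Print the wall-clock minutes between the two marker files created around
# the Sockeye run. Assumes GNU coreutils' stat (%Y = mtime in epoch seconds).
start=$(stat -c %Y sockeye.start)
end=$(stat -c %Y sockeye.end)
echo "elapsed: $(( (end - start) / 60 )) min"
```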

I do realize that Marian's --workspace 18000 and Sockeye's --batch-size 18000 aren't the same thing (the former preallocates 18000 MB of GPU workspace, the latter sets the batch size in tokens); however, running Sockeye with different --batch-size values did not affect the time it took to train for one epoch.
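If closer batching parity were needed, a possible variant (untested on my side; --mini-batch-words is an existing Marian option that counts target tokens much like Sockeye's --batch-size) would be:

```bash
# Sketch only: replace workspace-fitted batching with token-based batching.
# Drop --mini-batch-fit; all other options stay as in the full command above.
marian --devices 0 1 --type transformer \
    --mini-batch-words 18000 --maxi-batch 2000 \
    ...  # remaining options unchanged
```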

I also checked whether both frameworks saw the same number of sentences during their respective training runs; the numbers were about the same.
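As a rough cross-check, one epoch should correspond to at most the line count of the training data (pairs removed or truncated by the length limits make the actual number somewhat smaller):

```bash
# Expected upper bound on sentence pairs seen in a one-epoch run.
wc -l train.bpe.en train.bpe.lv
```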

Context

CMake configuration output:

```
-- Building with -march=native and intrinsics will be chosen automatically by the compiler to match the current machine.
-- Checking support for CPU intrinsics
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: software/anaconda3/envs/sockeye3 (found suitable version "10.0", minimum required is "9.0")
-- Compiling code for Pascal GPUs
-- Compiling code for Volta GPUs
-- Compiling code for Turing GPUs
-- Found CUDA libraries: software/anaconda3/envs/sockeye3/lib64/libcurand.so; software/anaconda3/envs/sockeye3/lib64/libcusparse.so; software/anaconda3/envs/sockeye3/lib64/libcublas.so
-- Found Tcmalloc: /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so
-- Found MKL: -Wl,--start-group;/opt/intel/mkl/lib/intel64/libmkl_intel_ilp64.a;/opt/intel/mkl/lib/intel64/libmkl_sequential.a;/opt/intel/mkl/lib/intel64/libmkl_core.a;-Wl,--end-group
CMake Warning at src/3rd_party/intgemm/CMakeLists.txt:33 (message):
  Not building AVX512VNNI-based multiplication because your compiler is too old.

  For details rerun cmake with --debug-trycompile then try to build in
  compile_tests/CMakeFiles/CMakeTmp.

-- VERSION: 0.1.94
-- Found TCMalloc: /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so
-- Found Doxygen: /usr/bin/doxygen (found version "1.8.13") found components: doxygen dot
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/toms/sockeye-test/marian/build
```
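One thing the warning above suggests trying is a rebuild with a newer compiler (a sketch, assuming gcc-8/g++-8 from the stock Ubuntu 18.04 archives; note that intgemm's AVX512VNNI kernels only affect CPU matrix multiplication, so this may well be irrelevant for GPU training speed):

```bash
# Rebuild Marian with a newer compiler so the AVX512VNNI check passes.
sudo apt-get install -y gcc-8 g++-8
cd marian && rm -rf build && mkdir build && cd build
CC=gcc-8 CXX=g++-8 cmake ..
make -j "$(nproc)"
```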

Ubuntu 18.04.6

Attachments:

- marian-v-1.10.train.log
- marian-v-1.11.train.log
- sockye_training.log
- sockeye.args.yaml.txt
- sockeye.data.config.txt