marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

Sockeye is training much faster than Marian #396

Open tomsbergmanis opened 1 year ago

tomsbergmanis commented 1 year ago

Bug description

Sockeye is training much faster than Marian. I ran a one-epoch training on a small data set of 4.7M training examples with each framework. To the best of my knowledge I used comparable training parameters for both frameworks, but the results were 21 min vs 36 min in Sockeye's favor, i.e. Marian was roughly 1.7x slower. What I do not know is whether this is a problem with my old setup, Ubuntu 18.04.6 and everything that follows from that (e.g. an old compiler), or whether it is something in Marian itself.

How to reproduce

A typical way of training Sockeye systems is to run a data preparation step before training:

```
sockeye-prepare-data --source train.bpe.en --target train.bpe.lv --output . \
    --max-seq-len 128 --shared-vocab --num-words 25000
```

Data preparation time was not included in the training time. To measure Sockeye's training time I used the timestamps of two marker files created immediately before and after the training, which worked out to about 21 min:

```
touch sockeye.start
torchrun --no_python --nproc_per_node 2 sockeye-train --prepared-data . --output models \
    --validation-source dev.bpe.en --validation-target dev.bpe.lv \
    --max-num-epochs 1 --shared-vocab --dist --amp \
    --update-interval 12 --batch-size 18000 --max-seq-len 128 > training.log 2>&1
touch sockeye.end
```

For Marian I built the shared vocabulary with `/marian-vocab --max-size 25000` and trained with:

```
marian --devices 0 1 --type transformer \
    --model /tmp/toms/sockeye-test/model.npz \
    --train-sets /tmp/toms/sockeye-test/train.bpe.en /tmp/toms/sockeye-test/train.bpe.lv \
    --vocabs en-lv-shared-vocab.yml en-lv-shared-vocab.yml \
    --max-length 128 --max-length-factor 1.5 \
    --mini-batch-fit --workspace 18000 --maxi-batch 2000 \
    --early-stopping 10 --valid-freq 1000000 --save-freq 2000000 --disp-freq 100 \
    --keep-best --overwrite \
    --valid-metrics cross-entropy translation \
    --valid-sets /tmp/toms/sockeye-test/dev.bpe.en /tmp/toms/sockeye-test/dev.bpe.lv \
    --valid-script-path /tmp/toms/sockeye-test/validate.sh \
    --log /tmp/toms/sockeye-test/train.log --valid-log /tmp/toms/sockeye-test/valid.log \
    --seed 347155 --exponential-smoothing --normalize 0.6 --beam-size 6 \
    --quiet-translation --valid-translation-output /tmp/toms/sockeye-test/valid.output.txt \
    --valid-mini-batch 16 \
    --enc-depth 6 --dec-depth 6 --transformer-heads 8 \
    --transformer-preprocess d --transformer-postprocess-emb d --transformer-postprocess dan \
    --optimizer-delay 12 --learn-rate 0.0005 \
    --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
    --clip-norm 5 --tied-embeddings-all --sync-sgd \
    --transformer-dropout 0.1 --transformer-dropout-attention 0.1 --transformer-dropout-ffn 0.1 \
    --optimizer adam --optimizer-params 0.9 0.98 1e-09 \
    --sqlite /tmp/en-lv-W69bwc2f6meuT-combined.db -e 1 --fp16
```

To measure Marian's training time I used the timestamps of its "Training started" and "Training finished" log lines, which worked out to around 36 min. This was with Marian version v1.10.24 (4dd30b50 2021-09-08 14:02:21 +0100). I also tried Marian v1.11.0 (f00d0621 2022-02-08 08:39:24 -0800), but it did even worse: 43 min.
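For completeness, a minimal sketch (my own helper, not part of either toolkit) of how the elapsed minutes can be read back from the two marker files:

```bash
#!/usr/bin/env bash
# Print the wall-clock minutes between the two marker files created around
# the Sockeye run. Assumes GNU coreutils' stat (%Y = mtime in epoch seconds).
start=$(stat -c %Y sockeye.start)
end=$(stat -c %Y sockeye.end)
echo "elapsed: $(( (end - start) / 60 )) min"
```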

I do realize that Marian's --workspace 18000 and Sockeye's --batch-size 18000 aren't the same thing (the former preallocates 18000 MB of GPU workspace, the latter sets the batch size in tokens); however, running Sockeye with different --batch-size values did not affect the time it took to train for one epoch.
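If closer batching parity were needed, a possible variant (untested on my side; --mini-batch-words is an existing Marian option that counts target tokens much like Sockeye's --batch-size) would be:

```bash
# Sketch only: replace workspace-fitted batching with token-based batching.
# Drop --mini-batch-fit; all other options stay as in the full command above.
marian --devices 0 1 --type transformer \
    --mini-batch-words 18000 --maxi-batch 2000 \
    ...  # remaining options unchanged
```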

I also checked whether both frameworks saw the same number of sentences during their respective training runs; the numbers were about the same.
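As a rough cross-check, one epoch should correspond to at most the line count of the training data (pairs removed or truncated by the length limits make the actual number somewhat smaller):

```bash
# Expected upper bound on sentence pairs seen in a one-epoch run.
wc -l train.bpe.en train.bpe.lv
```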

Context

CMake configuration output:

```
-- Building with -march=native and intrinsics will be chosen automatically by the compiler to match the current machine.
-- Checking support for CPU intrinsics
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: software/anaconda3/envs/sockeye3 (found suitable version "10.0", minimum required is "9.0")
-- Compiling code for Pascal GPUs
-- Compiling code for Volta GPUs
-- Compiling code for Turing GPUs
-- Found CUDA libraries: software/anaconda3/envs/sockeye3/lib64/libcurand.so; software/anaconda3/envs/sockeye3/lib64/libcusparse.so; software/anaconda3/envs/sockeye3/lib64/libcublas.so
-- Found Tcmalloc: /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so
-- Found MKL: -Wl,--start-group;/opt/intel/mkl/lib/intel64/libmkl_intel_ilp64.a;/opt/intel/mkl/lib/intel64/libmkl_sequential.a;/opt/intel/mkl/lib/intel64/libmkl_core.a;-Wl,--end-group
CMake Warning at src/3rd_party/intgemm/CMakeLists.txt:33 (message):
  Not building AVX512VNNI-based multiplication because your compiler is too old.

  For details rerun cmake with --debug-trycompile then try to build in
  compile_tests/CMakeFiles/CMakeTmp.

-- VERSION: 0.1.94
-- Found TCMalloc: /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so
-- Found Doxygen: /usr/bin/doxygen (found version "1.8.13") found components: doxygen dot
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/toms/sockeye-test/marian/build
```
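One thing the warning above suggests trying is a rebuild with a newer compiler (a sketch, assuming gcc-8/g++-8 from the stock Ubuntu 18.04 archives; note that intgemm's AVX512VNNI kernels only affect CPU matrix multiplication, so this may well be irrelevant for GPU training speed):

```bash
# Rebuild Marian with a newer compiler so the AVX512VNNI check passes.
sudo apt-get install -y gcc-8 g++-8
cd marian && rm -rf build && mkdir build && cd build
CC=gcc-8 CXX=g++-8 cmake ..
make -j "$(nproc)"
```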

Ubuntu 18.04.6

Attachments:

- marian-v-1.10.train.log
- marian-v-1.11.train.log
- sockye_training.log
- sockeye.args.yaml.txt
- sockeye.data.config.txt