marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

marian-dev-1.6 build error #279

Closed 520jefferson closed 5 years ago

520jefferson commented 6 years ago

I built marian-1.6.0 and got the errors below. When I run cmake .., it reports (-- Found NCCL (include: /usr/local/include, library: /usr/local/lib/libnccl.so)), but I can't find ncclGroupStart in nccl.h. The details are as follows:

/marian-dev-1.6.0/src/training/communicator.cu(64): error: identifier "ncclGroupStart" is undefined
/marian-dev-1.6.0/src/training/communicator.cu(83): error: identifier "ncclGroupEnd" is undefined
/marian-dev-1.6.0/src/training/communicator.cu(94): error: identifier "ncclGroupStart" is undefined
/marian-dev-1.6.0/src/training/communicator.cu(103): error: argument of type "void *" is incompatible with parameter of type "int"
/marian-dev-1.6.0/src/training/communicator.cu(103): error: argument of type "int" is incompatible with parameter of type "ncclDataType_t"
/marian-dev-1.6.0/src/training/communicator.cu(103): error: argument of type "ncclDataType_t" is incompatible with parameter of type "void *"
/marian-dev-1.6.0/src/training/communicator.cu(108): error: identifier "ncclGroupEnd" is undefined
[ 67%] Building CXX object src/CMakeFiles/marian.dir/graph/node_operators.cpp.o
[ 68%] Building CXX object src/CMakeFiles/marian.dir/graph/node_initializers.cpp.o
[ 69%] Building CXX object src/CMakeFiles/marian.dir/layers/convolution.cpp.o
[ 69%] Building CXX object src/CMakeFiles/marian.dir/layers/loss.cpp.o
[ 70%] Building CXX object src/CMakeFiles/marian.dir/layers/weight.cpp.o
7 errors detected in the compilation of "/tmp/tmpxft_00005fa4_00000000-10_communicator.compute_61.cpp1.ii".
CMake Error at marian_cuda_generated_communicator.cu.o.cmake:262 (message):
  Error generating file /marian-dev-1.6.0/build/src/CMakeFiles/marian_cuda.dir/training/./marian_cuda_generated_communicator.cu.o
src/CMakeFiles/marian_cuda.dir/build.make:147: recipe for target 'src/CMakeFiles/marian_cuda.dir/training/marian_cuda_generated_communicator.cu.o' failed
make[2]: *** [src/CMakeFiles/marian_cuda.dir/training/marian_cuda_generated_communicator.cu.o] Error 1
make[2]: Waiting for unfinished jobs....

emjotde commented 6 years ago

I think that might be an old version of NCCL; we require >2.0. You can either install a newer version from NVIDIA's website or build without NCCL with

cmake .. -DUSE_NCCL=off

It seems the CMake module for NCCL does not check the version number; I will try to fix this soon.
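
A quick way to check which NCCL the build is actually picking up is to look inside the header that CMake reported above (adjust the path if NCCL is installed elsewhere). ncclGroupStart/ncclGroupEnd were introduced with NCCL 2, so if the first grep finds nothing, the header belongs to an older release:

grep -c ncclGroupStart /usr/local/include/nccl.h   # 0 means the header predates NCCL 2
grep NCCL_MAJOR /usr/local/include/nccl.h          # prints the major version if the header defines it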

520jefferson commented 6 years ago

Thanks. When I use cmake .. -DUSE_NCCL=off, it builds successfully. But if I build like this, can I still use multiple GPUs to train or decode?

emjotde commented 6 years ago

Yes, we still have the old multi-GPU training code in there; it is used when NCCL is not available. I do however recommend installing NCCL 2.2; it can be quite a bit faster when the number of GPUs is 2, 4, or 8, and it can be slower when the number of GPUs is not a power of 2.

emjotde commented 6 years ago

Ah, and NCCL is currently only used with --sync-sgd.
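
For reference, a minimal sketch of a multi-GPU training invocation that exercises the NCCL path; the corpus, vocabulary, and model paths are placeholders and the remaining options are left at their defaults:

./build/marian --type transformer --sync-sgd --devices 0 1 2 3 --train-sets corpus.src corpus.trg --vocabs vocab.src.yml vocab.trg.yml --model model/model.npz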

520jefferson commented 6 years ago

Okay, many thanks, I will update to NCCL2 and try again.

emjotde commented 6 years ago

I will keep this issue open for now to remind myself to see if I can add version checking for NCCL.

520jefferson commented 6 years ago

@emjotde I ran into a problem when training with marian-dev-1.6; logs as follows:

[2018-08-12 13:53:21] [config] []
[2018-08-12 13:53:21] [config] type: transformer
[2018-08-12 13:53:21] [config] valid-freq: 10000
[2018-08-12 13:53:21] [config] valid-max-length: 1000
[2018-08-12 13:53:21] [config] valid-metrics:
[2018-08-12 13:53:21] [config] - cross-entropy
[2018-08-12 13:53:21] [config] valid-mini-batch: 32
[2018-08-12 13:53:21] [config] vocabs:
[2018-08-12 13:53:21] [config] - /work/nmt/corpus/train180623_CK/punctuate/k2c/res.kr.src.bpe_integrate.pkl.json
[2018-08-12 13:53:21] [config] - /work/nmt/corpus/train180623_CK/punctuate/k2c/res.ch.tar.bpe_integrate.pkl.json
[2018-08-12 13:53:21] [config] word-penalty: 0
[2018-08-12 13:53:21] [config] workspace: 17000
[2018-08-12 13:53:21] [data] Loading vocabulary from JSON/Yaml file /work/nmt/corpus/train180623_CK/punctuate/k2c/res.kr.src.bpe_integrate.pkl.json
[2018-08-12 13:53:21] [data] Using unused word id eos for 0
[2018-08-12 13:53:21] [data] Using unused word id UNK for 1
[2018-08-12 13:53:21] [data] Setting vocabulary size for input 0 to 42452
[2018-08-12 13:53:21] [data] Loading vocabulary from JSON/Yaml file /work/nmt/corpus/train180623_CK/punctuate/k2c/res.ch.tar.bpe_integrate.pkl.json
[2018-08-12 13:53:21] [data] Using unused word id eos for 0
[2018-08-12 13:53:21] [data] Using unused word id UNK for 1
[2018-08-12 13:53:21] [data] Setting vocabulary size for input 1 to 45541
[2018-08-12 13:53:21] [batching] Collecting statistics for batch fitting with step size 10
[2018-08-12 13:53:28] [memory] Extending reserved space to 17024 MB (device gpu4)
[2018-08-12 13:53:29] [memory] Extending reserved space to 17024 MB (device gpu5)
[2018-08-12 13:53:30] [memory] Extending reserved space to 17024 MB (device gpu6)
[2018-08-12 13:53:30] [memory] Extending reserved space to 17024 MB (device gpu7)
[2018-08-12 13:53:30] [comm] Using NCCL library for GPU communication
[2018-08-12 13:53:31] Requested shape shape=45541x512 size=23316992 for existing parameter 'Wemb' does not match original shape shape=42452x512 size=21735424
Aborted from marian::Expr marian::ExpressionGraph::param(const string&, const marian::Shape&, const NodeInitializer&, bool) in /wlj/marian-dev-1.6.0/src/graph/expression_graph.h: 311
Aborted (core dumped)

520jefferson commented 6 years ago

I started training under gdb and got this backtrace:

(gdb) bt
#0  0x00007fffe5a70428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007fffe5a7202a in __GI_abort () at abort.c:89
#2  0x00000000005e02df in marian::ExpressionGraph::param(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, marian::Shape const&, std::function<void (std::shared_ptr)> const&, bool) ()
#3  0x00000000005f1d9f in marian::EmbeddingFactory::construct() ()
#4  0x00000000005f3276 in marian::DecoderBase::embeddingsFromBatch(std::shared_ptr, std::shared_ptr, std::shared_ptr) ()
#5  0x00000000006091c6 in marian::EncoderDecoder::stepAll(std::shared_ptr, std::shared_ptr, bool) ()
#6  0x00000000005c824e in marian::models::EncoderDecoderCE::apply(std::shared_ptr, std::shared_ptr, std::shared_ptr, bool) ()
#7  0x000000000058cbf5 in marian::models::Trainer::build(std::shared_ptr, std::shared_ptr, bool) ()
#8  0x00000000004a66c2 in marian::GraphGroup::collectStats(std::shared_ptr, std::shared_ptr, unsigned long) ()
#9  0x00000000004ab6bf in std::thread::_Impl<std::_Bind_simple<marian::Train::run()::{lambda()#1} ()> >::_M_run() ()
#10 0x00007fffe63dc260 in ?? () from /opt/anaconda2/lib/libstdc++.so.6
#11 0x00007fffe9d266ba in start_thread (arg=0x7fffa4902700) at pthread_create.c:333
#12 0x00007fffe5b4241d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

emjotde commented 6 years ago

Are you trying to use --tied-embeddings-all with different vocabulary files? That won't work.

520jefferson commented 6 years ago

Yes, I am using --tied-embeddings-all with two different vocabulary files. How should I fix this? Merge the vocabulary files, or just match the vocab sizes?

emjotde commented 6 years ago

In order to tie source and target embeddings, source and target vocabulary must be the same file. You can use

cat source_text target_text | marian-vocab > common.vocab.yml
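
Then pass the same joint file on both sides of --vocabs; a sketch with illustrative paths:

./build/marian --tied-embeddings-all --train-sets corpus.src corpus.trg --vocabs common.vocab.yml common.vocab.yml --model model/model.npz
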
520jefferson commented 6 years ago

When I tie the vocab, it does work, so I'm a little confused. First, when I used marian-dev-1.2 with --tied-embeddings-all and two different vocabulary files, it trained normally. Did marian-dev-1.6 just change this logic? Second, --tied-embeddings-all shrinks the model size substantially; will that hurt the quality of the model?

emjotde commented 6 years ago

1) That should not be the case; as far as I know, tied-embeddings-all has always required a single common vocabulary from the start. Please review your old configuration, there should be no changes.
2) Embeddings are now shared between source and target, so there is only one embedding matrix; a smaller model size is naturally expected. As for quality, it depends. I always use this for languages with the same script, like German and English, and it is usually even a bit better. For languages with different scripts, like Chinese and English, I would not use it.
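
As a rough illustration of the size difference (a sketch using the vocabulary sizes logged above and a hypothetical 60k joint vocabulary): the two separate 512-dimensional embedding matrices already account for about 45M parameters, while a single tied matrix is about 31M and, with tied-embeddings-all, is also reused for the output layer.

echo $((42452*512 + 45541*512))   # 45052416 parameters in two separate embedding matrices
echo $((60000*512))               # 30720000 parameters in one shared matrix over a hypothetical 60k joint vocab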

520jefferson commented 6 years ago

I was training with --tied-embeddings-all and two different vocabularies under marian-dev-1.2, but with the vocab size limited to 30000. Training log as follows:

[2018-08-09 12:01:36] [config] after-batches: 20000000
[2018-08-09 12:01:36] [config] after-epochs: 0
[2018-08-09 12:01:36] [config] allow-unk: false
[2018-08-09 12:01:36] [config] batch-flexible-lr: false
[2018-08-09 12:01:36] [config] batch-normal-words: 1920
[2018-08-09 12:01:36] [config] beam-size: 6
[2018-08-09 12:01:36] [config] best-deep: false
[2018-08-09 12:01:36] [config] clip-norm: 5
[2018-08-09 12:01:36] [config] cost-type: ce-mean
[2018-08-09 12:01:36] [config] dec-cell: gru
[2018-08-09 12:01:36] [config] dec-cell-base-depth: 2
[2018-08-09 12:01:36] [config] dec-cell-high-depth: 1
[2018-08-09 12:01:36] [config] dec-depth: 6
[2018-08-09 12:01:36] [config] devices:
[2018-08-09 12:01:36] [config] - 0
[2018-08-09 12:01:36] [config] - 1
[2018-08-09 12:01:36] [config] - 2
[2018-08-09 12:01:36] [config] dim-emb: 512
[2018-08-09 12:01:36] [config] dim-rnn: 1024
[2018-08-09 12:01:36] [config] dim-vocabs:
[2018-08-09 12:01:36] [config] - 30000
[2018-08-09 12:01:36] [config] - 30000
[2018-08-09 12:01:36] [config] disp-freq: 500
[2018-08-09 12:01:36] [config] dropout-rnn: 0
[2018-08-09 12:01:36] [config] dropout-src: 0
[2018-08-09 12:01:36] [config] dropout-trg: 0
[2018-08-09 12:01:36] [config] early-stopping: 10
[2018-08-09 12:01:36] [config] embedding-fix-src: false
[2018-08-09 12:01:36] [config] embedding-fix-trg: false
[2018-08-09 12:01:36] [config] embedding-normalization: false
[2018-08-09 12:01:36] [config] enc-cell: gru
[2018-08-09 12:01:36] [config] enc-cell-depth: 1
[2018-08-09 12:01:36] [config] enc-depth: 6
[2018-08-09 12:01:36] [config] enc-type: bidirectional
[2018-08-09 12:01:36] [config] exponential-smoothing: 0.0001
[2018-08-09 12:01:36] [config] gradient-dropping: 0
[2018-08-09 12:01:36] [config] guided-alignment-cost: ce
[2018-08-09 12:01:36] [config] guided-alignment-weight: 1
[2018-08-09 12:01:36] [config] ignore-model-config: false
[2018-08-09 12:01:36] [config] keep-best: false
[2018-08-09 12:01:36] [config] label-smoothing: 0.1
[2018-08-09 12:01:36] [config] layer-normalization: false
[2018-08-09 12:01:36] [config] learn-rate: 0.0003
[2018-08-09 12:01:36] [config] log: train.log.180809.tf.deep
[2018-08-09 12:01:36] [config] log-level: info
[2018-08-09 12:01:36] [config] lr-decay: 0
[2018-08-09 12:01:36] [config] lr-decay-freq: 50000
[2018-08-09 12:01:36] [config] lr-decay-inv-sqrt: 16000
[2018-08-09 12:01:36] [config] lr-decay-repeat-warmup: false
[2018-08-09 12:01:36] [config] lr-decay-reset-optimizer: false
[2018-08-09 12:01:36] [config] lr-decay-start:
[2018-08-09 12:01:36] [config] - 10
[2018-08-09 12:01:36] [config] - 1
[2018-08-09 12:01:36] [config] lr-decay-strategy: epoch+stalled
[2018-08-09 12:01:36] [config] lr-report: true
[2018-08-09 12:01:36] [config] lr-warmup: 16000
[2018-08-09 12:01:36] [config] lr-warmup-at-reload: false
[2018-08-09 12:01:36] [config] lr-warmup-cycle: false
[2018-08-09 12:01:36] [config] lr-warmup-start-rate: 0
[2018-08-09 12:01:36] [config] max-length: 100
[2018-08-09 12:01:36] [config] max-length-crop: false
[2018-08-09 12:01:36] [config] maxi-batch: 1000
[2018-08-09 12:01:36] [config] maxi-batch-sort: trg
[2018-08-09 12:01:36] [config] mini-batch: 64
[2018-08-09 12:01:36] [config] mini-batch-fit: true
[2018-08-09 12:01:36] [config] mini-batch-words: 0
[2018-08-09 12:01:36] [config] model: models180809_transf/512-1024-zh-ko.npz
[2018-08-09 12:01:36] [config] n-best: false
[2018-08-09 12:01:36] [config] no-reload: false
[2018-08-09 12:01:36] [config] no-shuffle: false
[2018-08-09 12:01:36] [config] normalize: 0.6
[2018-08-09 12:01:36] [config] optimizer: adam
[2018-08-09 12:01:36] [config] optimizer-delay: 1
[2018-08-09 12:01:36] [config] optimizer-params:
[2018-08-09 12:01:36] [config] - 0.9
[2018-08-09 12:01:36] [config] - 0.98
[2018-08-09 12:01:36] [config] - 1e-09
[2018-08-09 12:01:36] [config] overwrite: false
[2018-08-09 12:01:36] [config] quiet: false
[2018-08-09 12:01:36] [config] quiet-translation: true
[2018-08-09 12:01:36] [config] relative-paths: false
[2018-08-09 12:01:36] [config] save-freq: 5000
[2018-08-09 12:01:36] [config] seed: 1111
[2018-08-09 12:01:36] [config] skip: false
[2018-08-09 12:01:36] [config] sync-sgd: true
[2018-08-09 12:01:36] [config] tempdir: /tmp
[2018-08-09 12:01:36] [config] tied-embeddings: false
[2018-08-09 12:01:36] [config] tied-embeddings-all: true
[2018-08-09 12:01:36] [config] tied-embeddings-src: false
[2018-08-09 12:01:36] [config] train-sets:
[2018-08-09 12:01:36] [config] - /work/nmt/corpus/train180623_CK/punctuate/c2k/res.ch.src.bpe_integrate
[2018-08-09 12:01:36] [config] - /work/nmt/corpus/train180623_CK/punctuate/c2k/res.kr.tar.bpe_integrate
[2018-08-09 12:01:36] [config] transformer-dim-ffn: 2048
[2018-08-09 12:01:36] [config] transformer-dropout: 0.1
[2018-08-09 12:01:36] [config] transformer-dropout-attention: 0
[2018-08-09 12:01:36] [config] transformer-heads: 8
[2018-08-09 12:01:36] [config] transformer-postprocess: dan
[2018-08-09 12:01:36] [config] transformer-postprocess-emb: d
[2018-08-09 12:01:36] [config] transformer-preprocess: ""
[2018-08-09 12:01:36] [config] type: transformer
[2018-08-09 12:01:36] [config] valid-freq: 10000
[2018-08-09 12:01:36] [config] valid-max-length: 1000
[2018-08-09 12:01:36] [config] valid-metrics:
[2018-08-09 12:01:36] [config] - cross-entropy
[2018-08-09 12:01:36] [config] valid-mini-batch: 32
[2018-08-09 12:01:36] [config] vocabs:
[2018-08-09 12:01:36] [config] - /work/nmt/corpus/train180623_CK/punctuate/c2k/res.ch.src.bpe_integrate.pkl.json
[2018-08-09 12:01:36] [config] - /work/nmt/corpus/train180623_CK/punctuate/c2k/res.kr.tar.bpe_integrate.pkl.json
[2018-08-09 12:01:36] [config] workspace: 18000
[2018-08-09 12:01:36] [data] Loading vocabulary from /work/nmt/corpus/train180623_CK/punctuate/c2k/res.ch.src.bpe_integrate.pkl.json
[2018-08-09 12:01:36] [data] Setting vocabulary size for input 0 to 30000
[2018-08-09 12:01:36] [data] Loading vocabulary from /work/nmt/corpus/train180623_CK/punctuate/c2k/res.kr.tar.bpe_integrate.pkl.json
[2018-08-09 12:01:37] [data] Setting vocabulary size for input 1 to 30000
[2018-08-09 12:01:37] [batching] Collecting statistics for batch fitting
[2018-08-09 12:01:44] [memory] Extending reserved space to 18432 MB (device 0)
[2018-08-09 12:01:44] [memory] Extending reserved space to 18432 MB (device 1)
[2018-08-09 12:01:45] [memory] Extending reserved space to 18432 MB (device 2)
[2018-08-09 12:01:45] [memory] Reserving 227 MB, device 0
[2018-08-09 12:01:46] [memory] Reserving 227 MB, device 0
[2018-08-09 12:11:59] [batching] Done
[2018-08-09 12:11:59] [memory] Extending reserved space to 18432 MB (device 0)
[2018-08-09 12:12:00] [memory] Extending reserved space to 18432 MB (device 1)
[2018-08-09 12:12:00] [memory] Extending reserved space to 18432 MB (device 2)
[2018-08-09 12:12:00] Training started
[2018-08-09 12:12:00] [data] Shuffling files
[2018-08-09 12:12:27] [data] Done

emjotde commented 6 years ago

That will work technically, but I do not think it is a good idea. With two different vocabulary files, different strings are mapped to the same id on the source and target side, so unrelated source and target tokens end up sharing one embedding vector. That may work to some extent, but you are definitely missing out on the advantage of having identical source and target words share a common embedding; instead, different strings share a common embedding, which seems wrong.