I think that might be an old version of NCCL; we require > 2.0. So you can either install a newer version from NVIDIA's website, or you can build without NCCL with
cmake .. -DUSE_NCCL=off
It seems the CMake module for NCCL does not check the version number; I will try to fix this soon.
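For reference, a minimal sketch of an out-of-source build without NCCL (directory layout and make options are just an example, not commands taken from this thread):

# from the marian-dev checkout
mkdir -p build && cd build
cmake .. -DUSE_NCCL=off
make -j4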
Thanks, when I use cmake .. -DUSE_NCCL=off I can build successfully. But if I build like this, can I still use multiple GPUs to train or decode?
Yes, we still have the old code for multi-GPU training in there; it is used when NCCL is not available. I do however recommend installing NCCL 2.2, as it can be quite a bit faster when the number of GPUs is 2, 4, or 8. It can be slower if the number of GPUs is not a power of 2.
Ah, and NCCL is currently only used with --sync-sgd.
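For illustration, a rough sketch of a multi-GPU training call that would actually go through NCCL; paths and hyper-parameters here are placeholders, not values from this thread:

./marian --type transformer \
  --train-sets corpus.src corpus.trg \
  --vocabs vocab.src.yml vocab.trg.yml \
  --devices 0 1 2 3 \
  --sync-sgd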
Okay, many thanks. I will update to NCCL 2 and try again.
I will keep this issue open for now to remind myself to see if I can add version checking for NCCL.
@emjotde I met a problem when I train with marian-dev-1.6, logs as follows:
[2018-08-12 13:53:21] [config] []
[2018-08-12 13:53:21] [config] type: transformer
[2018-08-12 13:53:21] [config] valid-freq: 10000
[2018-08-12 13:53:21] [config] valid-max-length: 1000
[2018-08-12 13:53:21] [config] valid-metrics:
[2018-08-12 13:53:21] [config] - cross-entropy
[2018-08-12 13:53:21] [config] valid-mini-batch: 32
[2018-08-12 13:53:21] [config] vocabs:
[2018-08-12 13:53:21] [config] - /work/nmt/corpus/train180623_CK/punctuate/k2c/res.kr.src.bpe_integrate.pkl.json
[2018-08-12 13:53:21] [config] - /work/nmt/corpus/train180623_CK/punctuate/k2c/res.ch.tar.bpe_integrate.pkl.json
[2018-08-12 13:53:21] [config] word-penalty: 0
[2018-08-12 13:53:21] [config] workspace: 17000
[2018-08-12 13:53:21] [data] Loading vocabulary from JSON/Yaml file /work/nmt/corpus/train180623_CK/punctuate/k2c/res.kr.src.bpe_integrate.pkl.json
[2018-08-12 13:53:21] [data] Using unused word id eos for 0
[2018-08-12 13:53:21] [data] Using unused word id UNK for 1
[2018-08-12 13:53:21] [data] Setting vocabulary size for input 0 to 42452
[2018-08-12 13:53:21] [data] Loading vocabulary from JSON/Yaml file /work/nmt/corpus/train180623_CK/punctuate/k2c/res.ch.tar.bpe_integrate.pkl.json
[2018-08-12 13:53:21] [data] Using unused word id eos for 0
[2018-08-12 13:53:21] [data] Using unused word id UNK for 1
[2018-08-12 13:53:21] [data] Setting vocabulary size for input 1 to 45541
[2018-08-12 13:53:21] [batching] Collecting statistics for batch fitting with step size 10
[2018-08-12 13:53:28] [memory] Extending reserved space to 17024 MB (device gpu4)
[2018-08-12 13:53:29] [memory] Extending reserved space to 17024 MB (device gpu5)
[2018-08-12 13:53:30] [memory] Extending reserved space to 17024 MB (device gpu6)
[2018-08-12 13:53:30] [memory] Extending reserved space to 17024 MB (device gpu7)
[2018-08-12 13:53:30] [comm] Using NCCL library for GPU communication
[2018-08-12 13:53:31] Requested shape shape=45541x512 size=23316992 for existing parameter 'Wemb' does not match original shape shape=42452x512 size=21735424
Aborted from marian::Expr marian::ExpressionGraph::param(const string&, const marian::Shape&, const NodeInitializer&, bool) in /wlj/marian-dev-1.6.0/src/graph/expression_graph.h: 311
Aborted (core dumped)
I started training with gdb, and then I get this:
(gdb) bt
Are you trying to use --tied-embeddings-all with different vocabulary files? That won't work.
Yes, I use --tied-embeddings-all with two different vocabulary files. How should I fix this? Merge the vocabulary files, or just match the vocab sizes?
In order to tie source and target embeddings, source and target vocabulary must be the same file. You can use
cat source_text target_text | marian-vocab > common.vocab.yml
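and then point both entries of --vocabs at the merged file, roughly like this (the file name and the elided options are placeholders):

./marian ... --tied-embeddings-all --vocabs common.vocab.yml common.vocab.yml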
When I tie the vocab, it does work. So I'm a little confused. First, when I use marian-dev-1.2 with --tied-embeddings-all and two different vocabulary files, it can train normally. So does marian-dev-1.6 just adjust the logic? Second, --tied-embeddings-all can reduce the model size substantially; will it harm the quality of the model?
1) That should not be the case; as far as I know, --tied-embeddings-all has always required a single common vocabulary from the start, so please review your old configuration. There should be no changes here. 2) Embeddings are now shared between source and target, so there is only one embedding matrix, hence a smaller size is naturally expected. As for quality, it depends. I always use this for languages with the same script, like German and English; usually it's even a bit better. For languages with different scripts, like Chinese and English, I would not use it.
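To make the size difference concrete, here is the arithmetic for the two separate embedding matrices from the error message above versus a single tied matrix (the merged vocabulary size is only an assumed example):

# separate 512-dim embeddings, one per side:
echo $(( 42452 * 512 ))   # source Wemb: 21735424 parameters
echo $(( 45541 * 512 ))   # target Wemb: 23316992 parameters
# With --tied-embeddings-all there is a single matrix over one common vocabulary,
# e.g. a hypothetical 50000-entry vocab gives 50000 * 512 = 25600000 parameters
# instead of roughly 45 million for the two separate matrices.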
I just train with --tied-embeddings-all and two different vocabularies, but I limit the vocab size to 30000 with marian-dev-1.2. The training log is as follows:
[2018-08-09 12:01:36] [config] after-batches: 20000000
[2018-08-09 12:01:36] [config] after-epochs: 0
[2018-08-09 12:01:36] [config] allow-unk: false
[2018-08-09 12:01:36] [config] batch-flexible-lr: false
[2018-08-09 12:01:36] [config] batch-normal-words: 1920
[2018-08-09 12:01:36] [config] beam-size: 6
[2018-08-09 12:01:36] [config] best-deep: false
[2018-08-09 12:01:36] [config] clip-norm: 5
[2018-08-09 12:01:36] [config] cost-type: ce-mean
[2018-08-09 12:01:36] [config] dec-cell: gru
[2018-08-09 12:01:36] [config] dec-cell-base-depth: 2
[2018-08-09 12:01:36] [config] dec-cell-high-depth: 1
[2018-08-09 12:01:36] [config] dec-depth: 6
[2018-08-09 12:01:36] [config] devices:
[2018-08-09 12:01:36] [config] - 0
[2018-08-09 12:01:36] [config] - 1
[2018-08-09 12:01:36] [config] - 2
[2018-08-09 12:01:36] [config] dim-emb: 512
[2018-08-09 12:01:36] [config] dim-rnn: 1024
[2018-08-09 12:01:36] [config] dim-vocabs:
[2018-08-09 12:01:36] [config] - 30000
[2018-08-09 12:01:36] [config] - 30000
[2018-08-09 12:01:36] [config] disp-freq: 500
[2018-08-09 12:01:36] [config] dropout-rnn: 0
[2018-08-09 12:01:36] [config] dropout-src: 0
[2018-08-09 12:01:36] [config] dropout-trg: 0
[2018-08-09 12:01:36] [config] early-stopping: 10
[2018-08-09 12:01:36] [config] embedding-fix-src: false
[2018-08-09 12:01:36] [config] embedding-fix-trg: false
[2018-08-09 12:01:36] [config] embedding-normalization: false
[2018-08-09 12:01:36] [config] enc-cell: gru
[2018-08-09 12:01:36] [config] enc-cell-depth: 1
[2018-08-09 12:01:36] [config] enc-depth: 6
[2018-08-09 12:01:36] [config] enc-type: bidirectional
[2018-08-09 12:01:36] [config] exponential-smoothing: 0.0001
[2018-08-09 12:01:36] [config] gradient-dropping: 0
[2018-08-09 12:01:36] [config] guided-alignment-cost: ce
[2018-08-09 12:01:36] [config] guided-alignment-weight: 1
[2018-08-09 12:01:36] [config] ignore-model-config: false
[2018-08-09 12:01:36] [config] keep-best: false
[2018-08-09 12:01:36] [config] label-smoothing: 0.1
[2018-08-09 12:01:36] [config] layer-normalization: false
[2018-08-09 12:01:36] [config] learn-rate: 0.0003
[2018-08-09 12:01:36] [config] log: train.log.180809.tf.deep
[2018-08-09 12:01:36] [config] log-level: info
[2018-08-09 12:01:36] [config] lr-decay: 0
[2018-08-09 12:01:36] [config] lr-decay-freq: 50000
[2018-08-09 12:01:36] [config] lr-decay-inv-sqrt: 16000
[2018-08-09 12:01:36] [config] lr-decay-repeat-warmup: false
[2018-08-09 12:01:36] [config] lr-decay-reset-optimizer: false
[2018-08-09 12:01:36] [config] lr-decay-start:
[2018-08-09 12:01:36] [config] - 10
[2018-08-09 12:01:36] [config] - 1
[2018-08-09 12:01:36] [config] lr-decay-strategy: epoch+stalled
[2018-08-09 12:01:36] [config] lr-report: true
[2018-08-09 12:01:36] [config] lr-warmup: 16000
[2018-08-09 12:01:36] [config] lr-warmup-at-reload: false
[2018-08-09 12:01:36] [config] lr-warmup-cycle: false
[2018-08-09 12:01:36] [config] lr-warmup-start-rate: 0
[2018-08-09 12:01:36] [config] max-length: 100
[2018-08-09 12:01:36] [config] max-length-crop: false
[2018-08-09 12:01:36] [config] maxi-batch: 1000
[2018-08-09 12:01:36] [config] maxi-batch-sort: trg
[2018-08-09 12:01:36] [config] mini-batch: 64
[2018-08-09 12:01:36] [config] mini-batch-fit: true
[2018-08-09 12:01:36] [config] mini-batch-words: 0
[2018-08-09 12:01:36] [config] model: models180809_transf/512-1024-zh-ko.npz
[2018-08-09 12:01:36] [config] n-best: false
[2018-08-09 12:01:36] [config] no-reload: false
[2018-08-09 12:01:36] [config] no-shuffle: false
[2018-08-09 12:01:36] [config] normalize: 0.6
[2018-08-09 12:01:36] [config] optimizer: adam
[2018-08-09 12:01:36] [config] optimizer-delay: 1
[2018-08-09 12:01:36] [config] optimizer-params:
[2018-08-09 12:01:36] [config] - 0.9
[2018-08-09 12:01:36] [config] - 0.98
[2018-08-09 12:01:36] [config] - 1e-09
[2018-08-09 12:01:36] [config] overwrite: false
[2018-08-09 12:01:36] [config] quiet: false
[2018-08-09 12:01:36] [config] quiet-translation: true
[2018-08-09 12:01:36] [config] relative-paths: false
[2018-08-09 12:01:36] [config] save-freq: 5000
[2018-08-09 12:01:36] [config] seed: 1111
[2018-08-09 12:01:36] [config] skip: false
[2018-08-09 12:01:36] [config] sync-sgd: true
[2018-08-09 12:01:36] [config] tempdir: /tmp
[2018-08-09 12:01:36] [config] tied-embeddings: false
[2018-08-09 12:01:36] [config] tied-embeddings-all: true
[2018-08-09 12:01:36] [config] tied-embeddings-src: false
[2018-08-09 12:01:36] [config] train-sets:
[2018-08-09 12:01:36] [config] - /work/nmt/corpus/train180623_CK/punctuate/c2k/res.ch.src.bpe_integrate
[2018-08-09 12:01:36] [config] - /work/nmt/corpus/train180623_CK/punctuate/c2k/res.kr.tar.bpe_integrate
[2018-08-09 12:01:36] [config] transformer-dim-ffn: 2048
[2018-08-09 12:01:36] [config] transformer-dropout: 0.1
[2018-08-09 12:01:36] [config] transformer-dropout-attention: 0
[2018-08-09 12:01:36] [config] transformer-heads: 8
[2018-08-09 12:01:36] [config] transformer-postprocess: dan
[2018-08-09 12:01:36] [config] transformer-postprocess-emb: d
[2018-08-09 12:01:36] [config] transformer-preprocess: ""
[2018-08-09 12:01:36] [config] type: transformer
[2018-08-09 12:01:36] [config] valid-freq: 10000
[2018-08-09 12:01:36] [config] valid-max-length: 1000
[2018-08-09 12:01:36] [config] valid-metrics:
[2018-08-09 12:01:36] [config] - cross-entropy
[2018-08-09 12:01:36] [config] valid-mini-batch: 32
[2018-08-09 12:01:36] [config] vocabs:
[2018-08-09 12:01:36] [config] - /work/nmt/corpus/train180623_CK/punctuate/c2k/res.ch.src.bpe_integrate.pkl.json
[2018-08-09 12:01:36] [config] - /work/nmt/corpus/train180623_CK/punctuate/c2k/res.kr.tar.bpe_integrate.pkl.json
[2018-08-09 12:01:36] [config] workspace: 18000
[2018-08-09 12:01:36] [data] Loading vocabulary from /work/nmt/corpus/train180623_CK/punctuate/c2k/res.ch.src.bpe_integrate.pkl.json
[2018-08-09 12:01:36] [data] Setting vocabulary size for input 0 to 30000
[2018-08-09 12:01:36] [data] Loading vocabulary from /work/nmt/corpus/train180623_CK/punctuate/c2k/res.kr.tar.bpe_integrate.pkl.json
[2018-08-09 12:01:37] [data] Setting vocabulary size for input 1 to 30000
[2018-08-09 12:01:37] [batching] Collecting statistics for batch fitting
[2018-08-09 12:01:44] [memory] Extending reserved space to 18432 MB (device 0)
[2018-08-09 12:01:44] [memory] Extending reserved space to 18432 MB (device 1)
[2018-08-09 12:01:45] [memory] Extending reserved space to 18432 MB (device 2)
[2018-08-09 12:01:45] [memory] Reserving 227 MB, device 0
[2018-08-09 12:01:46] [memory] Reserving 227 MB, device 0
[2018-08-09 12:11:59] [batching] Done
[2018-08-09 12:11:59] [memory] Extending reserved space to 18432 MB (device 0)
[2018-08-09 12:12:00] [memory] Extending reserved space to 18432 MB (device 1)
[2018-08-09 12:12:00] [memory] Extending reserved space to 18432 MB (device 2)
[2018-08-09 12:12:00] Training started
[2018-08-09 12:12:00] [data] Shuffling files
[2018-08-09 12:12:27] [data] Done
That will work technically, but I do not think it is a good idea. Different vocab strings will be mapped to the same id for source and target. That may work to some extent, but you are definitely missing out on the advantage of having identical source and target words share a common embedding. Instead, different strings share a common embedding. This seems wrong.
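A toy illustration of the problem (the entries are made up, not from the real vocabularies):

# source vocab: {"eos": 0, "UNK": 1, "하늘": 2, ...}
# target vocab: {"eos": 0, "UNK": 1, "天空": 2, ...}
# With tied embeddings both sides index the same matrix, so id 2 gives
# "하늘" and "天空" one shared embedding vector purely by accident of ordering.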
I built marian-1.6.0 and got this. When I run cmake .. it reports: -- Found NCCL (include: /usr/local/include, library: /usr/local/lib/libnccl.so), but I can't find ncclGroupStart in nccl.h. The specific info is as follows:
/marian-dev-1.6.0/src/training/communicator.cu(64): error: identifier "ncclGroupStart" is undefined
/marian-dev-1.6.0/src/training/communicator.cu(83): error: identifier "ncclGroupEnd" is undefined
/marian-dev-1.6.0/src/training/communicator.cu(94): error: identifier "ncclGroupStart" is undefined
/marian-dev-1.6.0/src/training/communicator.cu(103): error: argument of type "void *" is incompatible with parameter of type "int"
/marian-dev-1.6.0/src/training/communicator.cu(103): error: argument of type "int" is incompatible with parameter of type "ncclDataType_t"
/marian-dev-1.6.0/src/training/communicator.cu(103): error: argument of type "ncclDataType_t" is incompatible with parameter of type "void *"
/marian-dev-1.6.0/src/training/communicator.cu(108): error: identifier "ncclGroupEnd" is undefined
[ 67%] Building CXX object src/CMakeFiles/marian.dir/graph/node_operators.cpp.o
[ 68%] Building CXX object src/CMakeFiles/marian.dir/graph/node_initializers.cpp.o
[ 69%] Building CXX object src/CMakeFiles/marian.dir/layers/convolution.cpp.o
[ 69%] Building CXX object src/CMakeFiles/marian.dir/layers/loss.cpp.o
[ 70%] Building CXX object src/CMakeFiles/marian.dir/layers/weight.cpp.o
7 errors detected in the compilation of "/tmp/tmpxft_00005fa4_00000000-10_communicator.compute_61.cpp1.ii".
CMake Error at marian_cuda_generated_communicator.cu.o.cmake:262 (message):
Error generating file /marian-dev-1.6.0/build/src/CMakeFiles/marian_cuda.dir/training/./marian_cuda_generated_communicator.cu.o
src/CMakeFiles/marian_cuda.dir/build.make:147: recipe for target 'src/CMakeFiles/marian_cuda.dir/training/marian_cuda_generated_communicator.cu.o' failed
make[2]: *** [src/CMakeFiles/marian_cuda.dir/training/marian_cuda_generated_communicator.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
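One quick way to confirm that the header cmake picked up predates NCCL 2 (the path is taken from the cmake output above):

grep -c ncclGroupStart /usr/local/include/nccl.h
# 0 matches means the installed header is from NCCL 1.x; ncclGroupStart/ncclGroupEnd
# only exist in NCCL >= 2, which would explain the "is undefined" errors above.
# In that case, install NCCL 2 or rebuild with -DUSE_NCCL=off.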