marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

problem with workspace > 26000 #515

Open jorgtied opened 5 years ago

jorgtied commented 5 years ago

Marian throws an error when training with workspace > 26000 (tested on a V100 with 32GB memory):

[2019-10-22 17:37:55] Compiled without MPI support. Falling back to FakeMPIWrapper
[2019-10-22 17:37:55] [batching] Collecting statistics for batch fitting with step size 10
[2019-10-22 17:37:55] [memory] Extending reserved space to 27008 MB (device gpu0)
[2019-10-22 17:37:55] [comm] Using NCCL 2.4.2 for GPU communication
[2019-10-22 17:37:55] [comm] NCCLCommunicator constructed successfully.
[2019-10-22 17:37:55] [training] Using 1 GPUs
[2019-10-22 17:37:55] [logits] applyLossFunction() for 1 factors
[2019-10-22 17:37:55] [memory] Reserving 295 MB, device gpu0
[2019-10-22 17:37:55] [gpu] 16-bit TensorCores enabled for float32 matrix operations
[2019-10-22 17:37:55] [memory] Reserving 295 MB, device gpu0
[2019-10-22 17:37:57] Error: Labels not matching logits shape??
[2019-10-22 17:37:57] Error: Aborted from marian::Expr marian::Logits::applyLossFunction(const Words&, const std::function<std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase> > >(std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase> > >, std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase> > >)>&) const in /users/tiedeman/projappl/marian-dev/src/layers/generic.cpp:26

[CALL STACK]
[0x9b2c38]          marian::Logits::  applyLossFunction  (std::vector<marian::Word,std::allocator<marian::Word>> const&,  std::function<std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase>>> (std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase>>>,std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase>>>)> const&) const + 0x418
[0xcc704b]          marian::CrossEntropyLoss::  compute  (marian::Logits,  std::vector<marian::Word,std::allocator<marian::Word>> const&,  std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase>>>,  std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase>>>) + 0x5b
[0xcca18d]          marian::LabelwiseLoss::  apply  (marian::Logits,  std::vector<marian::Word,std::allocator<marian::Word>> const&,  std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase>>>,  std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase>>>) + 0x34d
[0xa742dc]          marian::models::EncoderDecoderCECost::  apply  (std::shared_ptr<marian::models::IModel>,  std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::Batch>,  bool) + 0x24c
[0x72d0b5]                                                            
[0x7fad8f]          marian::GraphGroup::  collectStats  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::models::ICriterionFunction>,  std::vector<std::shared_ptr<marian::Vocab>,std::allocator<std::shared_ptr<marian::Vocab>>> const&,  double) + 0xb5f
[0xad0d59]          marian::SyncGraphGroup::  collectStats  (std::vector<std::shared_ptr<marian::Vocab>,std::allocator<std::shared_ptr<marian::Vocab>>> const&) + 0x1a9
[0x8068cc]          marian::Train<marian::SyncGraphGroup>::  run  ()   + 0x30c
[0x72f489]          mainTrainer  (int,  char**)                        + 0x249
[0x706be5]          main                                               + 0x25
[0x2b9601bea545]    __libc_start_main                                  + 0xf5
[0x72c1ec]                                                            

This is compiled with boost 1.68 and gcc 8.3.0.

Other command line parameters (besides data, log files and word alignment for the guided alignment feature):


--mini-batch-fit -w 27000 --maxi-batch 500 --early-stopping 10 --valid-freq 10000 --save-freq 10000 --disp-freq 10000 --valid-metrics perplexity --valid-mini-batch 16 --beam-size 12 --normalize 1 --enc-depth 6 --dec-depth 6 --transformer-heads 8 --transformer-postprocess-emb d --transformer-postprocess dan --transformer-dropout 0.1 --label-smoothing 0.1 --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 --tied-embeddings-all --overwrite --keep-best --devices 0 --sync-sgd --seed 1111 --sqlite --exponential-smoothing
emjotde commented 5 years ago

This is interesting. I don't really have access to 32GB GPUs to test on right now; maybe in a few weeks.

emjotde commented 4 years ago

Have 32GB GPUs now, gonna work on this.

emjotde commented 4 years ago

@jorgtied Can you tell me the size of your parallel corpus in tokens and sentences?

jorgtied commented 4 years ago

One of the corpora that failed has 30 million sentence pairs, with 412 million tokens on one side and 507 million on the other (already split into subword units using SentencePiece).

emjotde commented 4 years ago

OK, this is caused by an overflow in shape::elements() due to using int instead of size_t or int64_t. Fixing this will take a while. I have run into problems with this in other places (e.g. training huge models), so it is now moving to the top of my TODO list. GPU memory is growing faster than I expected back in 2016 :)
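For illustration, a minimal standalone sketch of the failure mode (a simplified assumption, not Marian's actual Shape::elements() implementation): accumulating the element count in int overflows once the product of dimensions passes 2^31-1, while a 64-bit accumulator does not.

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-ins for a shape's element count; Marian's real
// Shape::elements() is more involved than this.
int elementsInt(const std::vector<int>& dims) {
  int n = 1;
  for (int d : dims) n *= d;  // overflows (UB, typically wraps negative) past 2^31-1
  return n;
}

int64_t elementsInt64(const std::vector<int>& dims) {
  int64_t n = 1;
  for (int d : dims) n *= d;  // safe for any realistic tensor size
  return n;
}
```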

jorgtied commented 4 years ago

Good to know that you found the problem. Would you have some kind of estimate when there could be some fix? I'd like to restart some big models with maximum memory usage soon again. Thanks!

emjotde commented 4 years ago

I have a branch that fixes this but currently runs about 20% slower (mjd/dimtype if you want to try it; I will make it available in a second). Unfortunately, this is very tricky to get both right and fast, as I need to change the computation type for shapes throughout the entire codebase.

If your models are large you might not actually run into that problem. This happens when the total product of vocabulary size times embedding dimension times words in a batch exceeds 2 billion. For large models your batch might never be that large.

emjotde commented 4 years ago

Branch mjd/dimtype should be available. A bit slower for now, as I replaced the types more or less blindly, which seems to come with a performance penalty in cases where it's not actually required.

kpu commented 4 years ago

IIRC GPUs don't have a native 64-bit int type, which is why you would see a penalty.

emjotde commented 4 years ago

Yeah, I figured. It's not too bad, as I now have at least a version where the front-end is 64-bit everywhere, and I only need to adapt the kernels to use int32 whenever possible. Seems doable.

emjotde commented 4 years ago

In most cases that's broken down into a product of threads times blocks anyway, so that should be easy enough. Threads times blocks times wrap-around actually, so even better.
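If "wrap-around" refers to a grid-stride loop (my reading, not confirmed in the thread), the decomposition could look like this host-side sketch: each factor — blocks, threads, and per-thread iterations — fits comfortably in 32 bits even when their product needs 64.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t total = 2621440000LL;  // element count that no longer fits in int32
  const int threads   = 512;           // threads per block (int32)
  const int blocks    = 65535;         // blocks per grid   (int32)

  // With a grid-stride loop each thread covers ceil(total / (blocks*threads))
  // elements, so every individual factor stays within 32-bit range even
  // though the overall product does not.
  const int64_t stride = int64_t(blocks) * threads;
  const int64_t iters  = (total + stride - 1) / stride;

  std::printf("blocks=%d threads=%d iterations/thread=%lld -> covers %lld elements\n",
              blocks, threads, (long long)iters, (long long)(stride * iters));
  return 0;
}
```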

fiqas commented 3 years ago

I have the same problem when trying to train on a GeForce RTX 3090 with a 20GB workspace:

[2021-05-12 17:03:15] Error: Labels not matching logits shape (2621440000 != -1673527296, shape=1x10x8192x32000 size=-1673527296)??

@snukky , I'm testing on surtr machine.

emjotde commented 3 years ago

This is a known issue. Your model dimensions are exceeding 32-bit integer sizes somewhere. I tried to fix it a while ago, but it resulted in a significant slowdown. Unfortunately, it's a lot of work to get right.
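The logged numbers are consistent with a 32-bit wrap-around; a quick standalone check (not Marian code) using the shape from the error message above:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Shape from the error: 1 x 10 x 8192 x 32000
  const int64_t size64 = 1LL * 10 * 8192 * 32000;       // 2621440000 labels
  const int32_t size32 = static_cast<int32_t>(size64);  // truncates modulo 2^32 on common platforms

  std::printf("64-bit size: %lld\n", (long long)size64);  // 2621440000
  std::printf("32-bit size: %d\n",   size32);             // -1673527296, as in the log
  return 0;
}
```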

emjotde commented 3 years ago

@snukky do you want to learn GPU programming? :)

fiqas commented 3 years ago

This exact model is student.tiny11 from @snukky, but those GPUs are untested, so maybe something is going on with them.

eu9ene commented 3 years ago

I have the same problem on a Quadro RTX 6000 GPU (24 GB) with a 21000 workspace and the above-mentioned student model. Teacher model training works fine.

Error: Labels not matching logits shape (2621440000 != -1673527296, shape=1x10x8192x32000 size=-1673527296)??

Setting workspace to 16000 solves the problem.

student model config:


[2021-07-16 00:53:52] [config] after: 0e
[2021-07-16 00:53:52] [config] after-batches: 0
[2021-07-16 00:53:52] [config] after-epochs: 0
[2021-07-16 00:53:52] [config] all-caps-every: 0
[2021-07-16 00:53:52] [config] allow-unk: false
[2021-07-16 00:53:52] [config] authors: false
[2021-07-16 00:53:52] [config] beam-size: 1
[2021-07-16 00:53:52] [config] bert-class-symbol: "[CLS]"
[2021-07-16 00:53:52] [config] bert-mask-symbol: "[MASK]"
[2021-07-16 00:53:52] [config] bert-masking-fraction: 0.15
[2021-07-16 00:53:52] [config] bert-sep-symbol: "[SEP]"
[2021-07-16 00:53:52] [config] bert-train-type-embeddings: true
[2021-07-16 00:53:52] [config] bert-type-vocab-size: 2
[2021-07-16 00:53:52] [config] build-info: ""
[2021-07-16 00:53:52] [config] cite: false
[2021-07-16 00:53:52] [config] clip-gemm: 0
[2021-07-16 00:53:52] [config] clip-norm: 0
[2021-07-16 00:53:52] [config] cost-scaling:
[2021-07-16 00:53:52] [config]   []
[2021-07-16 00:53:52] [config] cost-type: ce-mean-words
[2021-07-16 00:53:52] [config] cpu-threads: 0
[2021-07-16 00:53:52] [config] data-weighting: ""
[2021-07-16 00:53:52] [config] data-weighting-type: sentence
[2021-07-16 00:53:52] [config] dec-cell: ssru
[2021-07-16 00:53:52] [config] dec-cell-base-depth: 2
[2021-07-16 00:53:52] [config] dec-cell-high-depth: 1
[2021-07-16 00:53:52] [config] dec-depth: 2
[2021-07-16 00:53:52] [config] devices:
[2021-07-16 00:53:52] [config]   - 0
[2021-07-16 00:53:52] [config]   - 1
[2021-07-16 00:53:52] [config]   - 2
[2021-07-16 00:53:52] [config]   - 3
[2021-07-16 00:53:52] [config]   - 4
[2021-07-16 00:53:52] [config]   - 5
[2021-07-16 00:53:52] [config]   - 6
[2021-07-16 00:53:52] [config]   - 7
[2021-07-16 00:53:52] [config] dim-emb: 256
[2021-07-16 00:53:52] [config] dim-rnn: 1024
[2021-07-16 00:53:52] [config] dim-vocabs:
[2021-07-16 00:53:52] [config]   - 32000
[2021-07-16 00:53:52] [config]   - 32000
[2021-07-16 00:53:52] [config] disp-first: 10
[2021-07-16 00:53:52] [config] disp-freq: 1000
[2021-07-16 00:53:52] [config] disp-label-counts: true
[2021-07-16 00:53:52] [config] dropout-rnn: 0
[2021-07-16 00:53:52] [config] dropout-src: 0
[2021-07-16 00:53:52] [config] dropout-trg: 0
[2021-07-16 00:53:52] [config] dump-config: ""
[2021-07-16 00:53:52] [config] early-stopping: 20
[2021-07-16 00:53:52] [config] embedding-fix-src: false
[2021-07-16 00:53:52] [config] embedding-fix-trg: false
[2021-07-16 00:53:52] [config] embedding-normalization: false
[2021-07-16 00:53:52] [config] embedding-vectors:
[2021-07-16 00:53:53] [config]   []
[2021-07-16 00:53:53] [config] enc-cell: gru
[2021-07-16 00:53:53] [config] enc-cell-depth: 1
[2021-07-16 00:53:53] [config] enc-depth: 6
[2021-07-16 00:53:53] [config] enc-type: bidirectional
[2021-07-16 00:53:53] [config] english-title-case-every: 0
[2021-07-16 00:53:53] [config] exponential-smoothing: True
[2021-07-16 00:53:53] [config] factor-weight: 1
[2021-07-16 00:53:53] [config] grad-dropping-momentum: 0
[2021-07-16 00:53:53] [config] grad-dropping-rate: 0
[2021-07-16 00:53:53] [config] grad-dropping-warmup: 100
[2021-07-16 00:53:53] [config] gradient-checkpointing: false
[2021-07-16 00:53:53] [config] guided-alignment: /data/rw/home/bergamot-training/data/ru-en/allopus_bicleaner05/alignment/corpus.aln.gz
[2021-07-16 00:53:53] [config] guided-alignment-cost: mse
[2021-07-16 00:53:53] [config] guided-alignment-weight: 0.1
[2021-07-16 00:53:53] [config] ignore-model-config: false
[2021-07-16 00:53:53] [config] input-types:
[2021-07-16 00:53:53] [config]   []
[2021-07-16 00:53:53] [config] interpolate-env-vars: false
[2021-07-16 00:53:53] [config] keep-best: true
[2021-07-16 00:53:53] [config] label-smoothing: 0
[2021-07-16 00:53:53] [config] layer-normalization: false
[2021-07-16 00:53:53] [config] learn-rate: 0.0003
[2021-07-16 00:53:53] [config] lemma-dim-emb: 0
[2021-07-16 00:53:53] [config] log: /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/train.log
[2021-07-16 00:53:53] [config] log-level: info
[2021-07-16 00:53:53] [config] log-time-zone: ""
[2021-07-16 00:53:53] [config] logical-epoch:
[2021-07-16 00:53:53] [config]   - 1e
[2021-07-16 00:53:53] [config]   - 0
[2021-07-16 00:53:53] [config] lr-decay: 0
[2021-07-16 00:53:53] [config] lr-decay-freq: 50000
[2021-07-16 00:53:53] [config] lr-decay-inv-sqrt:
[2021-07-16 00:53:53] [config]   - 32000
[2021-07-16 00:53:53] [config] lr-decay-repeat-warmup: false
[2021-07-16 00:53:53] [config] lr-decay-reset-optimizer: false
[2021-07-16 00:53:53] [config] lr-decay-start:
[2021-07-16 00:53:53] [config]   - 10
[2021-07-16 00:53:53] [config]   - 1
[2021-07-16 00:53:53] [config] lr-decay-strategy: epoch+stalled
[2021-07-16 00:53:53] [config] lr-report: True
[2021-07-16 00:53:53] [config] lr-warmup: 16000
[2021-07-16 00:53:53] [config] lr-warmup-at-reload: false
[2021-07-16 00:53:53] [config] lr-warmup-cycle: false
[2021-07-16 00:53:53] [config] lr-warmup-start-rate: 0
[2021-07-16 00:53:53] [config] max-length: 200
[2021-07-16 00:53:53] [config] max-length-crop: false
[2021-07-16 00:53:53] [config] max-length-factor: 3
[2021-07-16 00:53:53] [config] maxi-batch: 1000
[2021-07-16 00:53:53] [config] maxi-batch-sort: trg
[2021-07-16 00:53:53] [config] mini-batch: 1000
[2021-07-16 00:53:53] [config] mini-batch-fit: True
[2021-07-16 00:53:53] [config] mini-batch-fit-step: 10
[2021-07-16 00:53:53] [config] mini-batch-track-lr: false
[2021-07-16 00:53:53] [config] mini-batch-warmup: 0
[2021-07-16 00:53:53] [config] mini-batch-words: 0
[2021-07-16 00:53:53] [config] mini-batch-words-ref: 0
[2021-07-16 00:53:53] [config] model: /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/model.npz
[2021-07-16 00:53:53] [config] multi-loss-type: sum
[2021-07-16 00:53:53] [config] multi-node: false
[2021-07-16 00:53:53] [config] multi-node-overlap: true
[2021-07-16 00:53:53] [config] n-best: false
[2021-07-16 00:53:53] [config] no-nccl: false
[2021-07-16 00:53:53] [config] no-reload: false
[2021-07-16 00:53:53] [config] no-restore-corpus: false
[2021-07-16 00:53:53] [config] normalize: 1
[2021-07-16 00:53:53] [config] normalize-gradient: false
[2021-07-16 00:53:53] [config] num-devices: 0
[2021-07-16 00:53:53] [config] optimizer: adam
[2021-07-16 00:53:53] [config] optimizer-delay: 2
[2021-07-16 00:53:53] [config] optimizer-params:
[2021-07-16 00:53:53] [config]   - 0.9
[2021-07-16 00:53:53] [config]   - 0.98
[2021-07-16 00:53:53] [config]   - 1e-09
[2021-07-16 00:53:53] [config] output-omit-bias: false
[2021-07-16 00:53:53] [config] overwrite: true
[2021-07-16 00:53:53] [config] precision:
[2021-07-16 00:53:53] [config]   - float32
[2021-07-16 00:53:53] [config]   - float32
[2021-07-16 00:53:53] [config]   - float32
[2021-07-16 00:53:53] [config] pretrained-model: ""
[2021-07-16 00:53:53] [config] quantize-biases: false
[2021-07-16 00:53:53] [config] quantize-bits: 0
[2021-07-16 00:53:53] [config] quantize-log-based: false
[2021-07-16 00:53:53] [config] quantize-optimization-steps: 0
[2021-07-16 00:53:53] [config] quiet: false
[2021-07-16 00:53:53] [config] quiet-translation: true
[2021-07-16 00:53:53] [config] relative-paths: false
[2021-07-16 00:53:53] [config] right-left: false
[2021-07-16 00:53:53] [config] save-freq: 5000
[2021-07-16 00:53:53] [config] seed: 0
[2021-07-16 00:53:53] [config] sentencepiece-alphas:
[2021-07-16 00:53:53] [config]   []
[2021-07-16 00:53:53] [config] sentencepiece-max-lines: 2000000
[2021-07-16 00:53:53] [config] sentencepiece-options: ""
[2021-07-16 00:53:53] [config] shuffle: data
[2021-07-16 00:53:53] [config] shuffle-in-ram: true
[2021-07-16 00:53:53] [config] sigterm: save-and-exit
[2021-07-16 00:53:53] [config] skip: false
[2021-07-16 00:53:53] [config] sqlite: ""
[2021-07-16 00:53:53] [config] sqlite-drop: false
[2021-07-16 00:53:53] [config] sync-sgd: true
[2021-07-16 00:53:53] [config] tempdir: /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/tmp
[2021-07-16 00:53:53] [config] tied-embeddings: false
[2021-07-16 00:53:53] [config] tied-embeddings-all: true
[2021-07-16 00:53:53] [config] tied-embeddings-src: false
[2021-07-16 00:53:53] [config] train-embedder-rank:
[2021-07-16 00:53:53] [config]   []
[2021-07-16 00:53:53] [config] train-sets:
[2021-07-16 00:53:53] [config]   - /data/rw/home/bergamot-training/data/ru-en/allopus_bicleaner05/filtered/corpus.ru.gz
[2021-07-16 00:53:53] [config]   - /data/rw/home/bergamot-training/data/ru-en/allopus_bicleaner05/filtered/corpus.en.gz
[2021-07-16 00:53:53] [config] transformer-aan-activation: swish
[2021-07-16 00:53:53] [config] transformer-aan-depth: 2
[2021-07-16 00:53:53] [config] transformer-aan-nogate: false
[2021-07-16 00:53:53] [config] transformer-decoder-autoreg: rnn
[2021-07-16 00:53:53] [config] transformer-depth-scaling: false
[2021-07-16 00:53:53] [config] transformer-dim-aan: 2048
[2021-07-16 00:53:53] [config] transformer-dim-ffn: 1536
[2021-07-16 00:53:53] [config] transformer-dropout: 0
[2021-07-16 00:53:53] [config] transformer-dropout-attention: 0
[2021-07-16 00:53:53] [config] transformer-dropout-ffn: 0
[2021-07-16 00:53:53] [config] transformer-ffn-activation: relu
[2021-07-16 00:53:53] [config] transformer-ffn-depth: 2
[2021-07-16 00:53:53] [config] transformer-guided-alignment-layer: last
[2021-07-16 00:53:53] [config] transformer-heads: 8
[2021-07-16 00:53:53] [config] transformer-no-projection: false
[2021-07-16 00:53:53] [config] transformer-pool: false
[2021-07-16 00:53:53] [config] transformer-postprocess: dan
[2021-07-16 00:53:53] [config] transformer-postprocess-emb: d
[2021-07-16 00:53:53] [config] transformer-postprocess-top: ""
[2021-07-16 00:53:53] [config] transformer-preprocess: ""
[2021-07-16 00:53:53] [config] transformer-tied-layers:
[2021-07-16 00:53:53] [config]   []
[2021-07-16 00:53:53] [config] transformer-train-position-embeddings: false
[2021-07-16 00:53:53] [config] tsv: false
[2021-07-16 00:53:53] [config] tsv-fields: 0
[2021-07-16 00:53:53] [config] type: transformer
[2021-07-16 00:53:53] [config] ulr: false
[2021-07-16 00:53:53] [config] ulr-dim-emb: 0
[2021-07-16 00:53:53] [config] ulr-dropout: 0
[2021-07-16 00:53:53] [config] ulr-keys-vectors: ""
[2021-07-16 00:53:53] [config] ulr-query-vectors: ""
[2021-07-16 00:53:53] [config] ulr-softmax-temperature: 1
[2021-07-16 00:53:53] [config] ulr-trainable-transformation: false
[2021-07-16 00:53:53] [config] unlikelihood-loss: false
[2021-07-16 00:53:53] [config] valid-freq: 5000
[2021-07-16 00:53:53] [config] valid-log: /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/valid.log
[2021-07-16 00:53:53] [config] valid-max-length: 1000
[2021-07-16 00:53:53] [config] valid-metrics:
[2021-07-16 00:53:53] [config]   - bleu-detok
[2021-07-16 00:53:53] [config]   - ce-mean-words
[2021-07-16 00:53:53] [config]   - perplexity
[2021-07-16 00:53:53] [config] valid-mini-batch: 64
[2021-07-16 00:53:53] [config] valid-reset-stalled: false
[2021-07-16 00:53:53] [config] valid-script-args:
[2021-07-16 00:53:53] [config]   []
[2021-07-16 00:53:53] [config] valid-script-path: ""
[2021-07-16 00:53:53] [config] valid-sets:
[2021-07-16 00:53:53] [config]   - /data/rw/home/bergamot-training/data/ru-en/allopus_bicleaner05/original/devset.ru.gz
[2021-07-16 00:53:53] [config]   - /data/rw/home/bergamot-training/data/ru-en/allopus_bicleaner05/original/devset.en.gz
[2021-07-16 00:53:53] [config] valid-translation-output: /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/devset.out
[2021-07-16 00:53:53] [config] vocabs:
[2021-07-16 00:53:53] [config]   - /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/vocab.spm
[2021-07-16 00:53:53] [config]   - /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/vocab.spm
[2021-07-16 00:53:53] [config] word-penalty: 0
[2021-07-16 00:53:53] [config] word-scores: false
[2021-07-16 00:53:53] [config] workspace: 16000
[2021-07-16 00:53:53] [config] Model is being created with Marian v1.9.56 94aeaa46 2021-04-28 00:28:35 +0100
alvations commented 2 years ago

Can confirm that I'm also having the same issue with workspace > 20GB, but on an RTX A6000 GPU. I'm using a Lambda Labs instance from https://lambdalabs.com/service/gpu-cloud

sukuya commented 1 year ago

Can confirm the same on an A100 with a workspace of 70000, using Marian 1.11.0.

JOHW85 commented 1 year ago

Seems to be fixed in 1.12.0. I'm able to run with a 41644 workspace (with fp16) on RTX A6000s. I used to have problems with fp16 (but was fine with fp32) prior to this version.