marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

problem with workspace > 26000 #515

Open jorgtied opened 5 years ago

jorgtied commented 5 years ago

Marian throws an error when training with workspace > 26000 (tested on a V100 with 32GB memory):

[2019-10-22 17:37:55] Compiled without MPI support. Falling back to FakeMPIWrapper
[2019-10-22 17:37:55] [batching] Collecting statistics for batch fitting with step size 10
[2019-10-22 17:37:55] [memory] Extending reserved space to 27008 MB (device gpu0)
[2019-10-22 17:37:55] [comm] Using NCCL 2.4.2 for GPU communication
[2019-10-22 17:37:55] [comm] NCCLCommunicator constructed successfully.
[2019-10-22 17:37:55] [training] Using 1 GPUs
[2019-10-22 17:37:55] [logits] applyLossFunction() for 1 factors
[2019-10-22 17:37:55] [memory] Reserving 295 MB, device gpu0
[2019-10-22 17:37:55] [gpu] 16-bit TensorCores enabled for float32 matrix operations
[2019-10-22 17:37:55] [memory] Reserving 295 MB, device gpu0
[2019-10-22 17:37:57] Error: Labels not matching logits shape??
[2019-10-22 17:37:57] Error: Aborted from marian::Expr marian::Logits::applyLossFunction(const Words&, const std::function<std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase> > >(std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase> > >, std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase> > >)>&) const in /users/tiedeman/projappl/marian-dev/src/layers/generic.cpp:26

[CALL STACK]
[0x9b2c38]          marian::Logits::  applyLossFunction  (std::vector<marian::Word,std::allocator<marian::Word>> const&,  std::function<std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase>>> (std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase>>>,std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase>>>)> const&) const + 0x418
[0xcc704b]          marian::CrossEntropyLoss::  compute  (marian::Logits,  std::vector<marian::Word,std::allocator<marian::Word>> const&,  std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase>>>,  std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase>>>) + 0x5b
[0xcca18d]          marian::LabelwiseLoss::  apply  (marian::Logits,  std::vector<marian::Word,std::allocator<marian::Word>> const&,  std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase>>>,  std::shared_ptr<marian::Chainable<std::shared_ptr<marian::TensorBase>>>) + 0x34d
[0xa742dc]          marian::models::EncoderDecoderCECost::  apply  (std::shared_ptr<marian::models::IModel>,  std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::Batch>,  bool) + 0x24c
[0x72d0b5]                                                            
[0x7fad8f]          marian::GraphGroup::  collectStats  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::models::ICriterionFunction>,  std::vector<std::shared_ptr<marian::Vocab>,std::allocator<std::shared_ptr<marian::Vocab>>> const&,  double) + 0xb5f
[0xad0d59]          marian::SyncGraphGroup::  collectStats  (std::vector<std::shared_ptr<marian::Vocab>,std::allocator<std::shared_ptr<marian::Vocab>>> const&) + 0x1a9
[0x8068cc]          marian::Train<marian::SyncGraphGroup>::  run  ()   + 0x30c
[0x72f489]          mainTrainer  (int,  char**)                        + 0x249
[0x706be5]          main                                               + 0x25
[0x2b9601bea545]    __libc_start_main                                  + 0xf5
[0x72c1ec]                                                            

This is compiled with boost 1.68 and gcc 8.3.0.

Other command line parameters (besides data, log files and word alignment for the guided alignment feature):


--mini-batch-fit -w 27000 --maxi-batch 500 --early-stopping 10 --valid-freq 10000 --save-freq 10000 --disp-freq 10000 --valid-metrics perplexity --valid-mini-batch 16 --beam-size 12 --normalize 1 --enc-depth 6 --dec-depth 6 --transformer-heads 8 --transformer-postprocess-emb d --transformer-postprocess dan --transformer-dropout 0.1 --label-smoothing 0.1 --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 --tied-embeddings-all --overwrite --keep-best --devices 0 --sync-sgd --seed 1111 --sqlite --exponential-smoothing
emjotde commented 5 years ago

This is interesting. I don't really have access to 32GB GPUs to test on right now; maybe in a few weeks.

emjotde commented 4 years ago

Have 32GB GPUs now, gonna work on this.

emjotde commented 4 years ago

@jorgtied Can you tell me the size of your parallel corpus in tokens and sentences?

jorgtied commented 4 years ago

One of the corpora that failed has 30 million sentence pairs, with 412 million tokens on one side and 507 million on the other (already split into subword units using SentencePiece).

emjotde commented 4 years ago

OK, this is caused by an overflow in shape::elements() due to using int instead of size_t or int64_t. Fixing this will take a while. I have run into problems with this in other places (e.g. training huge models), so it is now moving to the top of my TODO list. GPU memory is growing faster than I expected back in 2016 :)
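For illustration, a minimal standalone sketch of the failure mode (a simplified assumption, not Marian's actual Shape::elements() implementation): accumulating the element count in int overflows once the product of dimensions passes 2^31-1, while a 64-bit accumulator does not.

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-ins for a shape's element count; Marian's real
// Shape::elements() is more involved than this.
int elementsInt(const std::vector<int>& dims) {
  int n = 1;
  for (int d : dims) n *= d;  // overflows (UB, typically wraps negative) past 2^31-1
  return n;
}

int64_t elementsInt64(const std::vector<int>& dims) {
  int64_t n = 1;
  for (int d : dims) n *= d;  // safe for any realistic tensor size
  return n;
}
```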

jorgtied commented 4 years ago

Good to know that you found the problem. Would you have some kind of estimate when there could be some fix? I'd like to restart some big models with maximum memory usage soon again. Thanks!

emjotde commented 4 years ago

I have a branch that fixes this but currently runs about 20% slower (mjd/dimtype if you want to try it; I will make it available in a second). Unfortunately, this is very tricky to get both right and fast, as I need to change the computation type for shapes throughout the entire codebase.

If your models are large you might not actually run into that problem. This happens when the total product of vocabulary size times embedding dimension times words in a batch exceeds 2 billion. For large models your batch might never be that large.

emjotde commented 4 years ago

Branch mjd/dimtype should be available. A bit slower for now, as I replaced the types more or less blindly, which seems to come with a performance penalty in cases where it's not actually required.

kpu commented 4 years ago

IIRC GPUs don't have a native 64-bit int type, which is why you would see a penalty.

emjotde commented 4 years ago

Yeah, I figured. It's not too bad, as I now have at least a version where the front-end is 64-bit everywhere, and I only need to adapt the kernels to use int32 whenever possible. Seems doable.

emjotde commented 4 years ago

In most cases that's broken down into a product of threads times blocks anyway, so that should be easy enough. Threads times blocks times wrap-around actually, so even better.
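If "wrap-around" refers to a grid-stride loop (my reading, not confirmed in the thread), the decomposition could look like this host-side sketch: each factor — blocks, threads, and per-thread iterations — fits comfortably in 32 bits even when their product needs 64.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t total = 2621440000LL;  // element count that no longer fits in int32
  const int threads   = 512;           // threads per block (int32)
  const int blocks    = 65535;         // blocks per grid   (int32)

  // With a grid-stride loop each thread covers ceil(total / (blocks*threads))
  // elements, so every individual factor stays within 32-bit range even
  // though the overall product does not.
  const int64_t stride = int64_t(blocks) * threads;
  const int64_t iters  = (total + stride - 1) / stride;

  std::printf("blocks=%d threads=%d iterations/thread=%lld -> covers %lld elements\n",
              blocks, threads, (long long)iters, (long long)(stride * iters));
  return 0;
}
```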

fiqas commented 3 years ago

I have the same problem when trying to train on a GeForce RTX 3090 with a 20GB workspace:

[2021-05-12 17:03:15] Error: Labels not matching logits shape (2621440000 != -1673527296, shape=1x10x8192x32000 size=-1673527296)??

@snukky , I'm testing on surtr machine.

emjotde commented 3 years ago

This is a known issue. Your model dimensions are exceeding 32-bit integer sizes somewhere. I tried to fix it a while ago, but it resulted in a significant slowdown. Unfortunately, it's a lot of work to get right.
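The logged numbers are consistent with a 32-bit wrap-around; a quick standalone check (not Marian code) using the shape from the error message above:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Shape from the error: 1 x 10 x 8192 x 32000
  const int64_t size64 = 1LL * 10 * 8192 * 32000;       // 2621440000 labels
  const int32_t size32 = static_cast<int32_t>(size64);  // truncates modulo 2^32 on common platforms

  std::printf("64-bit size: %lld\n", (long long)size64);  // 2621440000
  std::printf("32-bit size: %d\n",   size32);             // -1673527296, as in the log
  return 0;
}
```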

emjotde commented 3 years ago

@snukky do you want to learn GPU programming? :)

fiqas commented 3 years ago

This exact model is student.tiny11 from @snukky, but those GPUs are untested, so maybe something is going on with them.

eu9ene commented 3 years ago

I have the same problem on a Quadro RTX 6000 GPU (24 GB) with a 21000 workspace and the above-mentioned student model. Teacher model training works fine.

Error: Labels not matching logits shape (2621440000 != -1673527296, shape=1x10x8192x32000 size=-1673527296)??

Setting workspace to 16000 solves the problem.

student model config:


[2021-07-16 00:53:52] [config] after: 0e
[2021-07-16 00:53:52] [config] after-batches: 0
[2021-07-16 00:53:52] [config] after-epochs: 0
[2021-07-16 00:53:52] [config] all-caps-every: 0
[2021-07-16 00:53:52] [config] allow-unk: false
[2021-07-16 00:53:52] [config] authors: false
[2021-07-16 00:53:52] [config] beam-size: 1
[2021-07-16 00:53:52] [config] bert-class-symbol: "[CLS]"
[2021-07-16 00:53:52] [config] bert-mask-symbol: "[MASK]"
[2021-07-16 00:53:52] [config] bert-masking-fraction: 0.15
[2021-07-16 00:53:52] [config] bert-sep-symbol: "[SEP]"
[2021-07-16 00:53:52] [config] bert-train-type-embeddings: true
[2021-07-16 00:53:52] [config] bert-type-vocab-size: 2
[2021-07-16 00:53:52] [config] build-info: ""
[2021-07-16 00:53:52] [config] cite: false
[2021-07-16 00:53:52] [config] clip-gemm: 0
[2021-07-16 00:53:52] [config] clip-norm: 0
[2021-07-16 00:53:52] [config] cost-scaling:
[2021-07-16 00:53:52] [config]   []
[2021-07-16 00:53:52] [config] cost-type: ce-mean-words
[2021-07-16 00:53:52] [config] cpu-threads: 0
[2021-07-16 00:53:52] [config] data-weighting: ""
[2021-07-16 00:53:52] [config] data-weighting-type: sentence
[2021-07-16 00:53:52] [config] dec-cell: ssru
[2021-07-16 00:53:52] [config] dec-cell-base-depth: 2
[2021-07-16 00:53:52] [config] dec-cell-high-depth: 1
[2021-07-16 00:53:52] [config] dec-depth: 2
[2021-07-16 00:53:52] [config] devices:
[2021-07-16 00:53:52] [config]   - 0
[2021-07-16 00:53:52] [config]   - 1
[2021-07-16 00:53:52] [config]   - 2
[2021-07-16 00:53:52] [config]   - 3
[2021-07-16 00:53:52] [config]   - 4
[2021-07-16 00:53:52] [config]   - 5
[2021-07-16 00:53:52] [config]   - 6
[2021-07-16 00:53:52] [config]   - 7
[2021-07-16 00:53:52] [config] dim-emb: 256
[2021-07-16 00:53:52] [config] dim-rnn: 1024
[2021-07-16 00:53:52] [config] dim-vocabs:
[2021-07-16 00:53:52] [config]   - 32000
[2021-07-16 00:53:52] [config]   - 32000
[2021-07-16 00:53:52] [config] disp-first: 10
[2021-07-16 00:53:52] [config] disp-freq: 1000
[2021-07-16 00:53:52] [config] disp-label-counts: true
[2021-07-16 00:53:52] [config] dropout-rnn: 0
[2021-07-16 00:53:52] [config] dropout-src: 0
[2021-07-16 00:53:52] [config] dropout-trg: 0
[2021-07-16 00:53:52] [config] dump-config: ""
[2021-07-16 00:53:52] [config] early-stopping: 20
[2021-07-16 00:53:52] [config] embedding-fix-src: false
[2021-07-16 00:53:52] [config] embedding-fix-trg: false
[2021-07-16 00:53:52] [config] embedding-normalization: false
[2021-07-16 00:53:52] [config] embedding-vectors:
[2021-07-16 00:53:53] [config]   []
[2021-07-16 00:53:53] [config] enc-cell: gru
[2021-07-16 00:53:53] [config] enc-cell-depth: 1
[2021-07-16 00:53:53] [config] enc-depth: 6
[2021-07-16 00:53:53] [config] enc-type: bidirectional
[2021-07-16 00:53:53] [config] english-title-case-every: 0
[2021-07-16 00:53:53] [config] exponential-smoothing: True
[2021-07-16 00:53:53] [config] factor-weight: 1
[2021-07-16 00:53:53] [config] grad-dropping-momentum: 0
[2021-07-16 00:53:53] [config] grad-dropping-rate: 0
[2021-07-16 00:53:53] [config] grad-dropping-warmup: 100
[2021-07-16 00:53:53] [config] gradient-checkpointing: false
[2021-07-16 00:53:53] [config] guided-alignment: /data/rw/home/bergamot-training/data/ru-en/allopus_bicleaner05/alignment/corpus.aln.gz
[2021-07-16 00:53:53] [config] guided-alignment-cost: mse
[2021-07-16 00:53:53] [config] guided-alignment-weight: 0.1
[2021-07-16 00:53:53] [config] ignore-model-config: false
[2021-07-16 00:53:53] [config] input-types:
[2021-07-16 00:53:53] [config]   []
[2021-07-16 00:53:53] [config] interpolate-env-vars: false
[2021-07-16 00:53:53] [config] keep-best: true
[2021-07-16 00:53:53] [config] label-smoothing: 0
[2021-07-16 00:53:53] [config] layer-normalization: false
[2021-07-16 00:53:53] [config] learn-rate: 0.0003
[2021-07-16 00:53:53] [config] lemma-dim-emb: 0
[2021-07-16 00:53:53] [config] log: /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/train.log
[2021-07-16 00:53:53] [config] log-level: info
[2021-07-16 00:53:53] [config] log-time-zone: ""
[2021-07-16 00:53:53] [config] logical-epoch:
[2021-07-16 00:53:53] [config]   - 1e
[2021-07-16 00:53:53] [config]   - 0
[2021-07-16 00:53:53] [config] lr-decay: 0
[2021-07-16 00:53:53] [config] lr-decay-freq: 50000
[2021-07-16 00:53:53] [config] lr-decay-inv-sqrt:
[2021-07-16 00:53:53] [config]   - 32000
[2021-07-16 00:53:53] [config] lr-decay-repeat-warmup: false
[2021-07-16 00:53:53] [config] lr-decay-reset-optimizer: false
[2021-07-16 00:53:53] [config] lr-decay-start:
[2021-07-16 00:53:53] [config]   - 10
[2021-07-16 00:53:53] [config]   - 1
[2021-07-16 00:53:53] [config] lr-decay-strategy: epoch+stalled
[2021-07-16 00:53:53] [config] lr-report: True
[2021-07-16 00:53:53] [config] lr-warmup: 16000
[2021-07-16 00:53:53] [config] lr-warmup-at-reload: false
[2021-07-16 00:53:53] [config] lr-warmup-cycle: false
[2021-07-16 00:53:53] [config] lr-warmup-start-rate: 0
[2021-07-16 00:53:53] [config] max-length: 200
[2021-07-16 00:53:53] [config] max-length-crop: false
[2021-07-16 00:53:53] [config] max-length-factor: 3
[2021-07-16 00:53:53] [config] maxi-batch: 1000
[2021-07-16 00:53:53] [config] maxi-batch-sort: trg
[2021-07-16 00:53:53] [config] mini-batch: 1000
[2021-07-16 00:53:53] [config] mini-batch-fit: True
[2021-07-16 00:53:53] [config] mini-batch-fit-step: 10
[2021-07-16 00:53:53] [config] mini-batch-track-lr: false
[2021-07-16 00:53:53] [config] mini-batch-warmup: 0
[2021-07-16 00:53:53] [config] mini-batch-words: 0
[2021-07-16 00:53:53] [config] mini-batch-words-ref: 0
[2021-07-16 00:53:53] [config] model: /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/model.npz
[2021-07-16 00:53:53] [config] multi-loss-type: sum
[2021-07-16 00:53:53] [config] multi-node: false
[2021-07-16 00:53:53] [config] multi-node-overlap: true
[2021-07-16 00:53:53] [config] n-best: false
[2021-07-16 00:53:53] [config] no-nccl: false
[2021-07-16 00:53:53] [config] no-reload: false
[2021-07-16 00:53:53] [config] no-restore-corpus: false
[2021-07-16 00:53:53] [config] normalize: 1
[2021-07-16 00:53:53] [config] normalize-gradient: false
[2021-07-16 00:53:53] [config] num-devices: 0
[2021-07-16 00:53:53] [config] optimizer: adam
[2021-07-16 00:53:53] [config] optimizer-delay: 2
[2021-07-16 00:53:53] [config] optimizer-params:
[2021-07-16 00:53:53] [config]   - 0.9
[2021-07-16 00:53:53] [config]   - 0.98
[2021-07-16 00:53:53] [config]   - 1e-09
[2021-07-16 00:53:53] [config] output-omit-bias: false
[2021-07-16 00:53:53] [config] overwrite: true
[2021-07-16 00:53:53] [config] precision:
[2021-07-16 00:53:53] [config]   - float32
[2021-07-16 00:53:53] [config]   - float32
[2021-07-16 00:53:53] [config]   - float32
[2021-07-16 00:53:53] [config] pretrained-model: ""
[2021-07-16 00:53:53] [config] quantize-biases: false
[2021-07-16 00:53:53] [config] quantize-bits: 0
[2021-07-16 00:53:53] [config] quantize-log-based: false
[2021-07-16 00:53:53] [config] quantize-optimization-steps: 0
[2021-07-16 00:53:53] [config] quiet: false
[2021-07-16 00:53:53] [config] quiet-translation: true
[2021-07-16 00:53:53] [config] relative-paths: false
[2021-07-16 00:53:53] [config] right-left: false
[2021-07-16 00:53:53] [config] save-freq: 5000
[2021-07-16 00:53:53] [config] seed: 0
[2021-07-16 00:53:53] [config] sentencepiece-alphas:
[2021-07-16 00:53:53] [config]   []
[2021-07-16 00:53:53] [config] sentencepiece-max-lines: 2000000
[2021-07-16 00:53:53] [config] sentencepiece-options: ""
[2021-07-16 00:53:53] [config] shuffle: data
[2021-07-16 00:53:53] [config] shuffle-in-ram: true
[2021-07-16 00:53:53] [config] sigterm: save-and-exit
[2021-07-16 00:53:53] [config] skip: false
[2021-07-16 00:53:53] [config] sqlite: ""
[2021-07-16 00:53:53] [config] sqlite-drop: false
[2021-07-16 00:53:53] [config] sync-sgd: true
[2021-07-16 00:53:53] [config] tempdir: /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/tmp
[2021-07-16 00:53:53] [config] tied-embeddings: false
[2021-07-16 00:53:53] [config] tied-embeddings-all: true
[2021-07-16 00:53:53] [config] tied-embeddings-src: false
[2021-07-16 00:53:53] [config] train-embedder-rank:
[2021-07-16 00:53:53] [config]   []
[2021-07-16 00:53:53] [config] train-sets:
[2021-07-16 00:53:53] [config]   - /data/rw/home/bergamot-training/data/ru-en/allopus_bicleaner05/filtered/corpus.ru.gz
[2021-07-16 00:53:53] [config]   - /data/rw/home/bergamot-training/data/ru-en/allopus_bicleaner05/filtered/corpus.en.gz
[2021-07-16 00:53:53] [config] transformer-aan-activation: swish
[2021-07-16 00:53:53] [config] transformer-aan-depth: 2
[2021-07-16 00:53:53] [config] transformer-aan-nogate: false
[2021-07-16 00:53:53] [config] transformer-decoder-autoreg: rnn
[2021-07-16 00:53:53] [config] transformer-depth-scaling: false
[2021-07-16 00:53:53] [config] transformer-dim-aan: 2048
[2021-07-16 00:53:53] [config] transformer-dim-ffn: 1536
[2021-07-16 00:53:53] [config] transformer-dropout: 0
[2021-07-16 00:53:53] [config] transformer-dropout-attention: 0
[2021-07-16 00:53:53] [config] transformer-dropout-ffn: 0
[2021-07-16 00:53:53] [config] transformer-ffn-activation: relu
[2021-07-16 00:53:53] [config] transformer-ffn-depth: 2
[2021-07-16 00:53:53] [config] transformer-guided-alignment-layer: last
[2021-07-16 00:53:53] [config] transformer-heads: 8
[2021-07-16 00:53:53] [config] transformer-no-projection: false
[2021-07-16 00:53:53] [config] transformer-pool: false
[2021-07-16 00:53:53] [config] transformer-postprocess: dan
[2021-07-16 00:53:53] [config] transformer-postprocess-emb: d
[2021-07-16 00:53:53] [config] transformer-postprocess-top: ""
[2021-07-16 00:53:53] [config] transformer-preprocess: ""
[2021-07-16 00:53:53] [config] transformer-tied-layers:
[2021-07-16 00:53:53] [config]   []
[2021-07-16 00:53:53] [config] transformer-train-position-embeddings: false
[2021-07-16 00:53:53] [config] tsv: false
[2021-07-16 00:53:53] [config] tsv-fields: 0
[2021-07-16 00:53:53] [config] type: transformer
[2021-07-16 00:53:53] [config] ulr: false
[2021-07-16 00:53:53] [config] ulr-dim-emb: 0
[2021-07-16 00:53:53] [config] ulr-dropout: 0
[2021-07-16 00:53:53] [config] ulr-keys-vectors: ""
[2021-07-16 00:53:53] [config] ulr-query-vectors: ""
[2021-07-16 00:53:53] [config] ulr-softmax-temperature: 1
[2021-07-16 00:53:53] [config] ulr-trainable-transformation: false
[2021-07-16 00:53:53] [config] unlikelihood-loss: false
[2021-07-16 00:53:53] [config] valid-freq: 5000
[2021-07-16 00:53:53] [config] valid-log: /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/valid.log
[2021-07-16 00:53:53] [config] valid-max-length: 1000
[2021-07-16 00:53:53] [config] valid-metrics:
[2021-07-16 00:53:53] [config]   - bleu-detok
[2021-07-16 00:53:53] [config]   - ce-mean-words
[2021-07-16 00:53:53] [config]   - perplexity
[2021-07-16 00:53:53] [config] valid-mini-batch: 64
[2021-07-16 00:53:53] [config] valid-reset-stalled: false
[2021-07-16 00:53:53] [config] valid-script-args:
[2021-07-16 00:53:53] [config]   []
[2021-07-16 00:53:53] [config] valid-script-path: ""
[2021-07-16 00:53:53] [config] valid-sets:
[2021-07-16 00:53:53] [config]   - /data/rw/home/bergamot-training/data/ru-en/allopus_bicleaner05/original/devset.ru.gz
[2021-07-16 00:53:53] [config]   - /data/rw/home/bergamot-training/data/ru-en/allopus_bicleaner05/original/devset.en.gz
[2021-07-16 00:53:53] [config] valid-translation-output: /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/devset.out
[2021-07-16 00:53:53] [config] vocabs:
[2021-07-16 00:53:53] [config]   - /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/vocab.spm
[2021-07-16 00:53:53] [config]   - /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/vocab.spm
[2021-07-16 00:53:53] [config] word-penalty: 0
[2021-07-16 00:53:53] [config] word-scores: false
[2021-07-16 00:53:53] [config] workspace: 16000
[2021-07-16 00:53:53] [config] Model is being created with Marian v1.9.56 94aeaa46 2021-04-28 00:28:35 +0100
alvations commented 2 years ago

Can confirm that I'm also having the same issue with workspace > 20GB, but on an RTX A6000 GPU. I'm using a Lambda Labs instance from https://lambdalabs.com/service/gpu-cloud

sukuya commented 1 year ago

Can confirm the same on an A100 with a workspace of 70000, using Marian 1.11.0.

JOHW85 commented 1 year ago

Seems to be fixed in 1.12.0. I'm able to run with a 41644 workspace (with fp16) on RTX A6000s. I used to have problems with fp16 (but was fine with fp32) prior to this version.