jorgtied opened this issue 5 years ago
This is interesting. I don't really have a way to test 32GB GPUs right now; maybe in a few weeks.
Have 32GB GPUs now, gonna work on this.
@jorgtied Can you tell me the size of your parallel corpus in tokens and sentences?
One of the corpora that failed has 30 million sentence pairs, with 412 million tokens on one side and 507 million on the other (this is already split into subword units using SentencePiece).
OK, this is caused by an overflow in shape::elements() due to using int instead of size_t or int64_t. Fixing this will take a while. I ran into problems with this in other places (e.g. training huge models), so this is now moving to the top of my TODO list. GPU memory is growing faster than I expected back in 2016 :)
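For illustration, here is a minimal sketch of that failure mode. This is hypothetical code, not Marian's actual Shape class, and the wrap is modeled with unsigned arithmetic since signed overflow is undefined behavior in C++:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical stand-in for a shape type whose elements() multiplies the
// dimensions together. With a 32-bit accumulator the product silently wraps.
struct Shape {
  std::vector<int> dims;

  // Buggy variant: models what a plain `int` accumulator does in practice.
  int32_t elementsInt32() const {
    uint32_t n = 1;                           // unsigned, so wraparound is defined
    for (int d : dims) n *= static_cast<uint32_t>(d);
    return static_cast<int32_t>(n);           // read back as signed, like `int`
  }

  // Fixed variant: a 64-bit accumulator keeps the true count.
  int64_t elements64() const {
    int64_t n = 1;
    for (int d : dims) n *= d;
    return n;
  }
};

int main() {
  Shape s{{1, 10, 8192, 32000}};              // a shape reported later in this thread
  std::cout << s.elementsInt32() << "\n";     // -1673527296 (wrapped)
  std::cout << s.elements64() << "\n";        // 2621440000
}
```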
Good to know that you found the problem. Do you have any estimate of when there could be a fix? I'd like to restart some big models with maximum memory usage again soon. Thanks!
I have a branch that fixes this, but it currently runs about 20% slower (mjd/dimtype, if you want to try it; I will make it available in a second). Unfortunately, this is a very tricky thing to get both right and fast, as I need to change the computation type for shapes throughout the entire codebase.
If your models are large you might not actually run into this problem. It only happens when the total product of vocabulary size times embedding dimension times words in a batch exceeds 2 billion. For large models your batch might never be that large.
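As a rough sanity check of that condition, a hedged sketch (exceedsInt32 is a made-up helper, not a Marian API; the batch size below is hypothetical, while the vocabulary and embedding sizes match the config posted further down):

```cpp
#include <cstdint>
#include <iostream>

// Made-up helper: does vocab * embedding-dim * batch-words fit in a signed
// 32-bit int? Computed in 64-bit so the check itself cannot overflow.
bool exceedsInt32(int64_t vocab, int64_t embDim, int64_t batchWords) {
  return vocab * embDim * batchWords > INT32_MAX;
}

int main() {
  // vocab 32000 and dim-emb 256 as in the student config below; the batch
  // of 8192 words is illustrative only.
  std::cout << std::boolalpha
            << exceedsInt32(32000, 256, 8192) << "\n"; // true: ~6.7e10 elements
}
```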
Branch mjd/dimtype should be available now. It is a bit slower for the moment, as I replaced the types more or less blindly, which seems to carry a performance penalty in cases where the wider type is not actually required.
IIRC GPUs don't have a native 64-bit int type which is why you would see a penalty.
Yeah, I figured. It's not too bad: I now at least have a version where the front end is 64-bit everywhere, and I only need to adapt the kernels to use int32 wherever possible. Seems doable.
In most cases that's broken down into a product of threads times blocks anyway, so that should be easy enough. Threads times blocks times wrap-around, actually, so even better.
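For what that decomposition looks like, here is a host-side sketch of a CUDA-style grid-stride loop (hypothetical code; the thread and block counts are arbitrary). The per-thread and per-block indices stay 32-bit, and only the running offset needs 64 bits:

```cpp
#include <cstdint>
#include <iostream>

int main() {
  const int64_t total   = 2621440000LL;  // an element count past 2^31
  const int     threads = 512;           // 32-bit thread index within a block
  const int     blocks  = 4096;          // 32-bit block index
  // Widen before multiplying so the stride itself cannot overflow:
  const int64_t stride  = static_cast<int64_t>(threads) * blocks;

  // Each (block, thread) pair would start at its flat 32-bit id and then
  // stride over the tensor; summing the chunk sizes shows full coverage.
  int64_t covered = 0;
  for (int64_t start = 0; start < total; start += stride)
    covered += (total - start < stride) ? (total - start) : stride;

  std::cout << (covered == total) << "\n"; // 1: the whole tensor is visited
}
```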
I have the same problem when trying to train on a GeForce RTX 3090 with a 20GB workspace:
[2021-05-12 17:03:15] Error: Labels not matching logits shape (2621440000 != -1673527296, shape=1x10x8192x32000 size=-1673527296)??
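The two numbers in that message are consistent with a single 32-bit wraparound; a quick check (assuming two's-complement wrapping):

```cpp
#include <cstdint>
#include <iostream>

int main() {
  int64_t expected = 1LL * 10 * 8192 * 32000;  // 2621440000, the label count
  int64_t reported = -1673527296LL;            // size printed in the error
  // The two differ by exactly 2^32, i.e. one 32-bit overflow:
  std::cout << (expected - reported == (1LL << 32)) << "\n"; // 1
}
```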
@snukky, I'm testing on the surtr machine.
This is a known issue. Your model dimensions are exceeding 32-bit integer sizes somewhere. I tried to fix it a while ago, but it resulted in a significant slow-down. Unfortunately, it's a lot of work to get right.
@snukky do you want to learn GPU programming? :)
This exact model is student.tiny11 from @snukky, but those GPUs are untested so maybe something is going on with those.
I have the same problem on a Quadro RTX 6000 GPU (24 GB) with a 21000 workspace size and the student model mentioned above. Teacher model training works fine.
Error: Labels not matching logits shape (2621440000 != -1673527296, shape=1x10x8192x32000 size=-1673527296)??
Setting workspace to 16000 solves the problem.
student model config:
[2021-07-16 00:53:52] [config] after: 0e
[2021-07-16 00:53:52] [config] after-batches: 0
[2021-07-16 00:53:52] [config] after-epochs: 0
[2021-07-16 00:53:52] [config] all-caps-every: 0
[2021-07-16 00:53:52] [config] allow-unk: false
[2021-07-16 00:53:52] [config] authors: false
[2021-07-16 00:53:52] [config] beam-size: 1
[2021-07-16 00:53:52] [config] bert-class-symbol: "[CLS]"
[2021-07-16 00:53:52] [config] bert-mask-symbol: "[MASK]"
[2021-07-16 00:53:52] [config] bert-masking-fraction: 0.15
[2021-07-16 00:53:52] [config] bert-sep-symbol: "[SEP]"
[2021-07-16 00:53:52] [config] bert-train-type-embeddings: true
[2021-07-16 00:53:52] [config] bert-type-vocab-size: 2
[2021-07-16 00:53:52] [config] build-info: ""
[2021-07-16 00:53:52] [config] cite: false
[2021-07-16 00:53:52] [config] clip-gemm: 0
[2021-07-16 00:53:52] [config] clip-norm: 0
[2021-07-16 00:53:52] [config] cost-scaling:
[2021-07-16 00:53:52] [config] []
[2021-07-16 00:53:52] [config] cost-type: ce-mean-words
[2021-07-16 00:53:52] [config] cpu-threads: 0
[2021-07-16 00:53:52] [config] data-weighting: ""
[2021-07-16 00:53:52] [config] data-weighting-type: sentence
[2021-07-16 00:53:52] [config] dec-cell: ssru
[2021-07-16 00:53:52] [config] dec-cell-base-depth: 2
[2021-07-16 00:53:52] [config] dec-cell-high-depth: 1
[2021-07-16 00:53:52] [config] dec-depth: 2
[2021-07-16 00:53:52] [config] devices:
[2021-07-16 00:53:52] [config] - 0
[2021-07-16 00:53:52] [config] - 1
[2021-07-16 00:53:52] [config] - 2
[2021-07-16 00:53:52] [config] - 3
[2021-07-16 00:53:52] [config] - 4
[2021-07-16 00:53:52] [config] - 5
[2021-07-16 00:53:52] [config] - 6
[2021-07-16 00:53:52] [config] - 7
[2021-07-16 00:53:52] [config] dim-emb: 256
[2021-07-16 00:53:52] [config] dim-rnn: 1024
[2021-07-16 00:53:52] [config] dim-vocabs:
[2021-07-16 00:53:52] [config] - 32000
[2021-07-16 00:53:52] [config] - 32000
[2021-07-16 00:53:52] [config] disp-first: 10
[2021-07-16 00:53:52] [config] disp-freq: 1000
[2021-07-16 00:53:52] [config] disp-label-counts: true
[2021-07-16 00:53:52] [config] dropout-rnn: 0
[2021-07-16 00:53:52] [config] dropout-src: 0
[2021-07-16 00:53:52] [config] dropout-trg: 0
[2021-07-16 00:53:52] [config] dump-config: ""
[2021-07-16 00:53:52] [config] early-stopping: 20
[2021-07-16 00:53:52] [config] embedding-fix-src: false
[2021-07-16 00:53:52] [config] embedding-fix-trg: false
[2021-07-16 00:53:52] [config] embedding-normalization: false
[2021-07-16 00:53:52] [config] embedding-vectors:
[2021-07-16 00:53:53] [config] []
[2021-07-16 00:53:53] [config] enc-cell: gru
[2021-07-16 00:53:53] [config] enc-cell-depth: 1
[2021-07-16 00:53:53] [config] enc-depth: 6
[2021-07-16 00:53:53] [config] enc-type: bidirectional
[2021-07-16 00:53:53] [config] english-title-case-every: 0
[2021-07-16 00:53:53] [config] exponential-smoothing: true
[2021-07-16 00:53:53] [config] factor-weight: 1
[2021-07-16 00:53:53] [config] grad-dropping-momentum: 0
[2021-07-16 00:53:53] [config] grad-dropping-rate: 0
[2021-07-16 00:53:53] [config] grad-dropping-warmup: 100
[2021-07-16 00:53:53] [config] gradient-checkpointing: false
[2021-07-16 00:53:53] [config] guided-alignment: /data/rw/home/bergamot-training/data/ru-en/allopus_bicleaner05/alignment/corpus.aln.gz
[2021-07-16 00:53:53] [config] guided-alignment-cost: mse
[2021-07-16 00:53:53] [config] guided-alignment-weight: 0.1
[2021-07-16 00:53:53] [config] ignore-model-config: false
[2021-07-16 00:53:53] [config] input-types:
[2021-07-16 00:53:53] [config] []
[2021-07-16 00:53:53] [config] interpolate-env-vars: false
[2021-07-16 00:53:53] [config] keep-best: true
[2021-07-16 00:53:53] [config] label-smoothing: 0
[2021-07-16 00:53:53] [config] layer-normalization: false
[2021-07-16 00:53:53] [config] learn-rate: 0.0003
[2021-07-16 00:53:53] [config] lemma-dim-emb: 0
[2021-07-16 00:53:53] [config] log: /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/train.log
[2021-07-16 00:53:53] [config] log-level: info
[2021-07-16 00:53:53] [config] log-time-zone: ""
[2021-07-16 00:53:53] [config] logical-epoch:
[2021-07-16 00:53:53] [config] - 1e
[2021-07-16 00:53:53] [config] - 0
[2021-07-16 00:53:53] [config] lr-decay: 0
[2021-07-16 00:53:53] [config] lr-decay-freq: 50000
[2021-07-16 00:53:53] [config] lr-decay-inv-sqrt:
[2021-07-16 00:53:53] [config] - 32000
[2021-07-16 00:53:53] [config] lr-decay-repeat-warmup: false
[2021-07-16 00:53:53] [config] lr-decay-reset-optimizer: false
[2021-07-16 00:53:53] [config] lr-decay-start:
[2021-07-16 00:53:53] [config] - 10
[2021-07-16 00:53:53] [config] - 1
[2021-07-16 00:53:53] [config] lr-decay-strategy: epoch+stalled
[2021-07-16 00:53:53] [config] lr-report: true
[2021-07-16 00:53:53] [config] lr-warmup: 16000
[2021-07-16 00:53:53] [config] lr-warmup-at-reload: false
[2021-07-16 00:53:53] [config] lr-warmup-cycle: false
[2021-07-16 00:53:53] [config] lr-warmup-start-rate: 0
[2021-07-16 00:53:53] [config] max-length: 200
[2021-07-16 00:53:53] [config] max-length-crop: false
[2021-07-16 00:53:53] [config] max-length-factor: 3
[2021-07-16 00:53:53] [config] maxi-batch: 1000
[2021-07-16 00:53:53] [config] maxi-batch-sort: trg
[2021-07-16 00:53:53] [config] mini-batch: 1000
[2021-07-16 00:53:53] [config] mini-batch-fit: true
[2021-07-16 00:53:53] [config] mini-batch-fit-step: 10
[2021-07-16 00:53:53] [config] mini-batch-track-lr: false
[2021-07-16 00:53:53] [config] mini-batch-warmup: 0
[2021-07-16 00:53:53] [config] mini-batch-words: 0
[2021-07-16 00:53:53] [config] mini-batch-words-ref: 0
[2021-07-16 00:53:53] [config] model: /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/model.npz
[2021-07-16 00:53:53] [config] multi-loss-type: sum
[2021-07-16 00:53:53] [config] multi-node: false
[2021-07-16 00:53:53] [config] multi-node-overlap: true
[2021-07-16 00:53:53] [config] n-best: false
[2021-07-16 00:53:53] [config] no-nccl: false
[2021-07-16 00:53:53] [config] no-reload: false
[2021-07-16 00:53:53] [config] no-restore-corpus: false
[2021-07-16 00:53:53] [config] normalize: 1
[2021-07-16 00:53:53] [config] normalize-gradient: false
[2021-07-16 00:53:53] [config] num-devices: 0
[2021-07-16 00:53:53] [config] optimizer: adam
[2021-07-16 00:53:53] [config] optimizer-delay: 2
[2021-07-16 00:53:53] [config] optimizer-params:
[2021-07-16 00:53:53] [config] - 0.9
[2021-07-16 00:53:53] [config] - 0.98
[2021-07-16 00:53:53] [config] - 1e-09
[2021-07-16 00:53:53] [config] output-omit-bias: false
[2021-07-16 00:53:53] [config] overwrite: true
[2021-07-16 00:53:53] [config] precision:
[2021-07-16 00:53:53] [config] - float32
[2021-07-16 00:53:53] [config] - float32
[2021-07-16 00:53:53] [config] - float32
[2021-07-16 00:53:53] [config] pretrained-model: ""
[2021-07-16 00:53:53] [config] quantize-biases: false
[2021-07-16 00:53:53] [config] quantize-bits: 0
[2021-07-16 00:53:53] [config] quantize-log-based: false
[2021-07-16 00:53:53] [config] quantize-optimization-steps: 0
[2021-07-16 00:53:53] [config] quiet: false
[2021-07-16 00:53:53] [config] quiet-translation: true
[2021-07-16 00:53:53] [config] relative-paths: false
[2021-07-16 00:53:53] [config] right-left: false
[2021-07-16 00:53:53] [config] save-freq: 5000
[2021-07-16 00:53:53] [config] seed: 0
[2021-07-16 00:53:53] [config] sentencepiece-alphas:
[2021-07-16 00:53:53] [config] []
[2021-07-16 00:53:53] [config] sentencepiece-max-lines: 2000000
[2021-07-16 00:53:53] [config] sentencepiece-options: ""
[2021-07-16 00:53:53] [config] shuffle: data
[2021-07-16 00:53:53] [config] shuffle-in-ram: true
[2021-07-16 00:53:53] [config] sigterm: save-and-exit
[2021-07-16 00:53:53] [config] skip: false
[2021-07-16 00:53:53] [config] sqlite: ""
[2021-07-16 00:53:53] [config] sqlite-drop: false
[2021-07-16 00:53:53] [config] sync-sgd: true
[2021-07-16 00:53:53] [config] tempdir: /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/tmp
[2021-07-16 00:53:53] [config] tied-embeddings: false
[2021-07-16 00:53:53] [config] tied-embeddings-all: true
[2021-07-16 00:53:53] [config] tied-embeddings-src: false
[2021-07-16 00:53:53] [config] train-embedder-rank:
[2021-07-16 00:53:53] [config] []
[2021-07-16 00:53:53] [config] train-sets:
[2021-07-16 00:53:53] [config] - /data/rw/home/bergamot-training/data/ru-en/allopus_bicleaner05/filtered/corpus.ru.gz
[2021-07-16 00:53:53] [config] - /data/rw/home/bergamot-training/data/ru-en/allopus_bicleaner05/filtered/corpus.en.gz
[2021-07-16 00:53:53] [config] transformer-aan-activation: swish
[2021-07-16 00:53:53] [config] transformer-aan-depth: 2
[2021-07-16 00:53:53] [config] transformer-aan-nogate: false
[2021-07-16 00:53:53] [config] transformer-decoder-autoreg: rnn
[2021-07-16 00:53:53] [config] transformer-depth-scaling: false
[2021-07-16 00:53:53] [config] transformer-dim-aan: 2048
[2021-07-16 00:53:53] [config] transformer-dim-ffn: 1536
[2021-07-16 00:53:53] [config] transformer-dropout: 0
[2021-07-16 00:53:53] [config] transformer-dropout-attention: 0
[2021-07-16 00:53:53] [config] transformer-dropout-ffn: 0
[2021-07-16 00:53:53] [config] transformer-ffn-activation: relu
[2021-07-16 00:53:53] [config] transformer-ffn-depth: 2
[2021-07-16 00:53:53] [config] transformer-guided-alignment-layer: last
[2021-07-16 00:53:53] [config] transformer-heads: 8
[2021-07-16 00:53:53] [config] transformer-no-projection: false
[2021-07-16 00:53:53] [config] transformer-pool: false
[2021-07-16 00:53:53] [config] transformer-postprocess: dan
[2021-07-16 00:53:53] [config] transformer-postprocess-emb: d
[2021-07-16 00:53:53] [config] transformer-postprocess-top: ""
[2021-07-16 00:53:53] [config] transformer-preprocess: ""
[2021-07-16 00:53:53] [config] transformer-tied-layers:
[2021-07-16 00:53:53] [config] []
[2021-07-16 00:53:53] [config] transformer-train-position-embeddings: false
[2021-07-16 00:53:53] [config] tsv: false
[2021-07-16 00:53:53] [config] tsv-fields: 0
[2021-07-16 00:53:53] [config] type: transformer
[2021-07-16 00:53:53] [config] ulr: false
[2021-07-16 00:53:53] [config] ulr-dim-emb: 0
[2021-07-16 00:53:53] [config] ulr-dropout: 0
[2021-07-16 00:53:53] [config] ulr-keys-vectors: ""
[2021-07-16 00:53:53] [config] ulr-query-vectors: ""
[2021-07-16 00:53:53] [config] ulr-softmax-temperature: 1
[2021-07-16 00:53:53] [config] ulr-trainable-transformation: false
[2021-07-16 00:53:53] [config] unlikelihood-loss: false
[2021-07-16 00:53:53] [config] valid-freq: 5000
[2021-07-16 00:53:53] [config] valid-log: /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/valid.log
[2021-07-16 00:53:53] [config] valid-max-length: 1000
[2021-07-16 00:53:53] [config] valid-metrics:
[2021-07-16 00:53:53] [config] - bleu-detok
[2021-07-16 00:53:53] [config] - ce-mean-words
[2021-07-16 00:53:53] [config] - perplexity
[2021-07-16 00:53:53] [config] valid-mini-batch: 64
[2021-07-16 00:53:53] [config] valid-reset-stalled: false
[2021-07-16 00:53:53] [config] valid-script-args:
[2021-07-16 00:53:53] [config] []
[2021-07-16 00:53:53] [config] valid-script-path: ""
[2021-07-16 00:53:53] [config] valid-sets:
[2021-07-16 00:53:53] [config] - /data/rw/home/bergamot-training/data/ru-en/allopus_bicleaner05/original/devset.ru.gz
[2021-07-16 00:53:53] [config] - /data/rw/home/bergamot-training/data/ru-en/allopus_bicleaner05/original/devset.en.gz
[2021-07-16 00:53:53] [config] valid-translation-output: /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/devset.out
[2021-07-16 00:53:53] [config] vocabs:
[2021-07-16 00:53:53] [config] - /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/vocab.spm
[2021-07-16 00:53:53] [config] - /data/rw/home/bergamot-training/models/ru-en/allopus_bicleaner05/student/vocab.spm
[2021-07-16 00:53:53] [config] word-penalty: 0
[2021-07-16 00:53:53] [config] word-scores: false
[2021-07-16 00:53:53] [config] workspace: 16000
[2021-07-16 00:53:53] [config] Model is being created with Marian v1.9.56 94aeaa46 2021-04-28 00:28:35 +0100
Can confirm that I'm also having the same issue with workspace > 20GB, but on an RTX A6000 GPU. I'm using Lambda Labs' instance on https://lambdalabs.com/service/gpu-cloud
Can confirm the same on an A100 with a 70000 workspace, using Marian 1.11.0.
Seems to be fixed in 1.12.0. I'm able to run a 41644 workspace (with fp16) on RTX A6000s. I used to have problems with fp16 (but it was fine with fp32) prior to this version.
marian throws an error message when training with a workspace > 26000 (tested on a V100 with 32GB of memory):
It is compiled with Boost 1.68 and GCC 8.3.0.
Other command-line parameters (besides data, log files, and word alignment for the guided-alignment feature):