marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

Running into Cublas Error: 7 for target factors for marian 1.12 #1023

Open LauritzBrandt19116 opened 3 months ago

LauritzBrandt19116 commented 3 months ago

Bug description

Marian 1.12 (65bf82ffce52f4854295d8b98482534f176d494e) runs into this error for target factored data:

[2024-04-18 08:40:14] Error: Cublas Error: 7 - /marian/src/tensors/gpu/prod.cpp:698: cublasLtAffineTyped(ltHandle, opB, opA, n, m, k, &alpha, B->data<T>(), ldb, A->data<T>(), lda, &beta, C->data<T>(), ldc, bias->data<T>(), workspace->data<T>(), workspaceSizeBytes, do_relu, stream)
[2024-04-18 08:40:14] Error: Aborted from void marian::gpu::affineTyped(marian::Tensor, marian::Ptr<marian::Allocator>, const Tensor&, const Tensor&, const Tensor&, bool, bool, T, T, bool) [with T = float; marian::Tensor = IntrusivePtr<marian::TensorBase>; marian::Ptr<marian::Allocator> = std::shared_ptr<marian::Allocator>] in /marian/src/tensors/gpu/prod.cpp:698
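For reference, cuBLAS status code 7 corresponds to CUBLAS_STATUS_INVALID_VALUE, i.e. one of the arguments passed to the cublasLt call was rejected. A minimal standalone sketch (not part of Marian, assuming a CUDA 11+ toolkit with the cuBLAS headers installed) that just confirms this mapping:

// Standalone sketch: map the raw status code 7 from the log back to the
// cuBLAS enum name. Assumes a CUDA 11+ toolkit; not part of Marian itself.
#include <cublas_v2.h>
#include <cstdio>

int main() {
    const cublasStatus_t status = static_cast<cublasStatus_t>(7);
    // In cublas_api.h, 7 is CUBLAS_STATUS_INVALID_VALUE: one of the
    // parameters passed to the call (dimensions, leading dimensions,
    // pointers, epilogue settings) was not accepted.
    if (status == CUBLAS_STATUS_INVALID_VALUE)
        std::printf("cuBLAS status 7 == CUBLAS_STATUS_INVALID_VALUE\n");
    return 0;
}

Building this with nvcc against the same toolkit used in the container prints the enum name; per the NVIDIA documentation this status indicates an unsupported value or parameter combination rather than an out-of-memory or driver failure.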

How to reproduce

Run marian 1.12 compiled against CUDA 11+ with target factors.

I am trying to train Marian models from scratch using factored data. Training succeeds with source factors only, but training with source-and-target factors or target-only factors fails the cuBLAS check.

I compile commit 65bf82ffce52f4854295d8b98482534f176d494e in a Docker container and have tried this with several combinations of CUDA, NVIDIA driver, and Marian versions on Ubuntu 18.04, 20.04, and 22.04. Variants that were tried:

marian 1.12  | cuda 12.3.1  | nvidia 525.85.12 or 550.54.14 | ubuntu 22.04 -> fails
marian 1.12  | cuda 11.8    | nvidia 525.85.12 or 550.54.14 | ubuntu 22.04 -> fails
marian 1.11  | cuda 12.2.0  | nvidia 525.85.12              | ubuntu 20.04 -> fails
marian 1.11  | cuda 11.8    | nvidia 525.85.12              | ubuntu 20.04 -> fails
marian 1.11  | cuda 10.2    | nvidia 525.85.12 or 550.54.14 | ubuntu 18.04 -> works

Context

Marian output

+ /marian/marian --tempdir marian-tmp -c config.yml --devices 0 1 2 3 --type transformer --valid-freq 10 --save-freq 10 --early-stopping 3 --after-epochs 500 -w 6000
[2024-04-18 08:40:13] [marian] Marian v1.12.0 65bf82ff 2023-02-21 09:56:29 -0800
[2024-04-18 08:40:13] [marian] Running on 25b1c50316d0 as process 33 with command line:
[2024-04-18 08:40:13] [marian] /marian/marian --tempdir marian-tmp -c config.yml --devices 0 1 2 3 --type transformer --valid-freq 10 --save-freq 10 --early-stopping 3 --after-epochs 500 -w 6000
[2024-04-18 08:40:13] [config] after: 0e
[2024-04-18 08:40:13] [config] after-batches: 0
[2024-04-18 08:40:13] [config] after-epochs: 500
[2024-04-18 08:40:13] [config] all-caps-every: 0
[2024-04-18 08:40:13] [config] allow-unk: false
[2024-04-18 08:40:13] [config] authors: false
[2024-04-18 08:40:13] [config] beam-size: 6
[2024-04-18 08:40:13] [config] bert-class-symbol: "[CLS]"
[2024-04-18 08:40:13] [config] bert-mask-symbol: "[MASK]"
[2024-04-18 08:40:13] [config] bert-masking-fraction: 0.15
[2024-04-18 08:40:13] [config] bert-sep-symbol: "[SEP]"
[2024-04-18 08:40:13] [config] bert-train-type-embeddings: true
[2024-04-18 08:40:13] [config] bert-type-vocab-size: 2
[2024-04-18 08:40:13] [config] build-info: ""
[2024-04-18 08:40:13] [config] check-gradient-nan: false
[2024-04-18 08:40:13] [config] check-nan: false
[2024-04-18 08:40:13] [config] cite: false
[2024-04-18 08:40:13] [config] clip-norm: 5
[2024-04-18 08:40:13] [config] cost-scaling:
[2024-04-18 08:40:13] [config]   []
[2024-04-18 08:40:13] [config] cost-type: ce-sum
[2024-04-18 08:40:13] [config] cpu-threads: 0
[2024-04-18 08:40:13] [config] data-threads: 8
[2024-04-18 08:40:13] [config] data-weighting: ""
[2024-04-18 08:40:13] [config] data-weighting-type: sentence
[2024-04-18 08:40:13] [config] dec-cell: ssru
[2024-04-18 08:40:13] [config] dec-cell-base-depth: 2
[2024-04-18 08:40:13] [config] dec-cell-high-depth: 1
[2024-04-18 08:40:13] [config] dec-depth: 6
[2024-04-18 08:40:13] [config] devices:
[2024-04-18 08:40:13] [config]   - 0
[2024-04-18 08:40:13] [config]   - 1
[2024-04-18 08:40:13] [config]   - 2
[2024-04-18 08:40:13] [config]   - 3
[2024-04-18 08:40:13] [config] dim-emb: 512
[2024-04-18 08:40:13] [config] dim-rnn: 1024
[2024-04-18 08:40:13] [config] dim-vocabs:
[2024-04-18 08:40:13] [config]   - 0
[2024-04-18 08:40:13] [config]   - 0
[2024-04-18 08:40:13] [config] disp-first: 0
[2024-04-18 08:40:13] [config] disp-freq: 500
[2024-04-18 08:40:13] [config] disp-label-counts: true
[2024-04-18 08:40:13] [config] dropout-rnn: 0
[2024-04-18 08:40:13] [config] dropout-src: 0
[2024-04-18 08:40:13] [config] dropout-trg: 0
[2024-04-18 08:40:13] [config] dump-config: ""
[2024-04-18 08:40:13] [config] dynamic-gradient-scaling:
[2024-04-18 08:40:13] [config]   []
[2024-04-18 08:40:13] [config] early-stopping: 3
[2024-04-18 08:40:13] [config] early-stopping-on: first
[2024-04-18 08:40:13] [config] embedding-fix-src: false
[2024-04-18 08:40:13] [config] embedding-fix-trg: false
[2024-04-18 08:40:13] [config] embedding-normalization: false
[2024-04-18 08:40:13] [config] embedding-vectors:
[2024-04-18 08:40:13] [config]   []
[2024-04-18 08:40:13] [config] enc-cell: gru
[2024-04-18 08:40:13] [config] enc-cell-depth: 1
[2024-04-18 08:40:13] [config] enc-depth: 6
[2024-04-18 08:40:13] [config] enc-type: bidirectional
[2024-04-18 08:40:13] [config] english-title-case-every: 0
[2024-04-18 08:40:13] [config] exponential-smoothing: 0.0001
[2024-04-18 08:40:13] [config] factor-weight: 1
[2024-04-18 08:40:13] [config] factors-combine: sum
[2024-04-18 08:40:13] [config] factors-dim-emb: 0
[2024-04-18 08:40:13] [config] gradient-checkpointing: false
[2024-04-18 08:40:13] [config] gradient-norm-average-window: 100
[2024-04-18 08:40:13] [config] guided-alignment: data/train.tok.tc.clean.bpe.en.en-de.align
[2024-04-18 08:40:13] [config] guided-alignment-cost: ce
[2024-04-18 08:40:13] [config] guided-alignment-weight: 0.1
[2024-04-18 08:40:13] [config] ignore-model-config: false
[2024-04-18 08:40:13] [config] input-types:
[2024-04-18 08:40:13] [config]   []
[2024-04-18 08:40:13] [config] interpolate-env-vars: false
[2024-04-18 08:40:13] [config] keep-best: true
[2024-04-18 08:40:13] [config] label-smoothing: 0.1
[2024-04-18 08:40:13] [config] layer-normalization: false
[2024-04-18 08:40:13] [config] learn-rate: 0.0003
[2024-04-18 08:40:13] [config] lemma-dependency: ""
[2024-04-18 08:40:13] [config] lemma-dim-emb: 0
[2024-04-18 08:40:13] [config] log: ""
[2024-04-18 08:40:13] [config] log-level: info
[2024-04-18 08:40:13] [config] log-time-zone: ""
[2024-04-18 08:40:13] [config] logical-epoch:
[2024-04-18 08:40:13] [config]   - 1e
[2024-04-18 08:40:13] [config]   - 0
[2024-04-18 08:40:13] [config] lr-decay: 0
[2024-04-18 08:40:13] [config] lr-decay-freq: 50000
[2024-04-18 08:40:13] [config] lr-decay-inv-sqrt:
[2024-04-18 08:40:13] [config]   - 16000
[2024-04-18 08:40:13] [config] lr-decay-repeat-warmup: false
[2024-04-18 08:40:13] [config] lr-decay-reset-optimizer: false
[2024-04-18 08:40:13] [config] lr-decay-start:
[2024-04-18 08:40:13] [config]   - 10
[2024-04-18 08:40:13] [config]   - 1
[2024-04-18 08:40:13] [config] lr-decay-strategy: epoch+stalled
[2024-04-18 08:40:13] [config] lr-report: true
[2024-04-18 08:40:13] [config] lr-warmup: 16000
[2024-04-18 08:40:13] [config] lr-warmup-at-reload: false
[2024-04-18 08:40:13] [config] lr-warmup-cycle: false
[2024-04-18 08:40:13] [config] lr-warmup-start-rate: 0
[2024-04-18 08:40:13] [config] max-length: 100
[2024-04-18 08:40:13] [config] max-length-crop: false
[2024-04-18 08:40:13] [config] max-length-factor: 3
[2024-04-18 08:40:13] [config] maxi-batch: 1000
[2024-04-18 08:40:13] [config] maxi-batch-sort: trg
[2024-04-18 08:40:13] [config] mini-batch: 64
[2024-04-18 08:40:13] [config] mini-batch-fit: true
[2024-04-18 08:40:13] [config] mini-batch-fit-step: 10
[2024-04-18 08:40:13] [config] mini-batch-round-up: true
[2024-04-18 08:40:13] [config] mini-batch-track-lr: false
[2024-04-18 08:40:13] [config] mini-batch-warmup: 0
[2024-04-18 08:40:13] [config] mini-batch-words: 0
[2024-04-18 08:40:13] [config] mini-batch-words-ref: 0
[2024-04-18 08:40:13] [config] model: /data/training/model/model.npz
[2024-04-18 08:40:13] [config] multi-loss-type: sum
[2024-04-18 08:40:13] [config] n-best: false
[2024-04-18 08:40:13] [config] no-nccl: false
[2024-04-18 08:40:13] [config] no-reload: false
[2024-04-18 08:40:13] [config] no-restore-corpus: false
[2024-04-18 08:40:13] [config] normalize: 0.6
[2024-04-18 08:40:13] [config] normalize-gradient: false
[2024-04-18 08:40:13] [config] num-devices: 0
[2024-04-18 08:40:13] [config] optimizer: adam
[2024-04-18 08:40:13] [config] optimizer-delay: 1
[2024-04-18 08:40:13] [config] optimizer-params:
[2024-04-18 08:40:13] [config]   - 0.9
[2024-04-18 08:40:13] [config]   - 0.98
[2024-04-18 08:40:13] [config]   - 1e-09
[2024-04-18 08:40:13] [config] output-omit-bias: false
[2024-04-18 08:40:13] [config] overwrite: false
[2024-04-18 08:40:13] [config] precision:
[2024-04-18 08:40:13] [config]   - float32
[2024-04-18 08:40:13] [config]   - float32
[2024-04-18 08:40:13] [config] pretrained-model: ""
[2024-04-18 08:40:13] [config] quantize-biases: false
[2024-04-18 08:40:13] [config] quantize-bits: 0
[2024-04-18 08:40:13] [config] quantize-log-based: false
[2024-04-18 08:40:13] [config] quantize-optimization-steps: 0
[2024-04-18 08:40:13] [config] quiet: false
[2024-04-18 08:40:13] [config] quiet-translation: true
[2024-04-18 08:40:13] [config] relative-paths: false
[2024-04-18 08:40:13] [config] right-left: false
[2024-04-18 08:40:13] [config] save-freq: 10
[2024-04-18 08:40:13] [config] seed: 1111
[2024-04-18 08:40:13] [config] sharding: global
[2024-04-18 08:40:13] [config] shuffle: data
[2024-04-18 08:40:13] [config] shuffle-in-ram: false
[2024-04-18 08:40:13] [config] sigterm: save-and-exit
[2024-04-18 08:40:13] [config] skip: false
[2024-04-18 08:40:13] [config] sqlite: ""
[2024-04-18 08:40:13] [config] sqlite-drop: false
[2024-04-18 08:40:13] [config] sync-freq: 200u
[2024-04-18 08:40:13] [config] sync-sgd: true
[2024-04-18 08:40:13] [config] tempdir: marian-tmp
[2024-04-18 08:40:13] [config] tied-embeddings: true
[2024-04-18 08:40:13] [config] tied-embeddings-all: false
[2024-04-18 08:40:13] [config] tied-embeddings-src: false
[2024-04-18 08:40:13] [config] train-embedder-rank:
[2024-04-18 08:40:13] [config]   []
[2024-04-18 08:40:13] [config] train-sets:
[2024-04-18 08:40:13] [config]   - /data/training/data/train.tok.tc.clean.bpe.en
[2024-04-18 08:40:13] [config]   - /data/training/data/train.tok.tc.factorized.clean.bpe.de
[2024-04-18 08:40:13] [config] transformer-aan-activation: swish
[2024-04-18 08:40:13] [config] transformer-aan-depth: 2
[2024-04-18 08:40:13] [config] transformer-aan-nogate: false
[2024-04-18 08:40:13] [config] transformer-decoder-autoreg: rnn
[2024-04-18 08:40:13] [config] transformer-decoder-dim-ffn: 0
[2024-04-18 08:40:13] [config] transformer-decoder-ffn-depth: 0
[2024-04-18 08:40:13] [config] transformer-depth-scaling: false
[2024-04-18 08:40:13] [config] transformer-dim-aan: 2048
[2024-04-18 08:40:13] [config] transformer-dim-ffn: 2048
[2024-04-18 08:40:13] [config] transformer-dropout: 0.1
[2024-04-18 08:40:13] [config] transformer-dropout-attention: 0
[2024-04-18 08:40:13] [config] transformer-dropout-ffn: 0
[2024-04-18 08:40:13] [config] transformer-ffn-activation: swish
[2024-04-18 08:40:13] [config] transformer-ffn-depth: 2
[2024-04-18 08:40:13] [config] transformer-guided-alignment-layer: last
[2024-04-18 08:40:13] [config] transformer-heads: 8
[2024-04-18 08:40:13] [config] transformer-no-projection: false
[2024-04-18 08:40:13] [config] transformer-pool: false
[2024-04-18 08:40:13] [config] transformer-postprocess: dan
[2024-04-18 08:40:13] [config] transformer-postprocess-emb: d
[2024-04-18 08:40:13] [config] transformer-postprocess-top: ""
[2024-04-18 08:40:13] [config] transformer-preprocess: ""
[2024-04-18 08:40:13] [config] transformer-rnn-projection: false
[2024-04-18 08:40:13] [config] transformer-tied-layers:
[2024-04-18 08:40:13] [config]   []
[2024-04-18 08:40:13] [config] transformer-train-position-embeddings: false
[2024-04-18 08:40:13] [config] tsv: false
[2024-04-18 08:40:13] [config] tsv-fields: 0
[2024-04-18 08:40:13] [config] type: transformer
[2024-04-18 08:40:13] [config] ulr: false
[2024-04-18 08:40:13] [config] ulr-dim-emb: 0
[2024-04-18 08:40:13] [config] ulr-dropout: 0
[2024-04-18 08:40:13] [config] ulr-keys-vectors: ""
[2024-04-18 08:40:13] [config] ulr-query-vectors: ""
[2024-04-18 08:40:13] [config] ulr-softmax-temperature: 1
[2024-04-18 08:40:13] [config] ulr-trainable-transformation: false
[2024-04-18 08:40:13] [config] unlikelihood-loss: false
[2024-04-18 08:40:13] [config] valid-freq: 10
[2024-04-18 08:40:13] [config] valid-log: /data/training/valid.log
[2024-04-18 08:40:13] [config] valid-max-length: 1000
[2024-04-18 08:40:13] [config] valid-metrics:
[2024-04-18 08:40:13] [config]   - cross-entropy
[2024-04-18 08:40:13] [config]   - perplexity
[2024-04-18 08:40:13] [config]   - bleu
[2024-04-18 08:40:13] [config]   - translation
[2024-04-18 08:40:13] [config] valid-mini-batch: 64
[2024-04-18 08:40:13] [config] valid-reset-all: false
[2024-04-18 08:40:13] [config] valid-reset-stalled: false
[2024-04-18 08:40:13] [config] valid-script-args:
[2024-04-18 08:40:13] [config]   []
[2024-04-18 08:40:13] [config] valid-script-path: /data/training/validate.sh
[2024-04-18 08:40:13] [config] valid-sets:
[2024-04-18 08:40:13] [config]   - /data/training/data/dev.tok.tc.bpe.en
[2024-04-18 08:40:13] [config]   - /data/training/data/dev.tok.tc.factorized.bpe.de
[2024-04-18 08:40:13] [config] valid-translation-output: ""
[2024-04-18 08:40:13] [config] vocabs:
[2024-04-18 08:40:13] [config]   - /data/training/data/train.tok.tc.clean.bpe.en.yml
[2024-04-18 08:40:13] [config]   - /data/training/data/train.tok.tc.factorized.clean.bpe.de.fsv
[2024-04-18 08:40:13] [config] word-penalty: 0
[2024-04-18 08:40:13] [config] word-scores: false
[2024-04-18 08:40:13] [config] workspace: 6000
[2024-04-18 08:40:13] [config] Model is being created with Marian v1.12.0 65bf82ff 2023-02-21 09:56:29 -0800
[2024-04-18 08:40:13] Using synchronous SGD
[2024-04-18 08:40:13] [comm] Compiled without MPI support. Running as a single process on 25b1c50316d0
[2024-04-18 08:40:13] Synced seed 1111
[2024-04-18 08:40:13] [data] Loading vocabulary from JSON/Yaml file /data/training/data/train.tok.tc.clean.bpe.en.yml
[2024-04-18 08:40:13] [data] Setting vocabulary size for input 0 to 484
[2024-04-18 08:40:13] [vocab] Loading vocab spec file /data/training/data/train.tok.tc.factorized.clean.bpe.de.fsv
[2024-04-18 08:40:13] [vocab] Factor group '(lemma)' has 493 members
[2024-04-18 08:40:13] [vocab] Factor group '|C' has 4 members
[2024-04-18 08:40:13] [vocab] Factored-embedding map read with total/unique of 984/497 factors from 493 example words (in space of 2,470)
[2024-04-18 08:40:13] [vocab] Expanding all valid vocab entries out of 2,470...
[2024-04-18 08:40:13] [vocab] Completed, total 1966 valid combinations
[2024-04-18 08:40:13] [data] Setting vocabulary size for input 1 to 1,966
[2024-04-18 08:40:13] [data] Using word alignments from file data/train.tok.tc.clean.bpe.en.en-de.align
[2024-04-18 08:40:13] [batching] Collecting statistics for batch fitting with step size 10
[2024-04-18 08:40:13] [memory] Extending reserved space to 6016 MB (device gpu0)
[2024-04-18 08:40:14] [memory] Extending reserved space to 6016 MB (device gpu1)
[2024-04-18 08:40:14] [memory] Extending reserved space to 6016 MB (device gpu2)
[2024-04-18 08:40:14] [memory] Extending reserved space to 6016 MB (device gpu3)
[2024-04-18 08:40:14] [comm] Using NCCL 2.8.3 for GPU communication
[2024-04-18 08:40:14] [comm] Using global sharding
[2024-04-18 08:40:14] [comm] NCCLCommunicators constructed successfully
[2024-04-18 08:40:14] [training] Using 4 GPUs
[2024-04-18 08:40:14] [vocab] Reusing existing vocabulary object in memory (vocab size 1966)
[2024-04-18 08:40:14] [embedding] Factored embeddings enabled
[2024-04-18 08:40:14] [embedding] Factored outputs enabled
[2024-04-18 08:40:14] [logits] Applying loss function for 2 factor(s)
[2024-04-18 08:40:14] [memory] Reserving 158 MB, device gpu0
[2024-04-18 08:40:14] [gpu] 16-bit TensorCores enabled for float32 matrix operations
[2024-04-18 08:40:14] Error: Cublas Error: 7 - /marian/src/tensors/gpu/prod.cpp:698: cublasLtAffineTyped(ltHandle, opB, opA, n, m, k, &alpha, B->data<T>(), ldb, A->data<T>(), lda, &beta, C->data<T>(), ldc, bias->data<T>(), workspace->data<T>(), workspaceSizeBytes, do_relu, stream)
[2024-04-18 08:40:14] Error: Aborted from void marian::gpu::affineTyped(marian::Tensor, marian::Ptr<marian::Allocator>, const Tensor&, const Tensor&, const Tensor&, bool, bool, T, T, bool) [with T = float; marian::Tensor = IntrusivePtr<marian::TensorBase>; marian::Ptr<marian::Allocator> = std::shared_ptr<marian::Allocator>] in /marian/src/tensors/gpu/prod.cpp:698

[CALL STACK]
[0x564280173ac4]                                                       + 0xa54ac4
[0x56428016d4a8]                                                       + 0xa4e4a8
[0x56427fbedf07]                                                       + 0x4cef07
[0x56427fca3a96]                                                       + 0x584a96
[0x56427fb6302b]                                                       + 0x44402b
[0x56427fe6c21c]                                                       + 0x74d21c
[0x56427fe534c8]                                                       + 0x7344c8
[0x56427f99261a]                                                       + 0x27361a
[0x56427f8b778b]                                                       + 0x19878b
[0x7f13eb991d90]                                                       + 0x29d90
[0x7f13eb991e40]    __libc_start_main                                  + 0x80
[0x56427f8b0b55]                                                       + 0x191b55

./train.sh: line 29:    33 Aborted                 (core dumped) /marian/marian --tempdir marian-tmp -c config.yml --devices 0 1 2 3 --type transformer --valid-freq 10 --save-freq 10 --early-stopping 3 --after-epochs 500 -w 6000

marian version (in the docker environment)

root@f52169769fca:/marian# marian --version
v1.12.0 65bf82ff 2023-02-21 09:56:29 -0800

nvidia-smi output

host system 1

| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |

host system 2

| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |

failing marian 1.12 cuda 12.3 docker container on host 1

| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.3     |

working marian 1.11 cuda 10.2 docker container on host 1

| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |

failing marian 1.12 cuda 12.3 docker container on host 2

| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |

working marian 1.11 cuda 10.2 docker container on host 2

| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |

I notice that the CUDA version reported by nvidia-smi seems to be whichever is higher, the host system's or the container's, but all containers have been built to run their bundled CUDA.

cepin19 commented 2 months ago

Same problem here: non-factored models work, but factored models (both source and target factors) fail with the same error. Our configuration is the newest marian-dev with NVIDIA-SMI 550.67, Driver Version 550.67, CUDA Version 12.4.

tomsbergmanis commented 1 month ago

I have the same issue. Given that you have been waiting for three weeks with no response from the developers, I think it is fair to assume that Marian is no longer being supported.

patrickhuy commented 1 month ago

@kpu @snukky were you able to look into this already?

kpu commented 1 month ago

I don't have commit access. If @mjpost wants to claim Marian is still maintained (https://x.com/mjpost/status/1799130562344656901), he should address this issue.

bhaddow commented 1 month ago

@hieuhoang is still fixing bugs in Moses!