marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

Error: Cublas Error: 13 #383

Open jie-tu914 opened 2 years ago

jie-tu914 commented 2 years ago

Bug description

When I tried to train an RNN model with marian-dev, training aborted almost immediately with the following error: Error: Cublas Error: 13 - /home/moses/moses/tj/marian/marian/src/tensors/gpu/prod.cpp:118: cublasGemmEx(handle, transa, transb, m, n, k, alpha, A, CUDA_R_32F, lda, B, CUDA_R_32F, ldb, beta, C, CUDA_R_32F, ldc, CUDA_R_32F, algorithm). I have not made any changes to the code.

My Marian version is v1.11.0 and my CUDA version is 10.1.

The log shows:

[2022-02-27 14:16:46] [marian] Marian v1.11.0 f00d0621 2022-02-08 08:39:24 -0800
[2022-02-27 14:16:46] [marian] Running on moses-Precision-Tower-7910 as process 7754 with command line:
[2022-02-27 14:16:46] [marian] /home/moses/moses/tj/marian/marian/build/marian --sync-sgd --model model/model.npz -T . --devices 0 --train-sets data/train.bpe.de data/train.bpe.en --vocabs data/train.bpe.de.json data/train.bpe.en.json --mini-batch-fit -w 3000 --dim-vocabs 50000 50000 --layer-normalization --dropout-rnn 0.2 --dropout-src 0.1 --dropout-trg 0.1 --learn-rate 0.0001 --after-epochs 0 --early-stopping 5 --valid-freq 20000 --save-freq 20000 --disp-freq 2000 --valid-mini-batch 8 --valid-sets data/dev.bpe.de data/dev.bpe.en --valid-metrics cross-entropy perplexity translation --valid-translation-output model/dev.out --valid-script-path ./score-dev.sh --seed 1111 --exponential-smoothing --normalize=1 --beam-size=12 --quiet-translation --log model/train.log --valid-log model/valid.log
[2022-02-27 14:16:46] [config] after: 0e
[2022-02-27 14:16:46] [config] after-batches: 0
[2022-02-27 14:16:46] [config] after-epochs: 0
[2022-02-27 14:16:46] [config] all-caps-every: 0
[2022-02-27 14:16:46] [config] allow-unk: false
[2022-02-27 14:16:46] [config] authors: false
[2022-02-27 14:16:46] [config] beam-size: 12
[2022-02-27 14:16:46] [config] bert-class-symbol: "[CLS]"
[2022-02-27 14:16:46] [config] bert-mask-symbol: "[MASK]"
[2022-02-27 14:16:46] [config] bert-masking-fraction: 0.15
[2022-02-27 14:16:46] [config] bert-sep-symbol: "[SEP]"
[2022-02-27 14:16:46] [config] bert-train-type-embeddings: true
[2022-02-27 14:16:46] [config] bert-type-vocab-size: 2
[2022-02-27 14:16:46] [config] build-info: ""
[2022-02-27 14:16:46] [config] check-gradient-nan: false
[2022-02-27 14:16:46] [config] check-nan: false
[2022-02-27 14:16:46] [config] cite: false
[2022-02-27 14:16:46] [config] clip-norm: 1
[2022-02-27 14:16:46] [config] cost-scaling:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] cost-type: ce-sum
[2022-02-27 14:16:46] [config] cpu-threads: 0
[2022-02-27 14:16:46] [config] data-threads: 8
[2022-02-27 14:16:46] [config] data-weighting: ""
[2022-02-27 14:16:46] [config] data-weighting-type: sentence
[2022-02-27 14:16:46] [config] dec-cell: gru
[2022-02-27 14:16:46] [config] dec-cell-base-depth: 2
[2022-02-27 14:16:46] [config] dec-cell-high-depth: 1
[2022-02-27 14:16:46] [config] dec-depth: 1
[2022-02-27 14:16:46] [config] devices:
[2022-02-27 14:16:46] [config] - 0
[2022-02-27 14:16:46] [config] dim-emb: 512
[2022-02-27 14:16:46] [config] dim-rnn: 1024
[2022-02-27 14:16:46] [config] dim-vocabs:
[2022-02-27 14:16:46] [config] - 50000
[2022-02-27 14:16:46] [config] - 50000
[2022-02-27 14:16:46] [config] disp-first: 0
[2022-02-27 14:16:46] [config] disp-freq: 2000
[2022-02-27 14:16:46] [config] disp-label-counts: true
[2022-02-27 14:16:46] [config] dropout-rnn: 0.2
[2022-02-27 14:16:46] [config] dropout-src: 0.1
[2022-02-27 14:16:46] [config] dropout-trg: 0.1
[2022-02-27 14:16:46] [config] dump-config: ""
[2022-02-27 14:16:46] [config] dynamic-gradient-scaling:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] early-stopping: 5
[2022-02-27 14:16:46] [config] early-stopping-on: first
[2022-02-27 14:16:46] [config] embedding-fix-src: false
[2022-02-27 14:16:46] [config] embedding-fix-trg: false
[2022-02-27 14:16:46] [config] embedding-normalization: false
[2022-02-27 14:16:46] [config] embedding-vectors:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] enc-cell: gru
[2022-02-27 14:16:46] [config] enc-cell-depth: 1
[2022-02-27 14:16:46] [config] enc-depth: 1
[2022-02-27 14:16:46] [config] enc-type: bidirectional
[2022-02-27 14:16:46] [config] english-title-case-every: 0
[2022-02-27 14:16:46] [config] exponential-smoothing: 0.0001
[2022-02-27 14:16:46] [config] factor-weight: 1
[2022-02-27 14:16:46] [config] factors-combine: sum
[2022-02-27 14:16:46] [config] factors-dim-emb: 0
[2022-02-27 14:16:46] [config] gradient-checkpointing: false
[2022-02-27 14:16:46] [config] gradient-norm-average-window: 100
[2022-02-27 14:16:46] [config] guided-alignment: none
[2022-02-27 14:16:46] [config] guided-alignment-cost: mse
[2022-02-27 14:16:46] [config] guided-alignment-weight: 0.1
[2022-02-27 14:16:46] [config] ignore-model-config: false
[2022-02-27 14:16:46] [config] input-types:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] interpolate-env-vars: false
[2022-02-27 14:16:46] [config] keep-best: false
[2022-02-27 14:16:46] [config] label-smoothing: 0
[2022-02-27 14:16:46] [config] layer-normalization: true
[2022-02-27 14:16:46] [config] learn-rate: 0.0001
[2022-02-27 14:16:46] [config] lemma-dependency: ""
[2022-02-27 14:16:46] [config] lemma-dim-emb: 0
[2022-02-27 14:16:46] [config] log: model/train.log
[2022-02-27 14:16:46] [config] log-level: info
[2022-02-27 14:16:46] [config] log-time-zone: ""
[2022-02-27 14:16:46] [config] logical-epoch:
[2022-02-27 14:16:46] [config] - 1e
[2022-02-27 14:16:46] [config] - 0
[2022-02-27 14:16:46] [config] lr-decay: 0
[2022-02-27 14:16:46] [config] lr-decay-freq: 50000
[2022-02-27 14:16:46] [config] lr-decay-inv-sqrt:
[2022-02-27 14:16:46] [config] - 0
[2022-02-27 14:16:46] [config] lr-decay-repeat-warmup: false
[2022-02-27 14:16:46] [config] lr-decay-reset-optimizer: false
[2022-02-27 14:16:46] [config] lr-decay-start:
[2022-02-27 14:16:46] [config] - 10
[2022-02-27 14:16:46] [config] - 1
[2022-02-27 14:16:46] [config] lr-decay-strategy: epoch+stalled
[2022-02-27 14:16:46] [config] lr-report: false
[2022-02-27 14:16:46] [config] lr-warmup: 0
[2022-02-27 14:16:46] [config] lr-warmup-at-reload: false
[2022-02-27 14:16:46] [config] lr-warmup-cycle: false
[2022-02-27 14:16:46] [config] lr-warmup-start-rate: 0
[2022-02-27 14:16:46] [config] max-length: 50
[2022-02-27 14:16:46] [config] max-length-crop: false
[2022-02-27 14:16:46] [config] max-length-factor: 3
[2022-02-27 14:16:46] [config] maxi-batch: 100
[2022-02-27 14:16:46] [config] maxi-batch-sort: trg
[2022-02-27 14:16:46] [config] mini-batch: 64
[2022-02-27 14:16:46] [config] mini-batch-fit: true
[2022-02-27 14:16:46] [config] mini-batch-fit-step: 10
[2022-02-27 14:16:46] [config] mini-batch-round-up: true
[2022-02-27 14:16:46] [config] mini-batch-track-lr: false
[2022-02-27 14:16:46] [config] mini-batch-warmup: 0
[2022-02-27 14:16:46] [config] mini-batch-words: 0
[2022-02-27 14:16:46] [config] mini-batch-words-ref: 0
[2022-02-27 14:16:46] [config] model: model/model.npz
[2022-02-27 14:16:46] [config] multi-loss-type: sum
[2022-02-27 14:16:46] [config] n-best: false
[2022-02-27 14:16:46] [config] no-nccl: false
[2022-02-27 14:16:46] [config] no-reload: false
[2022-02-27 14:16:46] [config] no-restore-corpus: false
[2022-02-27 14:16:46] [config] normalize: 1
[2022-02-27 14:16:46] [config] normalize-gradient: false
[2022-02-27 14:16:46] [config] num-devices: 0
[2022-02-27 14:16:46] [config] optimizer: adam
[2022-02-27 14:16:46] [config] optimizer-delay: 1
[2022-02-27 14:16:46] [config] optimizer-params:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] output-omit-bias: false
[2022-02-27 14:16:46] [config] overwrite: false
[2022-02-27 14:16:46] [config] precision:
[2022-02-27 14:16:46] [config] - float32
[2022-02-27 14:16:46] [config] - float32
[2022-02-27 14:16:46] [config] pretrained-model: ""
[2022-02-27 14:16:46] [config] quantize-biases: false
[2022-02-27 14:16:46] [config] quantize-bits: 0
[2022-02-27 14:16:46] [config] quantize-log-based: false
[2022-02-27 14:16:46] [config] quantize-optimization-steps: 0
[2022-02-27 14:16:46] [config] quiet: false
[2022-02-27 14:16:46] [config] quiet-translation: true
[2022-02-27 14:16:46] [config] relative-paths: false
[2022-02-27 14:16:46] [config] right-left: false
[2022-02-27 14:16:46] [config] save-freq: 20000
[2022-02-27 14:16:46] [config] seed: 1111
[2022-02-27 14:16:46] [config] sentencepiece-alphas:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] sentencepiece-max-lines: 2000000
[2022-02-27 14:16:46] [config] sentencepiece-options: ""
[2022-02-27 14:16:46] [config] sharding: global
[2022-02-27 14:16:46] [config] shuffle: data
[2022-02-27 14:16:46] [config] shuffle-in-ram: false
[2022-02-27 14:16:46] [config] sigterm: save-and-exit
[2022-02-27 14:16:46] [config] skip: false
[2022-02-27 14:16:46] [config] sqlite: ""
[2022-02-27 14:16:46] [config] sqlite-drop: false
[2022-02-27 14:16:46] [config] sync-freq: 200u
[2022-02-27 14:16:46] [config] sync-sgd: true
[2022-02-27 14:16:46] [config] tempdir: .
[2022-02-27 14:16:46] [config] tied-embeddings: false
[2022-02-27 14:16:46] [config] tied-embeddings-all: false
[2022-02-27 14:16:46] [config] tied-embeddings-src: false
[2022-02-27 14:16:46] [config] train-embedder-rank:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] train-sets:
[2022-02-27 14:16:46] [config] - data/train.bpe.de
[2022-02-27 14:16:46] [config] - data/train.bpe.en
[2022-02-27 14:16:46] [config] transformer-aan-activation: swish
[2022-02-27 14:16:46] [config] transformer-aan-depth: 2
[2022-02-27 14:16:46] [config] transformer-aan-nogate: false
[2022-02-27 14:16:46] [config] transformer-decoder-autoreg: self-attention
[2022-02-27 14:16:46] [config] transformer-decoder-dim-ffn: 0
[2022-02-27 14:16:46] [config] transformer-decoder-ffn-depth: 0
[2022-02-27 14:16:46] [config] transformer-depth-scaling: false
[2022-02-27 14:16:46] [config] transformer-dim-aan: 2048
[2022-02-27 14:16:46] [config] transformer-dim-ffn: 2048
[2022-02-27 14:16:46] [config] transformer-dropout: 0
[2022-02-27 14:16:46] [config] transformer-dropout-attention: 0
[2022-02-27 14:16:46] [config] transformer-dropout-ffn: 0
[2022-02-27 14:16:46] [config] transformer-ffn-activation: swish
[2022-02-27 14:16:46] [config] transformer-ffn-depth: 2
[2022-02-27 14:16:46] [config] transformer-guided-alignment-layer: last
[2022-02-27 14:16:46] [config] transformer-heads: 8
[2022-02-27 14:16:46] [config] transformer-no-projection: false
[2022-02-27 14:16:46] [config] transformer-pool: false
[2022-02-27 14:16:46] [config] transformer-postprocess: dan
[2022-02-27 14:16:46] [config] transformer-postprocess-emb: d
[2022-02-27 14:16:46] [config] transformer-postprocess-top: ""
[2022-02-27 14:16:46] [config] transformer-preprocess: ""
[2022-02-27 14:16:46] [config] transformer-tied-layers:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] transformer-train-position-embeddings: false
[2022-02-27 14:16:46] [config] tsv: false
[2022-02-27 14:16:46] [config] tsv-fields: 0
[2022-02-27 14:16:46] [config] type: amun
[2022-02-27 14:16:46] [config] ulr: false
[2022-02-27 14:16:46] [config] ulr-dim-emb: 0
[2022-02-27 14:16:46] [config] ulr-dropout: 0
[2022-02-27 14:16:46] [config] ulr-keys-vectors: ""
[2022-02-27 14:16:46] [config] ulr-query-vectors: ""
[2022-02-27 14:16:46] [config] ulr-softmax-temperature: 1
[2022-02-27 14:16:46] [config] ulr-trainable-transformation: false
[2022-02-27 14:16:46] [config] unlikelihood-loss: false
[2022-02-27 14:16:46] [config] valid-freq: 20000
[2022-02-27 14:16:46] [config] valid-log: model/valid.log
[2022-02-27 14:16:46] [config] valid-max-length: 1000
[2022-02-27 14:16:46] [config] valid-metrics:
[2022-02-27 14:16:46] [config] - cross-entropy
[2022-02-27 14:16:46] [config] - perplexity
[2022-02-27 14:16:46] [config] - translation
[2022-02-27 14:16:46] [config] valid-mini-batch: 8
[2022-02-27 14:16:46] [config] valid-reset-stalled: false
[2022-02-27 14:16:46] [config] valid-script-args:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] valid-script-path: ./score-dev.sh
[2022-02-27 14:16:46] [config] valid-sets:
[2022-02-27 14:16:46] [config] - data/dev.bpe.de
[2022-02-27 14:16:46] [config] - data/dev.bpe.en
[2022-02-27 14:16:46] [config] valid-translation-output: model/dev.out
[2022-02-27 14:16:46] [config] vocabs:
[2022-02-27 14:16:46] [config] - data/train.bpe.de.json
[2022-02-27 14:16:46] [config] - data/train.bpe.en.json
[2022-02-27 14:16:46] [config] word-penalty: 0
[2022-02-27 14:16:46] [config] word-scores: false
[2022-02-27 14:16:46] [config] workspace: 3000
[2022-02-27 14:16:46] [config] Model is being created with Marian v1.11.0 f00d0621 2022-02-08 08:39:24 -0800
[2022-02-27 14:16:46] Using synchronous SGD
[2022-02-27 14:16:46] [comm] Compiled without MPI support. Running as a single process on moses-Precision-Tower-7910
[2022-02-27 14:16:46] Synced seed 1111
[2022-02-27 14:16:46] [data] Loading vocabulary from JSON/Yaml file data/train.bpe.de.json
[2022-02-27 14:16:46] [data] Using unused word id eos for 0
[2022-02-27 14:16:46] [data] Using unused word id UNK for 1
[2022-02-27 14:16:46] [data] Setting vocabulary size for input 0 to 50,000
[2022-02-27 14:16:46] [data] Loading vocabulary from JSON/Yaml file data/train.bpe.en.json
[2022-02-27 14:16:47] [data] Using unused word id eos for 0
[2022-02-27 14:16:47] [data] Using unused word id UNK for 1
[2022-02-27 14:16:47] [data] Setting vocabulary size for input 1 to 50,000
[2022-02-27 14:16:47] [batching] Collecting statistics for batch fitting with step size 10
[2022-02-27 14:16:47] [memory] Extending reserved space to 3072 MB (device gpu0)
[2022-02-27 14:16:47] [comm] Using NCCL 2.8.3 for GPU communication
[2022-02-27 14:16:47] [comm] Using global sharding
[2022-02-27 14:16:47] [comm] NCCLCommunicators constructed successfully
[2022-02-27 14:16:47] [training] Using 1 GPUs
[2022-02-27 14:16:47] [logits] Applying loss function for 1 factor(s)
[2022-02-27 14:16:47] [memory] Reserving 422 MB, device gpu0
[2022-02-27 14:16:47] [gpu] 16-bit TensorCores enabled for float32 matrix operations
[2022-02-27 14:16:47] Error: Cublas Error: 13 - /home/moses/moses/tj/marian/marian/src/tensors/gpu/prod.cpp:118: cublasGemmEx(handle, transa, transb, m, n, k, alpha, A, CUDA_R_32F, lda, B, CUDA_R_32F, ldb, beta, C, CUDA_R_32F, ldc, CUDA_R_32F, algorithm)
[2022-02-27 14:16:47] Error: Aborted from static void marian::gpu::TypedGemm<float, float>::gemm(cublasHandle_t, marian::gpu::CudaCompute, cublasOperation_t, cublasOperation_t, int, int, int, const float, const float, int, const float, int, const float, float*, int) in /home/moses/moses/tj/marian/marian/src/tensors/gpu/prod.cpp:118

[CALL STACK]
[0x56408b3f5fd6] marian::gpu::TypedGemm<float,float>:: gemm (cublasContext, marian::gpu::CudaCompute, cublasOperation_t, cublasOperation_t, int, int, int, float const, float const, int, float const, int, float const, float, int) + 0x5a6
[0x56408b3f6e75] void marian::gpu:: ProdTyped <float,float>(IntrusivePtr, IntrusivePtr const&, IntrusivePtr const&, bool, bool, float, float) + 0x8a5
[0x56408b3f1283] marian::gpu:: Prod (IntrusivePtr, IntrusivePtr const&, IntrusivePtr const&, bool, bool, float, float, marian::Type) + 0x493
[0x56408b3f16de] marian::gpu:: Prod (IntrusivePtr, IntrusivePtr const&, IntrusivePtr const&, bool, bool, float, float) + 0x4e
[0x56408aeeede2] std::_Function_handler<void (),marian::DotNodeOp::forwardOps()::{lambda()#1}>:: _M_invoke (std::_Any_data const&) + 0x1f2
[0x56408af99e61] marian::Node:: forward () + 0x211
[0x56408ae8ebcb] marian::ExpressionGraph:: forward (std::__cxx11::list<IntrusivePtr<marian::Chainable<IntrusivePtr>>,std::allocator<IntrusivePtr<marian::Chainable<IntrusivePtr>>>>&, bool) + 0x22b
[0x56408ae9074c] marian::ExpressionGraph:: forwardNext () + 0x23c
[0x56408b133388] marian::GraphGroup:: collectStats (std::shared_ptr, std::shared_ptr, std::vector<std::shared_ptr,std::allocator<std::shared_ptr>> const&, double) + 0xcc8
[0x56408b119f38] marian::SyncGraphGroup:: collectStats (std::vector<std::shared_ptr,std::allocator<std::shared_ptr>> const&) + 0x138
[0x56408acbf0a7] marian::Train:: run () + 0x5c7
[0x56408ac08146] mainTrainer (int, char**) + 0x136
[0x56408abbd8c5] main + 0x35
[0x7f6d39389bf7] __libc_start_main + 0xe7
[0x56408ac066fa] _start + 0x2a

Aborted (core dumped)

Context

train.log