marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

Cublas Error: 13 #333

Open xiaohw-97 opened 4 years ago

xiaohw-97 commented 4 years ago

Hi, there seems to be a bug when I try to train a transformer model. It also happens when I use Marian-dev.

The error log shows:

[2020-05-03 12:35:12] [config] after-batches: 0 [2020-05-03 12:35:12] [config] after-epochs: 2 [2020-05-03 12:35:12] [config] all-caps-every: 0 [2020-05-03 12:35:12] [config] allow-unk: false [2020-05-03 12:35:12] [config] authors: false [2020-05-03 12:35:12] [config] beam-size: 12 [2020-05-03 12:35:12] [config] bert-class-symbol: "[CLS]" [2020-05-03 12:35:12] [config] bert-mask-symbol: "[MASK]" [2020-05-03 12:35:12] [config] bert-masking-fraction: 0.15 [2020-05-03 12:35:12] [config] bert-sep-symbol: "[SEP]" [2020-05-03 12:35:12] [config] bert-train-type-embeddings: true [2020-05-03 12:35:12] [config] bert-type-vocab-size: 2 [2020-05-03 12:35:12] [config] build-info: "" [2020-05-03 12:35:12] [config] cite: false [2020-05-03 12:35:12] [config] clip-gemm: 0 [2020-05-03 12:35:12] [config] clip-norm: 5 [2020-05-03 12:35:12] [config] cost-scaling: [2020-05-03 12:35:12] [config] [] [2020-05-03 12:35:12] [config] cost-type: perplexity [2020-05-03 12:35:12] [config] cpu-threads: 0 [2020-05-03 12:35:12] [config] data-weighting: "" [2020-05-03 12:35:12] [config] data-weighting-type: sentence [2020-05-03 12:35:12] [config] dec-cell: gru [2020-05-03 12:35:12] [config] dec-cell-base-depth: 2 [2020-05-03 12:35:12] [config] dec-cell-high-depth: 1 [2020-05-03 12:35:12] [config] dec-depth: 6 [2020-05-03 12:35:12] [config] devices: [2020-05-03 12:35:12] [config] - 0 [2020-05-03 12:35:12] [config] dim-emb: 512 [2020-05-03 12:35:12] [config] dim-rnn: 1024 [2020-05-03 12:35:12] [config] dim-vocabs: [2020-05-03 12:35:12] [config] - 0 [2020-05-03 12:35:12] [config] - 0 [2020-05-03 12:35:12] [config] disp-first: 0 [2020-05-03 12:35:12] [config] disp-freq: 1000 [2020-05-03 12:35:12] [config] disp-label-counts: false [2020-05-03 12:35:12] [config] dropout-rnn: 0 [2020-05-03 12:35:12] [config] dropout-src: 0 [2020-05-03 12:35:12] [config] dropout-trg: 0 [2020-05-03 12:35:12] [config] dump-config: "" [2020-05-03 12:35:12] [config] early-stopping: 5 [2020-05-03 12:35:12] [config] embedding-fix-src: false [2020-05-03 12:35:12] [config] embedding-fix-trg: false [2020-05-03 12:35:12] [config] embedding-normalization: false [2020-05-03 12:35:12] [config] embedding-vectors: [2020-05-03 12:35:12] [config] [] [2020-05-03 12:35:12] [config] enc-cell: gru [2020-05-03 12:35:12] [config] enc-cell-depth: 1 [2020-05-03 12:35:12] [config] enc-depth: 6 [2020-05-03 12:35:12] [config] enc-type: bidirectional [2020-05-03 12:35:12] [config] english-title-case-every: 0 [2020-05-03 12:35:12] [config] exponential-smoothing: 0.0001 [2020-05-03 12:35:12] [config] factor-weight: 1 [2020-05-03 12:35:12] [config] grad-dropping-momentum: 0 [2020-05-03 12:35:12] [config] grad-dropping-rate: 0 [2020-05-03 12:35:12] [config] grad-dropping-warmup: 100 [2020-05-03 12:35:12] [config] gradient-checkpointing: false [2020-05-03 12:35:12] [config] guided-alignment: none [2020-05-03 12:35:12] [config] guided-alignment-cost: mse [2020-05-03 12:35:12] [config] guided-alignment-weight: 0.1 [2020-05-03 12:35:12] [config] ignore-model-config: false [2020-05-03 12:35:12] [config] input-types: [2020-05-03 12:35:12] [config] [] [2020-05-03 12:35:12] [config] interpolate-env-vars: false [2020-05-03 12:35:12] [config] keep-best: true [2020-05-03 12:35:12] [config] label-smoothing: 0.1 [2020-05-03 12:35:12] [config] layer-normalization: false [2020-05-03 12:35:12] [config] learn-rate: 0.0003 [2020-05-03 12:35:12] [config] lemma-dim-emb: 0 [2020-05-03 12:35:12] [config] log: models/lm.1/train.log [2020-05-03 12:35:12] [config] log-level: info [2020-05-03 12:35:12] 
[config] log-time-zone: "" [2020-05-03 12:35:12] [config] lr-decay: 0 [2020-05-03 12:35:12] [config] lr-decay-freq: 50000 [2020-05-03 12:35:12] [config] lr-decay-inv-sqrt: [2020-05-03 12:35:12] [config] - 16000 [2020-05-03 12:35:12] [config] lr-decay-repeat-warmup: false [2020-05-03 12:35:12] [config] lr-decay-reset-optimizer: false [2020-05-03 12:35:12] [config] lr-decay-start: [2020-05-03 12:35:12] [config] - 10 [2020-05-03 12:35:12] [config] - 1 [2020-05-03 12:35:12] [config] lr-decay-strategy: epoch+stalled [2020-05-03 12:35:12] [config] lr-report: true [2020-05-03 12:35:12] [config] lr-warmup: 16000 [2020-05-03 12:35:12] [config] lr-warmup-at-reload: false [2020-05-03 12:35:12] [config] lr-warmup-cycle: false [2020-05-03 12:35:12] [config] lr-warmup-start-rate: 0 [2020-05-03 12:35:12] [config] max-length: 120 [2020-05-03 12:35:12] [config] max-length-crop: true [2020-05-03 12:35:12] [config] max-length-factor: 3 [2020-05-03 12:35:12] [config] maxi-batch: 1000 [2020-05-03 12:35:12] [config] maxi-batch-sort: trg [2020-05-03 12:35:12] [config] mini-batch: 1000 [2020-05-03 12:35:12] [config] mini-batch-fit: true [2020-05-03 12:35:12] [config] mini-batch-fit-step: 10 [2020-05-03 12:35:12] [config] mini-batch-track-lr: false [2020-05-03 12:35:12] [config] mini-batch-warmup: 0 [2020-05-03 12:35:12] [config] mini-batch-words: 0 [2020-05-03 12:35:12] [config] mini-batch-words-ref: 0 [2020-05-03 12:35:12] [config] model: models/lm.1/model.npz [2020-05-03 12:35:12] [config] multi-loss-type: sum [2020-05-03 12:35:12] [config] multi-node: false [2020-05-03 12:35:12] [config] multi-node-overlap: true [2020-05-03 12:35:12] [config] n-best: false [2020-05-03 12:35:12] [config] no-nccl: false [2020-05-03 12:35:12] [config] no-reload: false [2020-05-03 12:35:12] [config] no-restore-corpus: false [2020-05-03 12:35:12] [config] normalize: 0 [2020-05-03 12:35:12] [config] normalize-gradient: false [2020-05-03 12:35:12] [config] num-devices: 0 [2020-05-03 12:35:12] [config] optimizer: adam [2020-05-03 12:35:12] [config] optimizer-delay: 2 [2020-05-03 12:35:12] [config] optimizer-params: [2020-05-03 12:35:12] [config] - 0.9 [2020-05-03 12:35:12] [config] - 0.98 [2020-05-03 12:35:12] [config] - 1e-09 [2020-05-03 12:35:12] [config] overwrite: true [2020-05-03 12:35:12] [config] precision: [2020-05-03 12:35:12] [config] - float32 [2020-05-03 12:35:12] [config] - float32 [2020-05-03 12:35:12] [config] - float32 [2020-05-03 12:35:12] [config] pretrained-model: "" [2020-05-03 12:35:12] [config] quiet: false [2020-05-03 12:35:12] [config] quiet-translation: false [2020-05-03 12:35:12] [config] relative-paths: false [2020-05-03 12:35:12] [config] right-left: false [2020-05-03 12:35:12] [config] save-freq: 10000 [2020-05-03 12:35:12] [config] seed: 0 [2020-05-03 12:35:12] [config] sentencepiece-alphas: [2020-05-03 12:35:12] [config] [] [2020-05-03 12:35:12] [config] sentencepiece-max-lines: 10000000 [2020-05-03 12:35:12] [config] sentencepiece-options: "" [2020-05-03 12:35:12] [config] shuffle: data [2020-05-03 12:35:12] [config] shuffle-in-ram: false [2020-05-03 12:35:12] [config] skip: false [2020-05-03 12:35:12] [config] sqlite: "" [2020-05-03 12:35:12] [config] sqlite-drop: false [2020-05-03 12:35:12] [config] sync-sgd: true [2020-05-03 12:35:12] [config] tempdir: /tmp [2020-05-03 12:35:12] [config] tied-embeddings: false [2020-05-03 12:35:12] [config] tied-embeddings-all: true [2020-05-03 12:35:12] [config] tied-embeddings-src: false [2020-05-03 12:35:12] [config] train-sets: [2020-05-03 12:35:12] [config] - 
./data/mono.lc.bpe.gz [2020-05-03 12:35:12] [config] transformer-aan-activation: swish [2020-05-03 12:35:12] [config] transformer-aan-depth: 2 [2020-05-03 12:35:12] [config] transformer-aan-nogate: false [2020-05-03 12:35:12] [config] transformer-decoder-autoreg: self-attention [2020-05-03 12:35:12] [config] transformer-depth-scaling: false [2020-05-03 12:35:12] [config] transformer-dim-aan: 2048 [2020-05-03 12:35:12] [config] transformer-dim-ffn: 2048 [2020-05-03 12:35:12] [config] transformer-dropout: 0.1 [2020-05-03 12:35:12] [config] transformer-dropout-attention: 0 [2020-05-03 12:35:12] [config] transformer-dropout-ffn: 0 [2020-05-03 12:35:12] [config] transformer-ffn-activation: swish [2020-05-03 12:35:12] [config] transformer-ffn-depth: 2 [2020-05-03 12:35:12] [config] transformer-guided-alignment-layer: last [2020-05-03 12:35:12] [config] transformer-heads: 8 [2020-05-03 12:35:12] [config] transformer-no-projection: false [2020-05-03 12:35:12] [config] transformer-postprocess: dan [2020-05-03 12:35:12] [config] transformer-postprocess-emb: d [2020-05-03 12:35:12] [config] transformer-preprocess: "" [2020-05-03 12:35:12] [config] transformer-tied-layers: [2020-05-03 12:35:12] [config] [] [2020-05-03 12:35:12] [config] transformer-train-position-embeddings: false [2020-05-03 12:35:12] [config] type: lm-transformer [2020-05-03 12:35:12] [config] ulr: false [2020-05-03 12:35:12] [config] ulr-dim-emb: 0 [2020-05-03 12:35:12] [config] ulr-dropout: 0 [2020-05-03 12:35:12] [config] ulr-keys-vectors: "" [2020-05-03 12:35:12] [config] ulr-query-vectors: "" [2020-05-03 12:35:12] [config] ulr-softmax-temperature: 1 [2020-05-03 12:35:12] [config] ulr-trainable-transformation: false [2020-05-03 12:35:12] [config] unlikelihood-loss: false [2020-05-03 12:35:12] [config] valid-freq: 10000 [2020-05-03 12:35:12] [config] valid-log: models/lm.1/valid.log [2020-05-03 12:35:12] [config] valid-max-length: 1000 [2020-05-03 12:35:12] [config] valid-metrics: [2020-05-03 12:35:12] [config] - perplexity [2020-05-03 12:35:12] [config] - ce-mean-words [2020-05-03 12:35:12] [config] valid-mini-batch: 16 [2020-05-03 12:35:12] [config] valid-reset-stalled: false [2020-05-03 12:35:12] [config] valid-script-args: [2020-05-03 12:35:12] [config] [] [2020-05-03 12:35:12] [config] valid-script-path: "" [2020-05-03 12:35:12] [config] valid-sets: [2020-05-03 12:35:12] [config] - ./data/devset.lm.lc.bpe.cor [2020-05-03 12:35:12] [config] valid-translation-output: "" [2020-05-03 12:35:12] [config] vocabs: [2020-05-03 12:35:12] [config] - ./data/helpers/vocab.yml [2020-05-03 12:35:12] [config] word-penalty: 0 [2020-05-03 12:35:12] [config] word-scores: false [2020-05-03 12:35:12] [config] workspace: 9000 [2020-05-03 12:35:12] [config] Model is being created with Marian v1.9.0 3c7a88f4 2020-03-10 11:34:07 -0700 [2020-05-03 12:35:12] Using synchronous training [2020-05-03 12:35:12] [data] Loading vocabulary from JSON/Yaml file ./data/helpers/vocab.yml [2020-05-03 12:35:12] [data] Setting vocabulary size for input 0 to 50277 [2020-05-03 12:35:12] Compiled without MPI support. Falling back to FakeMPIWrapper [2020-05-03 12:35:12] [batching] Collecting statistics for batch fitting with step size 10 [2020-05-03 12:35:13] [memory] Extending reserved space to 9088 MB (device gpu0) [2020-05-03 12:35:13] [comm] Using NCCL 2.3.7 for GPU communication [2020-05-03 12:35:13] [comm] NCCLCommunicator constructed successfully. 
[2020-05-03 12:35:13] [training] Using 1 GPUs [2020-05-03 12:35:13] [logits] applyLossFunction() for 1 factors [2020-05-03 12:35:13] [memory] Reserving 170 MB, device gpu0 [2020-05-03 12:35:13] [gpu] 16-bit TensorCores enabled for float32 matrix operations [2020-05-03 12:35:13] Error: Cublas Error: 13 - /home/k19234training/tools/marian/src/tensors/gpu/prod.cpp:173: cublasGemmTyped(cublasHandle, computeCapability, opB, opA, n, m, k, &alpha, B->data(), ldb, A->data(), lda, &beta, C->data(), ldc) [2020-05-03 12:35:13] Error: Aborted from void marian::gpu::ProdTyped(marian::Tensor, const Tensor&, const Tensor&, bool, bool, T, T) [with T = float; marian::Tensor = IntrusivePtr] in /home/k19234/training/tools/marian/src/tensors/gpu/prod.cpp:173

snukky commented 4 years ago

Could you provide the cmake command you use and attach the output of --build-info all?
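For example, from the build directory (the path and the output file name here are just placeholders, adjust them to your setup):

cd marian/build
./marian --build-info all > marian-build-info.txt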

xiaohw-97 commented 4 years ago

The cmake command is:

cmake .. -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-9.0 -DUSE_SENTENCEPIECE=on

One weird thing is that when I tried to use marian-decoder from Marian, it worked well; however, marian from Marian has this problem.

Both marian-decoder and marian have this problem under Marian-dev, which I think is the newer version.

build information is shown below: -- The CXX compiler identification is GNU 5.4.0 -- The C compiler identification is GNU 5.4.0 -- Check for working CXX compiler: /usr/bin/c++ -- Check for working CXX compiler: /usr/bin/c++ -- works -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Detecting CXX compile features -- Detecting CXX compile features - done -- Check for working C compiler: /usr/bin/cc -- Check for working C compiler: /usr/bin/cc -- works -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Detecting C compile features -- Detecting C compile features - done -- Project name: marian -- Project version: v1.9.0+3c7a88f4 CMake Warning at CMakeLists.txt:55 (message): CMAKE_BUILD_TYPE not set; setting to Release

-- Checking support for CPU intrinsics -- SSE2 support found -- SSE3 support found -- SSE4.1 support found -- SSE4.2 support found -- AVX support found -- AVX2 support found -- AVX512 support found -- Looking for pthread.h -- Looking for pthread.h - found -- Looking for pthread_create -- Looking for pthread_create - not found -- Looking for pthread_create in pthreads -- Looking for pthread_create in pthreads - not found -- Looking for pthread_create in pthread -- Looking for pthread_create in pthread - found -- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda-9.0 (found suitable version "9.0", minimum required is "9.0") -- Found CUDA libraries: /usr/local/cuda-9.0/lib64/libcurand.so /usr/local/cuda-9.0/lib64/libcusparse.so /usr/local/cuda-9.0/lib64/libcublas.so;/usr/local/cuda-9.0/lib64/libcublas_device.a -- Found Tcmalloc: /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so -- Found MKL: -Wl,--start-group;/opt/intel/mkl/lib/intel64/libmkl_intel_ilp64.a;/opt/intel/mkl/lib/intel64/libmkl_sequential.a;/opt/intel/mkl/lib/intel64/libmkl_core.a;-Wl,--end-group
-- VERSION: 0.1.6 -- Found Protobuf: /usr/lib/x86_64-linux-gnu/libprotobuf.so;-lpthread (found version "3.0.0") -- Found TCMalloc: /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so -- Found Doxygen: /home/k1923403/miniconda3/envs/2018/bin/doxygen (found version "1.8.18 (13b3d330f24ff38b30d8b58ebd2ded6ec1ab7a85*)") found components: doxygen missing components: dot -- Configuring done -- Generating done

Gldkslfmsd commented 3 years ago

Hi, I'm running into the same Cublas Error: 13:

...
[2020-09-30 12:00:55] [config] Model is being created with Marian v1.9.0 ba94c5b9 2020-05-17 10:42:17 +0100
[2020-09-30 12:00:55] Using single-device training
[2020-09-30 12:00:55] [data] Creating vocabulary model/vocab.ro.yml from data/corpus.bpe.ro
[2020-09-30 12:01:03] [data] Loading vocabulary from JSON/Yaml file model/vocab.ro.yml
[2020-09-30 12:01:03] [data] Setting vocabulary size for input 0 to 66000
[2020-09-30 12:01:03] [data] Creating vocabulary model/vocab.en.yml from data/corpus.bpe.en
[2020-09-30 12:01:10] [data] Loading vocabulary from JSON/Yaml file model/vocab.en.yml
[2020-09-30 12:01:10] [data] Setting vocabulary size for input 1 to 50000
[2020-09-30 12:01:10] Compiled without MPI support. Falling back to FakeMPIWrapper
[2020-09-30 12:01:10] [batching] Collecting statistics for batch fitting with step size 10
[2020-09-30 12:02:48] [memory] Extending reserved space to 3072 MB (device gpu0)
[2020-09-30 12:02:49] [logits] applyLossFunction() for 1 factors
[2020-09-30 12:02:49] [memory] Reserving 453 MB, device gpu0
[2020-09-30 12:02:49] [gpu] 16-bit TensorCores enabled for float32 matrix operations
[2020-09-30 12:02:49] Error: Cublas Error: 13 - /lnet/troja/projects/elitr/gputest/marian-gputest/src/tensors/gpu/prod.cpp:173: cublasGemmTyped(cublasHandle, computeCapability, opB, opA, n, m, k, &alpha, B->data<T>(), ldb, A->data<T>(), lda, &beta, C->data<T>(), ldc)
[2020-09-30 12:02:49] Error: Aborted from void marian::gpu::ProdTyped(marian::Tensor, const Tensor&, const Tensor&, bool, bool, T, T) [with T = float; marian::Tensor = IntrusivePtr<marian::TensorBase>] in /lnet/troja/projects/elitr/gputest/marian-gputest/src/tensors/gpu/prod.cpp:173
--build-info all
AVX2_FOUND=true
AVX512_FOUND=true
AVX_FOUND=true
BUILD_ARCH=native
CMAKE_AR=/usr/bin/ar
CMAKE_BUILD_TYPE=Release
CMAKE_COLOR_MAKEFILE=ON
CMAKE_CXX_COMPILER=/usr/bin/c++
CMAKE_CXX_COMPILER_AR=/usr/bin/gcc-ar-7
CMAKE_CXX_COMPILER_RANLIB=/usr/bin/gcc-ranlib-7
CMAKE_CXX_FLAGS=-std=c++11 -pthread -Wl,--no-as-needed -fPIC -Wno-unused-result -Wno-unknown-warning-option  -march=native  -msse2 -msse3 -msse4.1 -msse4.2 -mavx -mavx2 -mavx512f -DUSE_SENTENCEPIECE -DCUDA_FOUND -DUSE_NCCL -DMKL_ILP64 -m64
CMAKE_CXX_FLAGS_DEBUG=-O0 -g -rdynamic
CMAKE_CXX_FLAGS_MINSIZEREL=-Os -DNDEBUG
CMAKE_CXX_FLAGS_RELEASE=-Ofast -m64 -funroll-loops -ffinite-math-only -g -rdynamic
CMAKE_CXX_FLAGS_RELWITHDEBINFO=-Ofast -m64 -funroll-loops -ffinite-math-only -g -rdynamic
CMAKE_C_COMPILER=/usr/bin/cc
CMAKE_C_COMPILER_AR=/usr/bin/gcc-ar-7
CMAKE_C_COMPILER_RANLIB=/usr/bin/gcc-ranlib-7
CMAKE_C_FLAGS=-pthread -Wl,--no-as-needed -fPIC -Wno-unused-result -Wno-unknown-warning-option  -march=native  -msse2 -msse3 -msse4.1 -msse4.2 -mavx -mavx2 -mavx512f -DMKL_ILP64 -m64
CMAKE_C_FLAGS_DEBUG=-O0 -g -rdynamic
CMAKE_C_FLAGS_MINSIZEREL=-Os -DNDEBUG
CMAKE_C_FLAGS_RELEASE=-O3 -m64 -funroll-loops -ffinite-math-only -g -rdynamic
CMAKE_C_FLAGS_RELWITHDEBINFO=-O3 -m64 -funroll-loops -ffinite-math-only -g -rdynamic
CMAKE_EXPORT_COMPILE_COMMANDS=OFF
CMAKE_INSTALL_PREFIX=/usr/local
CMAKE_LINKER=/usr/bin/ld
CMAKE_MAKE_PROGRAM=/usr/bin/make
CMAKE_NM=/usr/bin/nm
CMAKE_OBJCOPY=/usr/bin/objcopy
CMAKE_OBJDUMP=/usr/bin/objdump
CMAKE_RANLIB=/usr/bin/ranlib
CMAKE_SKIP_INSTALL_RPATH=NO
CMAKE_SKIP_RPATH=NO
CMAKE_STRIP=/usr/bin/strip
CMAKE_VERBOSE_MAKEFILE=FALSE
COMPILE_CPU=ON
COMPILE_CUDA=ON
COMPILE_CUDA_SM35=ON
COMPILE_CUDA_SM50=ON
COMPILE_CUDA_SM60=ON
COMPILE_CUDA_SM70=ON
COMPILE_EXAMPLES=OFF
COMPILE_SERVER=OFF
COMPILE_TESTS=OFF
CUDA_64_BIT_DEVICE_CODE=ON
CUDA_ATTACH_VS_BUILD_RULE_TO_CUDA_FILE=ON
CUDA_BUILD_CUBIN=OFF
CUDA_BUILD_EMULATION=OFF
CUDA_CUDART_LIBRARY=/opt/cuda/9.2/lib64/libcudart.so
CUDA_CUDA_LIBRARY=/usr/lib/x86_64-linux-gnu/libcuda.so
CUDA_HOST_COMPILATION_CPP=ON
CUDA_HOST_COMPILER=/usr/bin/cc
CUDA_NVCC_EXECUTABLE=/opt/cuda/9.2/bin/nvcc
CUDA_NVCC_FLAGS=-DCUDA_FOUND;-DUSE_NCCL;--default-stream;per-thread;-O3;-g;--use_fast_math;-arch=sm_35;-gencode=arch=compute_35,code=sm_35;-gencode=arch=compute_50,code=sm_50;-gencode=arch=compute_52,code=sm_52;-gencode=arch=compute_60,code=sm_60;-gencode=arch=compute_61,code=sm_61;-gencode=arch=compute_70,code=sm_70;-gencode=arch=compute_70,code=compute_70;-ccbin;/usr/bin/cc;-std=c++11;-Xcompiler -fPIC;-Xcompiler -Wno-unused-result;-Xcompiler -Wno-deprecated;-Xcompiler -Wno-pragmas;-Xcompiler -Wno-unused-value;-Xcompiler -Werror;-Xcompiler -msse2;-Xcompiler -msse3;-Xcompiler -msse4.1;-Xcompiler -msse4.2;-Xcompiler -mavx;-Xcompiler -mavx2;-Xcompiler -mavx512f
CUDA_PROPAGATE_HOST_FLAGS=OFF
CUDA_SDK_ROOT_DIR=CUDA_SDK_ROOT_DIR-NOTFOUND
CUDA_SEPARABLE_COMPILATION=OFF
CUDA_TOOLKIT_INCLUDE=/opt/cuda/9.2/include
CUDA_TOOLKIT_ROOT_DIR=/opt/cuda/9.2
CUDA_USE_STATIC_CUDA_RUNTIME=ON
CUDA_VERBOSE_BUILD=OFF
CUDA_VERSION=9.2
CUDA_cublas_LIBRARY=/opt/cuda/9.2/lib64/libcublas.so
CUDA_cublas_device_LIBRARY=/opt/cuda/9.2/lib64/libcublas_device.a
CUDA_cudadevrt_LIBRARY=/opt/cuda/9.2/lib64/libcudadevrt.a
CUDA_cudart_static_LIBRARY=/opt/cuda/9.2/lib64/libcudart_static.a
CUDA_cufft_LIBRARY=/opt/cuda/9.2/lib64/libcufft.so
CUDA_cupti_LIBRARY=/opt/cuda/9.2/extras/CUPTI/lib64/libcupti.so
CUDA_curand_LIBRARY=/opt/cuda/9.2/lib64/libcurand.so
CUDA_cusolver_LIBRARY=/opt/cuda/9.2/lib64/libcusolver.so
CUDA_cusparse_LIBRARY=/opt/cuda/9.2/lib64/libcusparse.so
CUDA_nppc_LIBRARY=/opt/cuda/9.2/lib64/libnppc.so
CUDA_nppial_LIBRARY=/opt/cuda/9.2/lib64/libnppial.so
CUDA_nppicc_LIBRARY=/opt/cuda/9.2/lib64/libnppicc.so
CUDA_nppicom_LIBRARY=/opt/cuda/9.2/lib64/libnppicom.so
CUDA_nppidei_LIBRARY=/opt/cuda/9.2/lib64/libnppidei.so
CUDA_nppif_LIBRARY=/opt/cuda/9.2/lib64/libnppif.so
CUDA_nppig_LIBRARY=/opt/cuda/9.2/lib64/libnppig.so
CUDA_nppim_LIBRARY=/opt/cuda/9.2/lib64/libnppim.so
CUDA_nppist_LIBRARY=/opt/cuda/9.2/lib64/libnppist.so
CUDA_nppisu_LIBRARY=/opt/cuda/9.2/lib64/libnppisu.so
CUDA_nppitc_LIBRARY=/opt/cuda/9.2/lib64/libnppitc.so
CUDA_npps_LIBRARY=/opt/cuda/9.2/lib64/libnpps.so
CUDA_rt_LIBRARY=/usr/lib/x86_64-linux-gnu/librt.so
GIT_EXECUTABLE=/usr/bin/git
INTEL_ROOT=/opt/intel
MKL_CORE_LIBRARY=/opt/intel/mkl/lib/intel64/libmkl_core.a
MKL_INCLUDE_DIR=/opt/intel/mkl/include
MKL_INCLUDE_DIRS=/opt/intel/mkl/include
MKL_INTERFACE_LIBRARY=/opt/intel/mkl/lib/intel64/libmkl_intel_ilp64.a
MKL_LIBRARIES=-Wl,--start-group;/opt/intel/mkl/lib/intel64/libmkl_intel_ilp64.a;/opt/intel/mkl/lib/intel64/libmkl_sequential.a;/opt/intel/mkl/lib/intel64/libmkl_core.a;-Wl,--end-group
MKL_ROOT=/opt/intel/mkl
MKL_SEQUENTIAL_LAYER_LIBRARY=/opt/intel/mkl/lib/intel64/libmkl_sequential.a
PROTOBUF_INCLUDE_DIR=/lnet/troja/projects/elitr/gputest/marian-gputest/protobuf-3.6.1/include
PROTOBUF_LIBRARY=/lnet/troja/projects/elitr/gputest/marian-gputest/protobuf-3.6.1/lib/libprotobuf.so
PROTOBUF_PROTOC_EXECUTABLE=/lnet/troja/projects/elitr/gputest/marian-gputest/protobuf-3.6.1/bin/protoc
SSE2_FOUND=true
SSE3_FOUND=true
SSE4_1_FOUND=true
SSE4_2_FOUND=true
SSSE3_FOUND=true
Tcmalloc_INCLUDE_DIR=Tcmalloc_INCLUDE_DIR-NOTFOUND
Tcmalloc_LIBRARY=Tcmalloc_LIBRARY-NOTFOUND
USE_CCACHE=OFF
USE_CUDNN=OFF
USE_DOXYGEN=ON
USE_FBGEMM=OFF
USE_MKL=ON
USE_MPI=OFF
USE_NCCL=ON
USE_SENTENCEPIECE=on
USE_STATIC_LIBS=OFF

Can anyone help me, please?

a-cavalcanti commented 3 years ago

Hello!

I'm having the same Cublas Error: 13. When I train with the s2s model type, it works, but with the transformer type it doesn't.
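To illustrate, the two runs differ only in the --type option (the file paths below are placeholders, not my actual data):

# works
./marian --type s2s --train-sets corpus.src corpus.trg --vocabs vocab.src.yml vocab.trg.yml --model model/model.npz --devices 0
# fails with Cublas Error: 13
./marian --type transformer --train-sets corpus.src corpus.trg --vocabs vocab.src.yml vocab.trg.yml --model model/model.npz --devices 0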

Does anyone know why this happens? Could anyone help me, please?

kpu commented 3 years ago

Seen on the brand-new 3090s.

[2021-04-27 18:45:38] Error: Cublas Error: 13 - /home/heafield/marian-dev/src/tensors/gpu/prod.cpp:118: cublasGemmEx(handle, transa, transb, m, n, k, alpha, A, CUDA_R_32F, lda, B, CUDA_R_32F, ldb, beta, C, CUDA_R_32F, ldc, CUDA_R_32F, algorithm)
[2021-04-27 18:45:38] Error: Aborted from static void marian::gpu::TypedGemm<float, float>::gemm(cublasHandle_t, marian::gpu::CudaCompute, cublasOperation_t, cublasOperation_t, int, int, int, const float*, const float*, int, const float*, int, const float*, float*, int) in /home/heafield/marian-dev/src/tensors/gpu/prod.cpp:118

[CALL STACK]
[0x112537d]         marian::gpu::TypedGemm<float,float>::  gemm  (cublasContext*,  marian::gpu::CudaCompute,  cublasOperation_t,  cublasOperation_t,  int,  int,  int,  float const*,  float const*,  int,  float const*,  int,  float const*,  float*,  int) + 0x63d
[0x112c290]         void marian::gpu::  ProdTyped  <float,float>(IntrusivePtr<marian::TensorBase>,  IntrusivePtr<marian::TensorBase> const&,  IntrusivePtr<marian::TensorBase> const&,  bool,  bool,  float,  float) + 0x15f0
[0x112163e]         marian::gpu::  Affine  (IntrusivePtr<marian::TensorBase>,  std::shared_ptr<marian::Allocator>,  IntrusivePtr<marian::TensorBase> const&,  IntrusivePtr<marian::TensorBase> const&,  IntrusivePtr<marian::TensorBase> const&,  bool,  bool,  float,  float,  bool) + 0x3fe
[0xe87a06]                                                            
[0xed20b3]          std::_Function_handler<void (),marian::AffineNodeOp::forwardOps()::{lambda()#1}>::  _M_invoke  (std::_Any_data const&) + 0x1f3
[0xa831f7]          marian::Node::  forward  ()                        + 0x22f
[0xa798da]          marian::ExpressionGraph::  forward  (std::__cxx11::list<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,std::allocator<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>>&,  bool) + 0x75a
[0xa7ae72]          marian::ExpressionGraph::  forwardNext  ()         + 0x182
[0xc0a47e]          marian::GraphGroup::  collectStats  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::models::ICriterionFunction>,  std::vector<std::shared_ptr<marian::Vocab>,std::allocator<std::shared_ptr<marian::Vocab>>> const&,  double) + 0xeae
[0xbf733f]          marian::SyncGraphGroup::  collectStats  (std::vector<std::shared_ptr<marian::Vocab>,std::allocator<std::shared_ptr<marian::Vocab>>> const&) + 0x13f
[0x7f4768]          marian::Train<marian::SyncGraphGroup>::  run  ()   + 0x3e8
[0x71b448]          mainTrainer  (int,  char**)                        + 0xc8
[0x6d7209]          main                                               + 0x89
[0x7f3ab4a14840]    __libc_start_main                                  + 0xf0
[0x718ad9]          _start                                             + 0x29

Aborted
emjotde commented 3 years ago

Hi, is everyone in this thread using GPUs with Ampere chips?
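If you are not sure, nvidia-smi lists the GPU models installed in the machine (the RTX 30xx cards are Ampere):

nvidia-smi -L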

srdecny commented 3 years ago

I am also having this error with a 3060Ti on Nvidia 465 drivers (running inside a Docker container).

czech2en-mt_1  | [2021-04-28 12:14:19] [memory] Reserving 797 MB, device gpu0
czech2en-mt_1  | [2021-04-28 12:15:58] [gpu] 16-bit TensorCores enabled for float32 matrix operations
czech2en-mt_1  | [2021-04-28 12:15:58] Error: Cublas Error: 13 - /marian/src/tensors/gpu/prod.cpp:330: cublasGemmBatchedTyped(cublasHandle, compute, opB, opA, n, m, k, &alpha, mp_bptr->data<const T*>(), ldb, mp_aptr->data<const T*>(), lda, &beta, mp_cptr->data<T*>(), ldc, batchC)
czech2en-mt_1  | [2021-04-28 12:15:58] Error: Aborted from void marian::gpu::ProdBatchedTyped(marian::Tensor, marian::Ptr<marian::Allocator>, marian::Tensor, marian::Tensor, bool, bool, T, T) [with T = float; marian::Tensor = IntrusivePtr<marian::TensorBase>; marian::Ptr<marian::Allocator> = std::shared_ptr<marian::Allocator>] in /marian/src/tensors/gpu/prod.cpp:330
czech2en-mt_1  | 
czech2en-mt_1  | [CALL STACK]
czech2en-mt_1  | [0x556cd87f3a51]    void marian::gpu::  ProdBatchedTyped  <float>(IntrusivePtr<marian::TensorBase>,  std::shared_ptr<marian::Allocator>,  IntrusivePtr<marian::TensorBase>,  IntrusivePtr<marian::TensorBase>,  bool,  bool,  float,  float) + 0xfa1
czech2en-mt_1  | [0x556cd87e9b23]    marian::gpu::  ProdBatched  (IntrusivePtr<marian::TensorBase>,  std::shared_ptr<marian::Allocator>,  IntrusivePtr<marian::TensorBase>,  IntrusivePtr<marian::TensorBase>,  bool,  bool,  float,  float) + 0x403
czech2en-mt_1  | [0x556cd85b0103]                                                       + 0x578103
czech2en-mt_1  | [0x556cd85f68a4]    std::_Function_handler<void (),marian::DotBatchedNodeOp::forwardOps()::{lambda()#1}>::  _M_invoke  (std::_Any_data const&) + 0x1b4
czech2en-mt_1  | [0x556cd845d331]    marian::Node::  forward  ()                        + 0x211
czech2en-mt_1  | [0x556cd842c859]    marian::ExpressionGraph::  forward  (std::__cxx11::list<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,std::allocator<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>>&,  bool) + 0x229
czech2en-mt_1  | [0x556cd842e421]    marian::ExpressionGraph::  forwardNext  ()         + 0x231
czech2en-mt_1  | [0x556cd8492504]    marian::BeamSearch::  search  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::CorpusBatch>) + 0x39b4
czech2en-mt_1  | [0x556cd83129d2]    marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}::  operator()  (unsigned long) const + 0x122
czech2en-mt_1  | [0x556cd8312ebb]    marian::ThreadPool::enqueue<marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&,unsigned long&>(std::result_of&&,(marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&)...)::{lambda()#1}::  operator()  () const + 0x2b
czech2en-mt_1  | [0x556cd83139f0]    std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&,unsigned long&>(std::result_of&&,(marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&)...)::{lambda()#1},std::allocator<int>,void ()>::_M_run()::{lambda()#1},void>>::  _M_invoke  (std::_Any_data const&) + 0x30
czech2en-mt_1  | [0x556cd82afce9]    std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x29
czech2en-mt_1  | [0x7ff281308907]                                                       + 0xf907
czech2en-mt_1  | [0x556cd82b42da]    std::_Function_handler<void (),marian::ThreadPool::enqueue<marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&,unsigned long&>(std::result_of&&,(marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&)...)::{lambda()#3}>::  _M_invoke  (std::_Any_data const&) + 0x13a
czech2en-mt_1  | [0x556cd82b2e95]    std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x1b5
czech2en-mt_1  | [0x7ff28102d6df]                                                       + 0xbd6df
czech2en-mt_1  | [0x7ff2813006db]                                                       + 0x76db
czech2en-mt_1  | [0x7ff2806ea71f]    clone                                              + 0x3f
czech2en-mt_1  | 
czech2en-mt_1  | [2021-04-28 12:15:58] Error: Segmentation fault
czech2en-mt_1  | [2021-04-28 12:15:58] Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /marian/src/common/logging.cpp:130
czech2en-mt_1  | 
czech2en-mt_1  | [CALL STACK]
czech2en-mt_1  | [0x556cd83627e9]                                                       + 0x32a7e9
czech2en-mt_1  | [0x556cd8362b89]                                                       + 0x32ab89
czech2en-mt_1  | [0x7ff28130b980]                                                       + 0x12980
czech2en-mt_1  | [0x7ff280609a10]    abort                                              + 0x230
czech2en-mt_1  | [0x556cd87f3365]    void marian::gpu::  ProdBatchedTyped  <float>(IntrusivePtr<marian::TensorBase>,  std::shared_ptr<marian::Allocator>,  IntrusivePtr<marian::TensorBase>,  IntrusivePtr<marian::TensorBase>,  bool,  bool,  float,  float) + 0x8b5
czech2en-mt_1  | [0x556cd87e9b23]    marian::gpu::  ProdBatched  (IntrusivePtr<marian::TensorBase>,  std::shared_ptr<marian::Allocator>,  IntrusivePtr<marian::TensorBase>,  IntrusivePtr<marian::TensorBase>,  bool,  bool,  float,  float) + 0x403
czech2en-mt_1  | [0x556cd85b0103]                                                       + 0x578103
czech2en-mt_1  | [0x556cd85f68a4]    std::_Function_handler<void (),marian::DotBatchedNodeOp::forwardOps()::{lambda()#1}>::  _M_invoke  (std::_Any_data const&) + 0x1b4
czech2en-mt_1  | [0x556cd845d331]    marian::Node::  forward  ()                        + 0x211
czech2en-mt_1  | [0x556cd842c859]    marian::ExpressionGraph::  forward  (std::__cxx11::list<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,std::allocator<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>>&,  bool) + 0x229
czech2en-mt_1  | [0x556cd842e421]    marian::ExpressionGraph::  forwardNext  ()         + 0x231
czech2en-mt_1  | [0x556cd8492504]    marian::BeamSearch::  search  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::CorpusBatch>) + 0x39b4
czech2en-mt_1  | [0x556cd83129d2]    marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}::  operator()  (unsigned long) const + 0x122
czech2en-mt_1  | [0x556cd8312ebb]    marian::ThreadPool::enqueue<marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&,unsigned long&>(std::result_of&&,(marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&)...)::{lambda()#1}::  operator()  () const + 0x2b
czech2en-mt_1  | [0x556cd83139f0]    std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&,unsigned long&>(std::result_of&&,(marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&)...)::{lambda()#1},std::allocator<int>,void ()>::_M_run()::{lambda()#1},void>>::  _M_invoke  (std::_Any_data const&) + 0x30
czech2en-mt_1  | [0x556cd82afce9]    std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x29
czech2en-mt_1  | [0x7ff281308907]                                                       + 0xf907
czech2en-mt_1  | [0x556cd82b42da]    std::_Function_handler<void (),marian::ThreadPool::enqueue<marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&,unsigned long&>(std::result_of&&,(marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&)...)::{lambda()#3}>::  _M_invoke  (std::_Any_data const&) + 0x13a
czech2en-mt_1  | [0x556cd82b2e95]    std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x1b5
czech2en-mt_1  | [0x7ff28102d6df]                                                       + 0xbd6df
czech2en-mt_1  | [0x7ff2813006db]                                                       + 0x76db
czech2en-mt_1  | [0x7ff2806ea71f]    clone                                              + 0x3f
czech2en-mt_1  | 

Here's the build info:

root@3685366eba61:/marian/build# ./marian-server --build-info all
AVX2_FOUND=true
AVX512_FOUND=false
AVX_FOUND=true
BLAS_Accelerate_LIBRARY=BLAS_Accelerate_LIBRARY-NOTFOUND
BLAS_acml_LIBRARY=BLAS_acml_LIBRARY-NOTFOUND
BLAS_acml_mp_LIBRARY=BLAS_acml_mp_LIBRARY-NOTFOUND
BLAS_blas_LIBRARY=BLAS_blas_LIBRARY-NOTFOUND
BLAS_complib.sgimath_LIBRARY=BLAS_complib.sgimath_LIBRARY-NOTFOUND
BLAS_cxml_LIBRARY=BLAS_cxml_LIBRARY-NOTFOUND
BLAS_dxml_LIBRARY=BLAS_dxml_LIBRARY-NOTFOUND
BLAS_essl_LIBRARY=BLAS_essl_LIBRARY-NOTFOUND
BLAS_f77blas_LIBRARY=BLAS_f77blas_LIBRARY-NOTFOUND
BLAS_goto2_LIBRARY=BLAS_goto2_LIBRARY-NOTFOUND
BLAS_mkl_LIBRARY=BLAS_mkl_LIBRARY-NOTFOUND
BLAS_mkl_em64t_LIBRARY=BLAS_mkl_em64t_LIBRARY-NOTFOUND
BLAS_mkl_ia32_LIBRARY=BLAS_mkl_ia32_LIBRARY-NOTFOUND
BLAS_mkl_intel_LIBRARY=BLAS_mkl_intel_LIBRARY-NOTFOUND
BLAS_mkl_intel_lp64_LIBRARY=BLAS_mkl_intel_lp64_LIBRARY-NOTFOUND
BLAS_openblas_LIBRARY=BLAS_openblas_LIBRARY-NOTFOUND
BLAS_scsl_LIBRARY=BLAS_scsl_LIBRARY-NOTFOUND
BLAS_sgemm_LIBRARY=BLAS_sgemm_LIBRARY-NOTFOUND
BLAS_sunperf_LIBRARY=BLAS_sunperf_LIBRARY-NOTFOUND
BLAS_vecLib_LIBRARY=BLAS_vecLib_LIBRARY-NOTFOUND
BUILD_ARCH=native
Boost_DIR=Boost_DIR-NOTFOUND
Boost_INCLUDE_DIR=/usr/include
Boost_LIBRARY_DIR_DEBUG=/usr/lib/x86_64-linux-gnu
Boost_LIBRARY_DIR_RELEASE=/usr/lib/x86_64-linux-gnu
Boost_SYSTEM_LIBRARY_DEBUG=/usr/lib/x86_64-linux-gnu/libboost_system.so
Boost_SYSTEM_LIBRARY_RELEASE=/usr/lib/x86_64-linux-gnu/libboost_system.so
CMAKE_AR=/usr/bin/ar
CMAKE_BUILD_TYPE=Release
CMAKE_COLOR_MAKEFILE=ON
CMAKE_CXX_COMPILER=/usr/bin/c++
CMAKE_CXX_COMPILER_AR=/usr/bin/gcc-ar-7
CMAKE_CXX_COMPILER_RANLIB=/usr/bin/gcc-ranlib-7
CMAKE_CXX_FLAGS=-std=c++11 -pthread -Wl,--no-as-needed -fPIC -Wno-unused-result  -march=native  -msse2 -msse3 -msse4.1 -msse4.2 -mavx -mavx2 -DUSE_SENTENCEPIECE -DCUDA_FOUND -DUSE_NCCL
CMAKE_CXX_FLAGS_DEBUG=-O0 -g -rdynamic
CMAKE_CXX_FLAGS_MINSIZEREL=-Os -DNDEBUG
CMAKE_CXX_FLAGS_RELEASE=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_CXX_FLAGS_RELWITHDEBINFO=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_C_COMPILER=/usr/bin/cc
CMAKE_C_COMPILER_AR=/usr/bin/gcc-ar-7
CMAKE_C_COMPILER_RANLIB=/usr/bin/gcc-ranlib-7
CMAKE_C_FLAGS=-pthread -Wl,--no-as-needed -fPIC -Wno-unused-result  -march=native  -msse2 -msse3 -msse4.1 -msse4.2 -mavx -mavx2
CMAKE_C_FLAGS_DEBUG=-O0 -g -rdynamic
CMAKE_C_FLAGS_MINSIZEREL=-Os -DNDEBUG
CMAKE_C_FLAGS_RELEASE=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_C_FLAGS_RELWITHDEBINFO=-O3 -m64 -funroll-loops -g -rdynamic
CMAKE_EXPORT_COMPILE_COMMANDS=OFF
CMAKE_INSTALL_PREFIX=/usr/local
CMAKE_LINKER=/usr/bin/ld
CMAKE_MAKE_PROGRAM=/usr/bin/make
CMAKE_NM=/usr/bin/nm
CMAKE_OBJCOPY=/usr/bin/objcopy
CMAKE_OBJDUMP=/usr/bin/objdump
CMAKE_RANLIB=/usr/bin/ranlib
CMAKE_SKIP_INSTALL_RPATH=NO
CMAKE_SKIP_RPATH=NO
CMAKE_STRIP=/usr/bin/strip
CMAKE_VERBOSE_MAKEFILE=FALSE
COMPILE_CPU=ON
COMPILE_CUDA=ON
COMPILE_CUDA_SM35=ON
COMPILE_CUDA_SM50=ON
COMPILE_CUDA_SM60=ON
COMPILE_CUDA_SM70=ON
COMPILE_EXAMPLES=OFF
COMPILE_SERVER=on
COMPILE_TESTS=OFF
CUDA_64_BIT_DEVICE_CODE=ON
CUDA_ATTACH_VS_BUILD_RULE_TO_CUDA_FILE=ON
CUDA_BUILD_CUBIN=OFF
CUDA_BUILD_EMULATION=OFF
CUDA_CUDART_LIBRARY=/usr/local/cuda/lib64/libcudart.so
CUDA_CUDA_LIBRARY=CUDA_CUDA_LIBRARY-NOTFOUND
CUDA_HOST_COMPILATION_CPP=ON
CUDA_HOST_COMPILER=/usr/bin/cc
CUDA_NVCC_EXECUTABLE=/usr/local/cuda/bin/nvcc
CUDA_NVCC_FLAGS=-DCUDA_FOUND-DUSE_NCCL--default-streamper-thread-O3-g--use_fast_math-arch=sm_35-gencode=arch=compute_35,code=sm_35-gencode=arch=compute_50,code=sm_50-gencode=arch=compute_52,code=sm_52-gencode=arch=compute_60,code=sm_60-gencode=arch=compute_61,code=sm_61-gencode=arch=compute_70,code=sm_70-gencode=arch=compute_70,code=compute_70-ccbin/usr/bin/cc-std=c++11-Xcompiler -fPIC-Xcompiler -Wno-unused-result-Xcompiler -Wno-deprecated-Xcompiler -Wno-pragmas-Xcompiler -Wno-unused-value-Xcompiler -Werror-Xcompiler -msse2-Xcompiler -msse3-Xcompiler -msse4.1-Xcompiler -msse4.2-Xcompiler -mavx-Xcompiler -mavx2
CUDA_PROPAGATE_HOST_FLAGS=OFF
CUDA_SDK_ROOT_DIR=CUDA_SDK_ROOT_DIR-NOTFOUND
CUDA_SEPARABLE_COMPILATION=OFF
CUDA_TOOLKIT_INCLUDE=/usr/local/cuda/include
CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
CUDA_USE_STATIC_CUDA_RUNTIME=ON
CUDA_VERBOSE_BUILD=OFF
CUDA_VERSION=9.2
CUDA_cublas_LIBRARY=/usr/local/cuda/lib64/libcublas.so
CUDA_cublas_device_LIBRARY=/usr/local/cuda/lib64/libcublas_device.a
CUDA_cudadevrt_LIBRARY=/usr/local/cuda/lib64/libcudadevrt.a
CUDA_cudart_static_LIBRARY=/usr/local/cuda/lib64/libcudart_static.a
CUDA_cufft_LIBRARY=/usr/local/cuda/lib64/libcufft.so
CUDA_cupti_LIBRARY=/usr/local/cuda/extras/CUPTI/lib64/libcupti.so
CUDA_curand_LIBRARY=/usr/local/cuda/lib64/libcurand.so
CUDA_cusolver_LIBRARY=/usr/local/cuda/lib64/libcusolver.so
CUDA_cusparse_LIBRARY=/usr/local/cuda/lib64/libcusparse.so
CUDA_nppc_LIBRARY=/usr/local/cuda/lib64/libnppc.so
CUDA_nppial_LIBRARY=/usr/local/cuda/lib64/libnppial.so
CUDA_nppicc_LIBRARY=/usr/local/cuda/lib64/libnppicc.so
CUDA_nppicom_LIBRARY=/usr/local/cuda/lib64/libnppicom.so
CUDA_nppidei_LIBRARY=/usr/local/cuda/lib64/libnppidei.so
CUDA_nppif_LIBRARY=/usr/local/cuda/lib64/libnppif.so
CUDA_nppig_LIBRARY=/usr/local/cuda/lib64/libnppig.so
CUDA_nppim_LIBRARY=/usr/local/cuda/lib64/libnppim.so
CUDA_nppist_LIBRARY=/usr/local/cuda/lib64/libnppist.so
CUDA_nppisu_LIBRARY=/usr/local/cuda/lib64/libnppisu.so
CUDA_nppitc_LIBRARY=/usr/local/cuda/lib64/libnppitc.so
CUDA_npps_LIBRARY=/usr/local/cuda/lib64/libnpps.so
CUDA_rt_LIBRARY=/usr/lib/x86_64-linux-gnu/librt.so
GENERATE_MARIAN_INSTALL_TARGETS=OFF
GIT_EXECUTABLE=/usr/bin/git
INTEL_ROOT=/opt/intel
MKL_INCLUDE_DIR=MKL_INCLUDE_DIR-NOTFOUND
MKL_ROOT=MKL_ROOT-NOTFOUND
OPENSSL_CRYPTO_LIBRARY=/usr/lib/x86_64-linux-gnu/libcrypto.so
OPENSSL_INCLUDE_DIR=/usr/include
OPENSSL_LIBRARIES=/usr/lib/x86_64-linux-gnu/libssl.so/usr/lib/x86_64-linux-gnu/libcrypto.so
OPENSSL_SSL_LIBRARY=/usr/lib/x86_64-linux-gnu/libssl.so
PKG_CONFIG_EXECUTABLE=/usr/bin/pkg-config
SSE2_FOUND=true
SSE3_FOUND=true
SSE4_1_FOUND=true
SSE4_2_FOUND=true
SSSE3_FOUND=true
Tcmalloc_INCLUDE_DIR=/usr/include
Tcmalloc_LIBRARY=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so
USE_APPLE_ACCELERATE=OFF
USE_CCACHE=OFF
USE_CUDNN=OFF
USE_DOXYGEN=ON
USE_FBGEMM=OFF
USE_MKL=ON
USE_MPI=OFF
USE_NCCL=ON
USE_SENTENCEPIECE=on
USE_STATIC_LIBS=OFF

An identical setup but with a 1080Ti (and 455 drivers) does work, though.

srdecny commented 3 years ago

So I managed to resolve the error. In both cases (3060Ti and 1080Ti) I was using Marian built with CUDA 9.2, and it seems the 30xx series no longer supports CUDA 9. After upgrading to CUDA 11.3, the 3060Ti is now working.
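For anyone rebuilding outside Docker, the fix boils down to recompiling Marian against a CUDA 11.x toolkit, roughly like this (a sketch only; it assumes CUDA 11.3 is installed under /usr/local/cuda-11.3, so adjust the path to your system):

# from a fresh marian checkout
mkdir -p marian/build && cd marian/build
cmake .. -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.3 -DCOMPILE_SERVER=on -DUSE_SENTENCEPIECE=on
make -j$(nproc)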

For reference, here's the Dockerfile I used for the 3060Ti:

FROM nvidia/cuda:11.3.0-devel-ubuntu20.04
RUN apt-get update
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y git \
    cmake \
    build-essential \ 
    libboost-all-dev \
    libprotobuf17 \
    protobuf-compiler \
    libprotobuf-dev \
    openssl \
    libssl-dev \
    libgoogle-perftools-dev \
    wget

WORKDIR /
RUN git clone https://github.com/marian-nmt/marian
WORKDIR marian/build
RUN cmake -DCOMPILE_SERVER=on -DUSE_SENTENCEPIECE=on ..
RUN make -j32
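To build and run the image with GPU access (the tag name is just an example; --gpus all requires the NVIDIA container toolkit on the host):

docker build -t marian-cuda11 .
docker run --rm --gpus all -it marian-cuda11 /marian/build/marian --build-info all
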
jie-tu914 commented 2 years ago

> Seen on the brand-new 3090s.
>
> [2021-04-27 18:45:38] Error: Cublas Error: 13 - /home/heafield/marian-dev/src/tensors/gpu/prod.cpp:118: cublasGemmEx(handle, transa, transb, m, n, k, alpha, A, CUDA_R_32F, lda, B, CUDA_R_32F, ldb, beta, C, CUDA_R_32F, ldc, CUDA_R_32F, algorithm)
> [2021-04-27 18:45:38] Error: Aborted from static void marian::gpu::TypedGemm<float, float>::gemm(cublasHandle_t, marian::gpu::CudaCompute, cublasOperation_t, cublasOperation_t, int, int, int, const float*, const float*, int, const float*, int, const float*, float*, int) in /home/heafield/marian-dev/src/tensors/gpu/prod.cpp:118

Hi, I ran into the same error on 2080Tis with CUDA 10.1. Have you solved your problem, or did you rebuild Marian?