kimiyoung / transformer-xl

Apache License 2.0

CUBLAS_STATUS_EXECUTION_FAILED and Blas GEMM launch failed #139

Open CaoYiqingT opened 3 years ago

CaoYiqingT commented 3 years ago

I followed the required setup (TensorFlow 1.12 and Python 2.7), but the errors below are still raised. Could you help me? Several sources online suggest that CUBLAS_STATUS_EXECUTION_FAILED is raised when the TensorFlow version does not match the CUDA version. Could you please tell me the GPU type and CUDA version you used for training? Looking forward to your reply.

`2021-09-08 23:23:08.258752: E tensorflow/stream_executor/cuda/cuda_blas.cc:652] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "train_gpu.py", line 475, in <module>
    tf.app.run()
  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train_gpu.py", line 471, in main
    evaluate(n_token, cutoffs, "/gpu:0")
  File "train_gpu.py", line 446, in evaluate
    fetched = sess.run(fetches, feed_dict=feed_dict)
  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(31424, 1024), b.shape=(1024, 3072), m=31424, n=3072, k=1024
  [[node transformer/layer_0/rel_attn/qkv/Tensordot/MatMul (defined at /home/caoyq/transformer-xl-master/tf/model.py:54) = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/layer_0/rel_attn/qkv/Tensordot/Reshape, transformer/layer_0/rel_attn/qkv/kernel/read)]]

Caused by op u'transformer/layer_0/rel_attn/qkv/Tensordot/MatMul', defined at:
  File "train_gpu.py", line 475, in <module>
    tf.app.run()
  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train_gpu.py", line 471, in main
    evaluate(n_token, cutoffs, "/gpu:0")
  File "train_gpu.py", line 400, in evaluate
    mems=mems_i)
  File "train_gpu.py", line 218, in single_core_graph
    is_training=is_training)
  File "train_gpu.py", line 186, in model_fn
    proj_same_dim=FLAGS.proj_same_dim)

    return self.__call__(inputs, *args, **kwargs)

  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 374, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 757, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/keras/layers/core.py", line 963, in call
    outputs = standard_ops.tensordot(inputs, self.kernel, [[rank - 1], [0]])
  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 2985, in tensordot
    ab_matmul = matmul(a_reshape, b_reshape)
  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 2057, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4560, in mat_mul
    name=name)
  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/caoyq/anaconda3/envs/tensorflow_cp27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(31424, 1024), b.shape=(1024, 3072), m=31424, n=3072, k=1024
  [[node transformer/layer_0/rel_attn/qkv/Tensordot/MatMul (defined at /home/caoyq/transformer-xl-master/tf/model.py:54) = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/layer_0/rel_attn/qkv/Tensordot/Reshape, transformer/layer_0/rel_attn/qkv/kernel/read)]] `
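The suspected TensorFlow/CUDA mismatch can be checked with a short script. The following is only a minimal sketch, not something taken from this repository: the `TESTED_CUDA` table and the helper names `major_minor` and `cuda_matches` are hypothetical, and the table holds just two entries from TensorFlow's published tested build configurations (1.12 built against CUDA 9.0, 1.13 against CUDA 10.0). A version mismatch is only one possible cause of this error, so a passing check does not rule out other problems.

```python
from __future__ import print_function

# Hypothetical lookup table: a small subset of TensorFlow's published
# "tested build configurations" for the GPU packages.
TESTED_CUDA = {"1.12": "9.0", "1.13": "10.0"}


def major_minor(version):
    """Return the 'X.Y' prefix of a version string such as '1.12.0'."""
    return ".".join(version.split(".")[:2])


def cuda_matches(tf_version, cuda_version):
    """Compare an installed CUDA toolkit version with the table above.

    Returns True or False when the TensorFlow release is in the table,
    and None when the release is unknown to this sketch.
    """
    expected = TESTED_CUDA.get(major_minor(tf_version))
    if expected is None:
        return None
    return major_minor(cuda_version) == expected


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment")
    else:
        print("TensorFlow version:", tf.__version__)
        print("Built with CUDA:", tf.test.is_built_with_cuda())
```

For example, `cuda_matches("1.12.0", "9.0.176")` compares TensorFlow 1.12 with a CUDA 9.0 toolkit; the toolkit version string itself can be read from the output of `nvcc --version`.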

pyh314 commented 1 month ago

Hello, I have run into the same issue. Could you please tell me how to solve it? Thanks!