google / trax

Trax — Deep Learning with Clear Code and Speed

CUDNN_STATUS_INTERNAL_ERROR #1311

Closed: mtyrolski closed this issue 3 years ago

mtyrolski commented 3 years ago

Description

I am trying to train a model on the cluster and I constantly get an error as soon as training starts:

Failed to get convolution algorithm.

Convolution performance may be suboptimal.
2020-12-16 01:16:35.481299: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:349] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-12-16 01:16:35.481342: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:349] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-12-16 01:16:35.481377: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:349] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-12-16 01:16:35.481416: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:772] Failed to determine best cudnn convolution algorithm: Internal: All algorithms tried for convolution %custom-call.484 = (f32[3,512,512]{0,1,2}, u8[0]{0}) custom-call(f32[1,3072,512]{1,2,0} %add.822, f32[1,1024,512]{1,2,0} %add.16755), window={size=3 stride=3}, dim_labels=b0f_0io->b0f, custom_call_target="__cudnn$convBackwardFilter", metadata={op_type="conv_general_dilated" op_name="jit(single_device_update_fn)/conv_general_dilated[ batch_group_count=1\n                                                   dimension_numbers=ConvDimensionNumbers(lhs_spec=(2, 0, 1), rhs_spec=(2, 0, 1), out_spec=(1, 2, 0))\n                                                   feature_group_count=1\n                                                   lhs_dilation=(1,)\n                                                   lhs_shape=(1, 3072, 512)\n                                                   padding=((0, 0),)\n                                                   precision=None\n                                                   rhs_dilation=(3,)\n                                                   rhs_shape=(1, 1024, 512)\n                                                   window_strides=(1,) ]" source_file="/home/mtyrolski/vatican_trax_workspace/20201216_005356/venv/lib/python3.8/site-packages/trax/fastmath/jax.py" source_line=53}, backend_config="{\"algorithm\":\"0\",\"tensor_ops_enabled\":false,\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" failed. Falling back to default algorithm. 

I tried many of the solutions proposed in TensorFlow issues such as https://github.com/tensorflow/tensorflow/issues/24496, but unfortunately none of them helped. Important note: the issue occurs if and only if we use a convolution layer in our model.
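
For context, a minimal trax model along these lines is enough to reach the convolution code path. This is an illustrative sketch only; the layer sizes and input shape are hypothetical and are not the actual configuration from our gin file:

# Illustrative only: a tiny trax model containing a convolution, run on GPU.
# Shapes and layer sizes are hypothetical, not our real setup.
import numpy as np
import trax
from trax import layers as tl

model = tl.Serial(
    tl.Conv(32, (3, 3)),  # any Conv layer is enough to hit the cuDNN convolution path
    tl.Relu(),
)
x = np.zeros((1, 28, 28, 3), dtype=np.float32)  # NHWC input
model.init(trax.shapes.signature(x))
y = model(x)  # on our cluster, cuDNN handle creation fails at this point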

Environment information

We use the newest version of trax.

mesh-tensorflow==0.1.17
tensor2tensor==1.15.7
tensorboard==2.4.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.1
tensorflow-addons==0.11.2
tensorflow-datasets==4.1.0
tensorflow-estimator==2.3.0
tensorflow-gan==2.0.0
tensorflow-hub==0.10.0
tensorflow-metadata==0.25.0
tensorflow-probability==0.7.0
tensorflow-text==2.3.0
jax==0.2.5
jaxlib==0.1.57

CUDA 10.1.243
cuDNN 7.6.4
Python 3.8.2

Steps to reproduce:

...

TF_FORCE_GPU_ALLOW_GROWTH=true XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/lib/cuda pip3 install --upgrade jax jaxlib==0.1.57+cuda101 -f https://storage.googleapis.com/jax-releases/jax_releases.html
TF_FORCE_GPU_ALLOW_GROWTH=true XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/lib/cuda python3 -m trax.trainer --config_file=1.gin --output_dir=./
mtyrolski commented 3 years ago
export TF_FORCE_GPU_ALLOW_GROWTH=true
export LD_LIBRARY_PATH=/usr/local/cuda-11/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/lib/cuda/lib64:$LD_LIBRARY_PATH
XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/lib/cuda python3 -m trax.trainer

Setting the environment this way fixed the problem.
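
For anyone launching training from a Python script rather than the shell, a rough equivalent (assuming the same CUDA install path) is to set the variables before jax/trax are imported. LD_LIBRARY_PATH typically still has to be exported in the shell, since the dynamic linker reads it when the process starts:

import os

# Must be set before jax/trax (and therefore XLA) are imported.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"  # avoid grabbing all GPU memory up front
os.environ["XLA_FLAGS"] = "--xla_gpu_cuda_data_dir=/usr/lib/cuda"  # path is machine-specific

import trax  # noqa: E402  (imported only after the environment is set)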