mtyrolski closed this issue 3 years ago
export TF_FORCE_GPU_ALLOW_GROWTH=true
export LD_LIBRARY_PATH=/usr/local/cuda-11/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/lib/cuda/lib64:$LD_LIBRARY_PATH
XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/lib/cuda python3 -m trax.trainer
Setting the environment variables above and launching the trainer with that XLA_FLAGS value fixed the problem.
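Before relying on these exports on another machine, it may help to confirm that the CUDA directory actually has the layout the commands assume. The sketch below is only an illustration: `check_cuda_dir` is a hypothetical helper (not part of Trax or XLA), and it assumes that XLA looks for libdevice under `nvvm/libdevice` inside the directory passed to `--xla_gpu_cuda_data_dir`, and that the CUDA shared libraries live in `lib64`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sanity-check a CUDA directory before pointing XLA at it.
# Assumption: XLA searches <cuda_dir>/nvvm/libdevice for libdevice files,
# and the shared libraries added to LD_LIBRARY_PATH live in <cuda_dir>/lib64.
check_cuda_dir() {
  local cuda_dir="$1"
  if [ ! -d "$cuda_dir/nvvm/libdevice" ]; then
    echo "missing: $cuda_dir/nvvm/libdevice" >&2
    return 1
  fi
  if [ ! -d "$cuda_dir/lib64" ]; then
    echo "missing: $cuda_dir/lib64" >&2
    return 1
  fi
  echo "ok: $cuda_dir"
}

# Possible usage, with the path taken from the commands above:
#   check_cuda_dir /usr/lib/cuda && \
#     XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/lib/cuda python3 -m trax.trainer
```

If the check fails, the path in the exports probably needs adjusting for the local CUDA installation. The exact location varies between systems, so treat `/usr/lib/cuda` as specific to this cluster.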
Description
I am trying to train a model on a cluster and consistently get an error as soon as training starts:
I tried many of the solutions proposed in TensorFlow issues such as https://github.com/tensorflow/tensorflow/issues/24496, but unfortunately none of them helped. Important note: the error occurs if and only if the model contains a convolution layer.
Environment information
We use the latest version of Trax.
Steps to reproduce:
...