google-research / long-range-arena

Long Range Arena for Benchmarking Efficient Transformers
Apache License 2.0

Problem training listops on GPU #30

Closed: renebidart closed this issue 3 years ago

renebidart commented 3 years ago

Hello, I'm running into an issue training models (performer, bigbird, longformer) on listops. Training works fine on CPU, but on GPU (either one or multiple V100 16GB cards) it crashes with the error below. This doesn't happen with any other dataset I've tried in the benchmark. The error is:

E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:113] Check failed: stream->parent()->GetBlasGemmAlgorithms(&algorithms)

The dataset was created with the included script:

PYTHONPATH="$(pwd)":"$PYTHON_PATH" python lra_benchmarks/data/listops.py --output_dir=lra_data/listops

Any idea what may be causing this?

jinfengr commented 3 years ago

There is some related discussion under https://github.com/tensorflow/tensorflow/issues/9489. Some people there found that it can be a GPU memory issue, so it may be worth checking whether the problem persists with a smaller version of the dataset.

renebidart commented 3 years ago

Thanks, using tf.config.experimental.set_visible_devices([], "GPU") fixed it.
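
For anyone hitting the same error, here is a minimal sketch of where that call might go. It rests on an assumption that fits the GPU memory suggestion above but is not confirmed in this thread: TensorFlow (used here for data loading) reserves GPU memory before the JAX/XLA model code can create its cuBLAS handle. The helper name hide_gpus_from_tensorflow is hypothetical and not part of the repository; the only TensorFlow call used is the one from the comment above.

```python
import importlib


def hide_gpus_from_tensorflow():
    """Hide all GPUs from TensorFlow so it cannot reserve GPU memory.

    Returns True if TensorFlow was found and configured, False if it is
    not installed. This must run before TensorFlow initializes its
    devices; otherwise set_visible_devices raises RuntimeError.
    """
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        # TensorFlow is not installed, so there is nothing to hide.
        return False
    # Same call as in the comment above: an empty device list for "GPU".
    tf.config.experimental.set_visible_devices([], "GPU")
    return True


if __name__ == "__main__":
    # Call this at the very top of the training script, before any
    # tf.data pipeline or model code touches the GPU.
    hidden = hide_gpus_from_tensorflow()
    print("TensorFlow GPUs hidden:", hidden)
```

TensorFlow still runs the input pipeline on CPU after this call; only its access to the GPUs is removed, which leaves the GPU memory to the model code.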