Closed: ckyuto closed this issue 1 year ago
I ran this script with the latest ray to see whether that fixed the problem, and it works for me.
The following setup (upgraded ray to nightly, as well as newer tf and horovod versions) worked:
ray==nightly
tensorflow==2.9.0
horovod==0.26.1
Python 3.7.13
cuda 11
Single node with 16 CPUs, 1 GPU, 64gb memory.
I ran the script with only a few modifications:
python horovod_debug.py --is_local_run --download_data --epochs=10 --use_gpu
Now I'm trying to replicate your env setup more exactly to make sure that I am able to reproduce the original error.
@ckyuto Could you check whether upgrading to the above versions works for you as well?
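To compare an environment against the versions listed above, a small script can print what is actually installed. This is only a sketch: the helper name `installed_versions` is made up for illustration, and on Python 3.7 it assumes the `importlib_metadata` backport is available, since `importlib.metadata` only ships with Python 3.8+.

```python
# Sketch: report installed versions of the packages discussed in this thread.
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:
    # Assumption: on Python 3.7 the importlib_metadata backport is installed.
    from importlib_metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(["ray", "tensorflow", "horovod"]).items():
        print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Running this in both the working and the failing environment makes any version mismatch easy to spot.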
Hi @ckyuto, as we discussed in the Slack channel, I reran your code with the same environment settings you mentioned but still cannot reproduce the error on my side.
python==2.7
tensorflow==2.4.0
horovod==0.23.0
ray==2.1
cuda version: 11.0
GPU version: V100
So I think this is not a bug in Ray. I searched for the error message "Blas GEMM launch failed" and found that most answers can be categorized into the following 3 reasons:
You can try out these solutions; I hope they help!
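Separately from the solutions mentioned above, one cheap check for this class of error is whether the GPU's memory is already occupied before training starts. The sketch below is an assumption-laden illustration, not a diagnosis of this issue: it assumes `nvidia-smi` is on the PATH, and the helper names (`parse_gpu_memory`, `busy_gpus`, `query_gpu_memory`) and the 50% threshold are hypothetical choices.

```python
# Sketch: list GPUs whose memory is already heavily used, based on nvidia-smi.
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse rows like '0, 1234, 16160' into (index, used_mib, total_mib)."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def busy_gpus(rows, threshold=0.5):
    """Return indices of GPUs using more than `threshold` of their memory."""
    return [index for index, used, total in rows if total and used / total > threshold]


def query_gpu_memory():
    """Run nvidia-smi and return the parsed rows."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    try:
        print("GPUs already in heavy use:", busy_gpus(query_gpu_memory()))
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print("Could not query nvidia-smi:", exc)
```

If a GPU shows up as busy before the Horovod workers start, another process may be holding its memory.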
I'll close this issue for now.
What happened + What you expected to happen
Hi team,
Recently I tried to upgrade our Ray Tune version from 1.2 to 2.1. However, I found that running Horovod with GPUs on a Ray cluster causes an error like the one below:
Is there any change that could cause this issue?
Versions / Dependencies
Ray: 2.1.0
CUDA: 11
Tensorflow: 2.4.0.39
Horovod: 0.23.0.7
Python: 3.7.10
Reproduction script
This is my code; you can run it with:
python -m 'linkedin.tensorflow.mnist.mnist_horovod_raytune' --download_data --use_gpu
Issue Severity
High: It blocks me from completing my task.