@fshriver hmm, sorry to hear about your setup. Thanks for putting together such a detailed report. Can you enable verbose output from TensorFlow and run this again? Also, what if you simply try using 1 GPU per trial instead of 6?
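For reference, a minimal sketch of the TensorFlow-side changes (illustrative only, not your attached script); the Tune side would just be `resources_per_trial={"cpu": 1, "gpu": 1}` in `tune.run`:

```python
import os
# Must be set before importing TensorFlow; "0" shows all C++-level log messages.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"

import numpy as np
import tensorflow as tf

tf.get_logger().setLevel("INFO")  # show Python-level TensorFlow log messages

# Tiny throwaway model purely to demonstrate the verbose fit call.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(64, 4).astype("float32")
y = np.random.rand(64, 1).astype("float32")
model.fit(x, y, epochs=1, verbose=1)  # verbose=1 prints per-epoch progress into the trial logs
```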
Sure, I've attached the output of the program with verbose = 1 as a text file here. As you can see, it's actually able to see and use the GPUs successfully for the first iteration... the issue seems to start after that. Perhaps some sort of lock on the resources that the underlying system/Ray doesn't like? I'm really not familiar enough with the Ray internals to say if that's the case, however.
Also, I'm using 6 GPUs purely because if I set the usage to 1 GPU I get the same log messages you see above, just repeated across 6 different processes. The 6-GPU requirement simply fills up the node so that only one error message is captured.
So I've worked with my cluster's support group, and one of them suggested NVIDIA's CUDA Multi-Process Service (MPS) is to blame; specifically, it isn't enabled on our compute nodes by default, which I didn't know about. Enabling it appears to make the above issue go away. If someone is looking back on this error message in the future, the problem is likely related to the CUDA Multi-Process Service; check there!
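For anyone checking this later, here is a quick, hypothetical helper (not from this thread) to see whether the MPS control daemon is running on the current node before launching Ray:

```python
import subprocess


def mps_daemon_running() -> bool:
    # pgrep exits with code 0 when at least one matching process is found.
    result = subprocess.run(
        ["pgrep", "-f", "nvidia-cuda-mps-control"],
        capture_output=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    print("CUDA MPS control daemon running:", mps_daemon_running())
```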
If there are no objections, I'll be closing this issue tomorrow since it's a non-issue.
What is the problem?
When running Ray Tune to try to optimize some hyperparameters, it is apparently able to train for one iteration (one set of epochs) using TensorFlow, but appears to choke on subsequent iterations. The following is an example of the logs that Ray/TensorFlow generate, from start to finish, from the example script (also below) that reproduces the issue:
As you can see, the main culprit appears to be:
which, if I run multiple trials at the same time, is an error message that gets repeated across all running processes before the task is aborted. I have confirmed in a separate test that the issue is not with TensorFlow itself: I am able to build a network similar to the one in the example script and train it for however many epochs I want just fine. And if I turn on verbose TensorFlow output in the example script shown below, I can see that TensorFlow does manage to finish one set of epochs via Ray - apparently the issue is introduced when Ray finishes its initial evaluation and tries to run the trial again.
Ray version: 0.8.1, TF version: 2.1.0, OS: RHEL 7.6, System: IBM POWER9 (ppc64le).
Worth noting: I can't update to the latest version of Ray or a different version of TF, since Ray currently isn't built for the POWER9 architecture and IBM doesn't want to fully support their ML libraries. It's a limitation of the system I'm working with.
Other environment information (output from `conda list`):
Reproduction (REQUIRED)
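Below is a minimal sketch in the spirit of the script described above (not the original attached reproduction; the synthetic data, layer sizes, and names are illustrative), using the Ray 0.8.x `Trainable` class API:

```python
import numpy as np
import tensorflow as tf
import ray
from ray import tune


class KerasTrainable(tune.Trainable):
    def _setup(self, config):
        # Small synthetic dataset so the sketch is self-contained.
        self.x = np.random.rand(256, 32).astype("float32")
        self.y = np.random.randint(0, 2, size=(256,))
        self.model = tf.keras.Sequential([
            tf.keras.layers.Dense(config.get("hidden", 64), activation="relu"),
            tf.keras.layers.Dense(2, activation="softmax"),
        ])
        self.model.compile(
            optimizer=tf.keras.optimizers.Adam(config.get("lr", 1e-3)),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"],
        )

    def _train(self):
        # One Tune "iteration" corresponds to one set of epochs, as described above;
        # the reported failure appears once the second iteration starts.
        history = self.model.fit(self.x, self.y, epochs=2, verbose=1)
        return {"mean_accuracy": history.history["accuracy"][-1]}


if __name__ == "__main__":
    ray.init()
    tune.run(
        KerasTrainable,
        stop={"training_iteration": 3},
        # Request all 6 GPUs on the node so only a single trial (and a single
        # error message) is produced, per the discussion above.
        resources_per_trial={"cpu": 1, "gpu": 6},
        config={"lr": tune.grid_search([1e-3, 1e-4])},
    )
```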