Open ericl opened 3 years ago
Not sure if it is related to https://github.com/ray-project/ray/pull/12468. The previous serialization implementation for PyTorch caused torch to be imported whenever ray was imported. This can be slow when running Ray on the cloud for the first time, since loading torch for the first time incurs a lot of disk access.
It seems to be OK in a pure torch installation though, so might not be related.
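For context, the symptom described above is typical of a heavy import happening at module-import time rather than on first use. A minimal sketch of the lazy-import pattern that avoids this (illustrative only; this is not Ray's actual serializer code, and `LazyModule` is a hypothetical helper):

```python
import importlib


class LazyModule:
    """Defer importing a heavy module until an attribute is first accessed."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Only called for attributes not found normally, so the real
        # import is paid on first use, not at `import ray` time.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


# `math` stands in for torch here: nothing is imported at wrap time.
mathmod = LazyModule("math")
assert mathmod._module is None      # not imported yet
assert mathmod.sqrt(9) == 3.0       # import happens on this first access
assert mathmod._module is not None
```

With this pattern, a framework-specific serializer would only pull in torch when a torch object actually needs to be serialized.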
@ericl thanks for catching this. I might have experienced some of these symptoms as well. Could you provide a script (or a description of the steps) you used to work around this in a PyTorch-based setup? (ideally with TensorBoardX still functioning)
Also, do you have plans to solve this in master?
I think anyway it sounds reasonable to have:
pip install ray[torch]
vs
pip install ray[tensorflow]
as two different setups. I would expect each user to only use a single DL framework anyway.
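The split suggested above maps naturally onto setuptools extras. A hypothetical sketch (this is not Ray's actual setup.py, and the version pins are illustrative only):

```python
# Hypothetical extras declaration; in setup.py it would be passed as
#   setup(..., extras_require=extras)
# so users can `pip install ray[torch]` or `pip install ray[tensorflow]`.
extras = {
    "torch": ["torch>=1.4"],            # illustrative pin
    "tensorflow": ["tensorflow>=2.0"],  # illustrative pin
}

assert set(extras) == {"torch", "tensorflow"}
```

Neither framework would then be a hard dependency of the base `ray` package.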
@roireshef I resolved the issue by using the pytorch_p36 conda environment from the latest Amazon DLAMI. I'm assuming this is pretty similar to just installing torch with tensorboardX as you mention.
I'm not sure about the root cause, was not able to figure that out yet.
@ericl thanks. I'm using on-premise hardware so I don't have access to AWS software, but I managed to uninstall tensorflow and run my experiments. There were a few fixes lately, so I'm not sure whether uninstalling tensorflow specifically helped or whether it has to do with other fixes. Anyway, uninstalling TF is an easy quick fix.
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS): Ray nightly (1.0.2), tensorflow2_p36 DLAMI env, PyTorch 1.4-1.7
Reproduction (REQUIRED)
If you try to run a RLlib/Tune job like
rllib train --run=A3C --env=CartPole-v0 --num-samples=100 --config='{"num_workers": 16, "min_iter_time_s": 60}'
when both TF and torch are installed, bizarrely the entire machine grinds to a halt. The trials start incredibly slowly and things like `import ray` take minutes to run. Uninstalling torch fixes the issue (as does switching to the torch_p36 env and using the torch version of the algorithm).
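One way to confirm that the minutes-long `import ray` is import-time cost (rather than something in the job itself) is CPython's `-X importtime` flag, which logs per-module import times to stderr. A sketch, using `json` as a stand-in so it runs anywhere; in the affected environment, substitute `ray` and check whether torch/tensorflow dominate the cumulative column:

```shell
# Log every import's self/cumulative time (microseconds) to a file.
python -X importtime -c "import json" 2> importtime.log

# Show the slowest imports by cumulative time (second |-separated field).
sort -t '|' -k 2 -rn importtime.log | head -5
```

If torch or tensorflow appear near the top when importing ray, the slowdown is happening inside their import, which matches the disk-thrashing behavior described above.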
Submitting this issue just in case someone else runs into a similar problem.