ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray grinds to a halt if both PyTorch and TensorFlow are installed #12467

Open ericl opened 3 years ago

ericl commented 3 years ago

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS): Ray nightly (1.0.2), tensorflow2_p36 DLAMI env, pytorch 1.4-1.7

Reproduction (REQUIRED)

If you try to run an RLlib/Tune job like rllib train --run=A3C --env=CartPole-v0 --num-samples=100 --config='{"num_workers": 16, "min_iter_time_s": 60}' when both TF and torch are installed, bizarrely the entire machine will grind to a halt. The trials start incredibly slowly and things like "import ray" take minutes to run.

Uninstalling torch fixes the issue (or switching to torch_p36 env and using the torch version of the algo).

Submitting this issue just in case someone else runs into a similar problem.
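One way to quantify the "import ray takes minutes" symptom is to time module imports directly. A minimal sketch (the module names to check are placeholders; on an affected machine you would substitute ray, torch, and tensorflow):

```python
import importlib
import time

def time_import(module_name):
    """Measure wall-clock time to import a module.

    Note: if the module is already cached in sys.modules, this
    measures the (fast) cache lookup, so run it in a fresh process.
    """
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

# Substitute "ray", "torch", "tensorflow" locally; "json" is just a
# stand-in so the sketch runs anywhere.
for name in ("json",):
    print(f"import {name}: {time_import(name):.3f}s")
```

Comparing the numbers with and without torch installed would show whether the slowdown happens at import time or later, during trial startup.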

suquark commented 3 years ago

Not sure if it is related to https://github.com/ray-project/ray/pull/12468. The previous serialization implementation for PyTorch caused torch to be imported whenever ray was imported. This could be slow when running Ray on the cloud for the first time, since loading torch for the first time triggers a lot of disk access.
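The fix described above amounts to deferring the framework import until it is actually needed. A lazy-import sketch of that pattern (an assumed illustration, not the actual Ray serializer code):

```python
def register_torch_serializers():
    """Register torch-specific serializers only if torch is importable.

    Deferring `import torch` to the point of first use means plain
    `import ray` never pays torch's import cost (or its first-time
    disk access on a cold cloud instance).
    """
    try:
        import torch  # deferred: only attempted when serializers are needed
    except ImportError:
        return False  # torch not installed; skip registration silently
    # ... register custom serializers for torch.Tensor here ...
    return True
```

With this structure, users without torch installed (or who never serialize tensors) take the cheap early-return path instead of a heavy import at startup.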

ericl commented 3 years ago

It seems to be OK in a pure torch installation though, so might not be related.

roireshef commented 3 years ago

@ericl thanks for catching this. I might have also experienced some of these symptoms. Could you provide a script (or a description of the steps) you ran to overcome this in a PyTorch-based setup? (Ideally TensorBoardX should still function well.)

Also, do you have plans to solve this in master? Either way, I think it sounds reasonable to have pip install ray[torch] and pip install ray[tensorflow] as two different setups. I would expect each user to use only a single DL framework anyway.
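The split-install idea above maps naturally onto setuptools "extras". A hypothetical extras_require mapping (illustrative version pins, not Ray's actual packaging):

```python
# Sketch of the extras a setup.py could declare so that users opt in
# to exactly one DL framework. The version bounds are hypothetical.
extras_require = {
    "torch": ["torch>=1.4"],
    "tensorflow": ["tensorflow>=2.0"],
}

# A user would then install only the framework they use:
#   pip install "ray[torch]"
#   pip install "ray[tensorflow]"
for extra, deps in extras_require.items():
    print(f"ray[{extra}] -> {deps}")
```

Since pip extras are additive, this would not by itself prevent both frameworks from ending up in one environment, but it makes the single-framework path the documented default.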

ericl commented 3 years ago

@roireshef I resolved the issue by using the pytorch_p36 conda environment from the latest Amazon DLAMI. I'm assuming this is pretty similar to just installing torch with tensorboardX as you mention.

I'm not sure about the root cause; I was not able to figure that out yet.

roireshef commented 3 years ago

@ericl thanks. I'm on on-premise hardware so I don't have access to AWS software, but I managed to uninstall tensorflow and run my experiments. There have been a few fixes lately, so I'm not sure whether uninstalling tensorflow specifically helped or whether this has to do with other fixes. Either way, uninstalling TF is an easy quick fix.
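Before uninstalling anything, it can help to confirm which frameworks are actually present. importlib.util.find_spec checks for a package without importing it, so it stays fast even when the real import would hang (a small diagnostic sketch):

```python
import importlib.util

def framework_installed(name):
    """Check whether a package is installed WITHOUT importing it.

    find_spec only reads package metadata from disk, so it avoids
    triggering the slow import path this issue describes.
    """
    return importlib.util.find_spec(name) is not None

for fw in ("torch", "tensorflow"):
    status = "installed" if framework_installed(fw) else "absent"
    print(f"{fw}: {status}")
```

If both report "installed" in the environment where Ray stalls, that matches the conditions in this issue.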