ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[tune][feature request] Kubernetes based trial executor #4414

Open gehring opened 5 years ago

gehring commented 5 years ago

Describe the problem

Tune is a very powerful hyper-parameter optimization framework but, currently, it involves an all-or-nothing commitment to ray for managing concurrency, scheduling, and fault tolerance. I believe tune's API already supports different backends through the trial executor abstraction.

Why?

I believe there would be strong interest in being able to use tune without having to fit experiments entirely within ray/tune, allowing quick transitions from running something by hand (without ray) to running large-scale hyper-parameter optimization. Lowering the commitment required to use tune would likely attract users who might otherwise have shied away in favor of some simpler solution, e.g., an in-house grid search. This would also position tune as a hyper-parameter optimization framework first and a ray sub-package second, rather than just one of ray's tools.

How?

I've been playing around with Kubernetes recently, and I think it is a good candidate for a new trial executor implementation. I'm not a Kubernetes expert by any means, but I'm happy to take a crack at it if there is interest in incorporating this into tune.
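
To make the idea concrete, here is a very rough sketch of what a Kubernetes-backed executor could look like. The `KubernetesTrialExecutor` name and the `start_trial`/`stop_trial` hooks are illustrative assumptions about the executor interface rather than tune's actual API; the Kubernetes calls use the official `kubernetes` Python client.

```python
# Illustrative sketch only: the executor interface (start_trial/stop_trial)
# mirrors what a tune trial executor might expose, but the exact base class
# and method signatures are assumptions, not the real tune API.
from kubernetes import client, config


class KubernetesTrialExecutor:
    """Launches each tune trial as a Kubernetes Job (hypothetical)."""

    def __init__(self, namespace="tune", image="my-trainable:latest"):
        config.load_kube_config()  # or load_incluster_config() inside a pod
        self.batch = client.BatchV1Api()
        self.namespace = namespace
        self.image = image

    def start_trial(self, trial):
        # One Job per trial; in a real implementation the trial config would
        # be passed via env vars or a mounted ConfigMap.
        container = client.V1Container(name="trial", image=self.image)
        pod_spec = client.V1PodSpec(containers=[container],
                                    restart_policy="Never")
        job = client.V1Job(
            metadata=client.V1ObjectMeta(name=f"trial-{trial.trial_id}"),
            spec=client.V1JobSpec(
                template=client.V1PodTemplateSpec(spec=pod_spec)))
        self.batch.create_namespaced_job(self.namespace, job)

    def stop_trial(self, trial):
        self.batch.delete_namespaced_job(
            name=f"trial-{trial.trial_id}", namespace=self.namespace)
```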

gehring commented 5 years ago

@richardliaw I've prioritized opening up this issue now so that we can make sure the global tracking solution you mentioned would be compatible with this idea.

I can think of two approaches:

  1. We force result logging into the Trainable API, wrapping any external communication inside _train() (sketched below). This is the simplest option for tune but might not yield a particularly elegant solution.
  2. The tracking API supports calls from outside of tune and outside of a ray cluster. This would be the most powerful and flexible solution but would require careful design of the tracking API. With the right API, different backends could be implemented to support different distributed frameworks, e.g., Kubernetes.
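
As a sketch of option (1), assuming the pre-1.0 Trainable API with _setup()/_train(), all external communication is folded into the Trainable subclass so tune only ever sees the dict returned from _train(); the `external_step` helper is a hypothetical stand-in for whatever out-of-band work the trial does.

```python
from ray import tune


def external_step(lr):
    # Hypothetical stand-in for work that happens outside ray/tune
    # (e.g. polling a job submitted to another system).
    return 1.0 / (1.0 + lr)


class WrappedTrainable(tune.Trainable):
    """Sketch of option (1): external communication is wrapped in _train()."""

    def _setup(self, config):
        self.lr = config["lr"]

    def _train(self):
        # All communication with the outside world happens here; tune only
        # sees the returned result dict.
        loss = external_step(self.lr)
        return {"mean_loss": loss}
```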

I am strongly in favor of approach (2). It would give users the most flexibility and allow for incremental buy-in to ray/tune. Being able to incrementally adopt parts of the library in an experiment is very attractive to outsiders. In a perfect world, transitioning from an external experiment to one completely wrapped by tune should require only minor changes to how tracking is initialized, without requiring any changes to the actual reporting calls.

I'm not sure how the tracking API is set up now, but I would strongly argue against a singleton design in favor of an OO writer pattern. This would make the tracker's internal state much more transparent while also opening up the possibility of checkpointing any necessary state from outside of tune.
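
To illustrate the writer-pattern idea (all names here are hypothetical, not an existing tune API): the tracker's state lives in an explicit object that the caller constructs, passes around, and can checkpoint, rather than in module-level globals.

```python
import json
import os


class TrackingWriter:
    """Hypothetical OO writer: tracking state is held by an explicit object
    rather than a module-level singleton, so it can be constructed per
    experiment, swapped for a different backend, and checkpointed from
    outside of tune."""

    def __init__(self, logdir):
        self.logdir = logdir
        self.step = 0
        os.makedirs(logdir, exist_ok=True)

    def log(self, metrics):
        self.step += 1
        with open(os.path.join(self.logdir, "results.jsonl"), "a") as f:
            f.write(json.dumps({"step": self.step, **metrics}) + "\n")

    def state_dict(self):
        # External code (e.g. a Kubernetes-side checkpointer) can persist this.
        return {"step": self.step}

    def load_state_dict(self, state):
        self.step = state["step"]


# Usage: the writer is explicit, so no ray/tune initialization is required.
writer = TrackingWriter("/tmp/exp1")
writer.log({"mean_loss": 0.42})
```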

Let me know what you think!

richardliaw commented 5 years ago

Nice! Let's move the tracking discussion onto #4423 and keep this issue specific to Kubernetes integration.

cc @robertnishihara do you have any thoughts?

gaocegege commented 4 years ago

If you are interested in running hyperparameter tuning experiments on Kubernetes, you could have a look at https://github.com/kubeflow/katib/, which is a Kubernetes-native system for AutoML workloads.

However, we do not provide a Python API the way ray tune does; Katib uses a YAML config to define the search space and trial code. Please have a look at our examples: https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/bayesianoptimization-example.yaml

I am not very familiar with ray tune yet; I have just looked through the code base and docs. That said, I think it may be better to run ray on Kubernetes and use ray-tune on top of ray, instead of running ray-tune directly on Kubernetes.

Since ray is a powerful framework for distributed computing, I think one of the biggest selling points of ray tune is its ray-based implementation. If ray clusters on Kubernetes are well supported, this feature can be added painlessly and gracefully.
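
For reference, once a Ray cluster is running on Kubernetes, using tune on top of it is just a matter of connecting to the existing cluster before calling tune.run. A minimal sketch, assuming a Ray 1.x-style function API; the address and the objective function below are placeholders, and the exact reporting call depends on the tune version.

```python
import ray
from ray import tune


def objective(config):
    # Placeholder objective; a real trial would train a model here.
    loss = (config["lr"] - 0.1) ** 2
    tune.report(mean_loss=loss)


# Connect to the Ray cluster already running on Kubernetes (e.g. deployed
# with the Ray autoscaler or operator) instead of starting a local one.
ray.init(address="auto")

tune.run(objective, config={"lr": tune.grid_search([0.01, 0.1, 1.0])})
```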