gehring opened this issue 5 years ago
@richardliaw I've prioritized opening up this issue now so that we can make sure the global tracking solution you mentioned would be compatible with this idea.
I can think of two approaches:

1. Wrapping the experiment code in `_train()`. This is the simplest for tune but might not give a particularly elegant solution.
2. Exposing the tracking API as a standalone component that can be used from code that is not managed by tune.

I am strongly in favor of the approach in (2). It would give the most flexibility to users and allow for an incremental buy-in into ray/tune. Being able to incrementally add parts of the library to your experiment is a very attractive thing for outsiders. In a perfect world, transitioning from an external experiment to one completely wrapped by tune should require only minor changes to how tracking is initialized, while not requiring any changes to the actual reporting calls.
I'm not sure how the tracking API is set up now, but I would strongly argue against a singleton design in favor of an OO writer pattern. This would make the tracking API's internal state much more transparent while also opening up the possibility of checkpointing any necessary state from outside of tune.
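For illustration, here is a minimal sketch of what an OO writer-style tracker could look like. The `Tracker` class and its methods are hypothetical, not tune's actual API; the point is only that the instance owns its state, so it can be inspected and checkpointed from outside of tune:

```python
# Hypothetical sketch of an OO writer-style tracking API (not tune's actual API).
class Tracker:
    def __init__(self, logdir):
        self.logdir = logdir
        self.step = 0  # explicit internal state, easy to inspect or checkpoint

    def log(self, **metrics):
        # In a real implementation this would write to logdir, a queue,
        # or a remote store; here we just print.
        self.step += 1
        print(f"[step {self.step}] {metrics}")

    def state_dict(self):
        # State can be checkpointed/restored without tune's involvement.
        return {"step": self.step}

    def load_state_dict(self, state):
        self.step = state["step"]


# Usage outside of tune: the experiment owns the tracker instance...
tracker = Tracker(logdir="./results")
tracker.log(loss=0.3, accuracy=0.9)
# ...and when the experiment is later wrapped by tune, only the way the
# tracker is constructed needs to change, not the tracker.log(...) calls.
```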
Let me know what you think!
Nice! Let's move the tracking discussion onto #4423 and keep this issue specific to Kubernetes integration.
cc @robertnishihara do you have any thoughts?
If you are interested in running hyperparameter tuning experiments on Kubernetes, you could have a look at https://github.com/kubeflow/katib/, which is a Kubernetes-native system for AutoML workloads.
However, we do not provide a Python API the way Ray Tune does; Katib requires a YAML config to define the search space and the trial code. Please have a look at our examples: https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/bayesianoptimization-example.yaml
I am not yet very familiar with Ray Tune; I have just looked through the code base and docs. That said, I think it may be better to run Ray on Kubernetes and use Ray Tune on top of Ray, rather than running Ray Tune directly on Kubernetes.
Since Ray is a powerful framework for distributed computing, I think one of Ray Tune's biggest selling points is its Ray-based implementation. If Ray clusters are well supported on Kubernetes, this feature can be added painlessly and gracefully.
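For example, once a Ray cluster is running on Kubernetes, a tune experiment only needs to connect to it rather than manage any Kubernetes resources itself. This is a rough sketch: the training function and search space are placeholders, and the exact search-space and reporting helpers vary across Ray versions:

```python
import ray
from ray import tune

# Connect to an existing Ray cluster (for example one deployed on Kubernetes)
# instead of starting a new local Ray instance.
ray.init(address="auto")


def objective(config):
    # Placeholder training function; report a dummy score back to tune.
    tune.report(score=(config["lr"] - 0.1) ** 2)


tune.run(
    objective,
    config={"lr": tune.uniform(0.001, 1.0)},
    num_samples=20,
)
```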
Describe the problem
Tune is a very powerful hyper-parameter framework, but it currently requires an all-or-nothing commitment to ray for managing concurrency, scheduling, and fault tolerance. I believe tune's API already supports different backends through the trial executor abstraction.
Why?
I believe there would be strong interest in being able to use tune without having to fit experiments within ray/tune, allowing quick transitions from running something by hand (without ray) to running large-scale hyper-parameter optimization. Lowering the commitment required to use tune will likely attract users who might otherwise have shied away in favor of a simpler solution, e.g., an in-house grid search. This would also position tune as a hyper-parameter optimization framework first and a ray sub-package second, rather than just one of ray's tools.
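As a concrete illustration of that quick transition, a hand-rolled grid search maps almost directly onto a tune call. This is only a sketch; `train_fn` is a placeholder for the user's existing training code:

```python
from ray import tune


def train_fn(config):
    # Placeholder for the user's existing training code.
    tune.report(objective=config["lr"] * config["batch_size"])


# An in-house grid search...
#   for lr in [0.01, 0.1]:
#       for batch_size in [32, 64]:
#           train_fn({"lr": lr, "batch_size": batch_size})
#
# ...becomes a single tune.run call over the same search space.
tune.run(
    train_fn,
    config={
        "lr": tune.grid_search([0.01, 0.1]),
        "batch_size": tune.grid_search([32, 64]),
    },
)
```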
How?
I've been playing around with Kubernetes recently, and I think it is a good candidate for a new trial executor implementation. I'm not a Kubernetes expert by any means, but I'm happy to take a crack at it if there is interest in incorporating this into tune.
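To make the idea concrete, a Kubernetes-backed executor would roughly amount to launching each trial as a Kubernetes Job via the official Python client. The sketch below only shows that launch step; `launch_trial` is a hypothetical helper, not tune's actual trial executor interface:

```python
from kubernetes import client, config


def launch_trial(trial_id, image, command):
    """Hypothetical helper: run a single trial as a Kubernetes Job."""
    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=f"tune-trial-{trial_id}"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(
                            name="trial",
                            image=image,
                            command=command,
                        )
                    ],
                )
            )
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)


# A full trial executor would also need to poll job status, collect results
# (e.g., via the tracking API discussed above), and clean up finished jobs.
launch_trial("0", image="my-experiment:latest", command=["python", "train.py"])
```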