kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Add hyperparameter tuning? #112

Closed: jlewi closed this issue 6 years ago

jlewi commented 6 years ago

Opening this issue to see if there is any interest in adding capabilities to manage hyperparameter tuning.

jimexist commented 6 years ago

support grid search to begin with?
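For concreteness, grid search is just an exhaustive sweep over the Cartesian product of candidate values. A minimal sketch, assuming a hypothetical `train_and_evaluate` function that runs one trial and returns a validation metric (the parameter names and values below are purely illustrative):

```python
import itertools

# Hypothetical search space; every parameter name and value is illustrative.
search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
}

def grid_search(space, train_and_evaluate):
    """Try every combination in the space and return the best one."""
    best_params, best_score = None, float("-inf")
    names = list(space)
    for values in itertools.product(*(space[n] for n in names)):
        params = dict(zip(names, values))
        score = train_and_evaluate(**params)  # one independent trial
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```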

wbuchwalter commented 6 years ago

The more I think about it, the more I think this should not be handled by TfJob but by a higher abstraction, such as the dashboard and other tools that will ultimately interact with TfJob. Users might want to use different strategies that may be difficult to express in YAML but easier in a frontend. Would there be any optimization possible if we handled HP tuning directly at the TfJob level versus higher up?

jlewi commented 6 years ago

I don't think HP Tuning should be handled by TfJob. I think a hyperparameter tuning system should be implemented as a set of loosely coupled components. Here are some components I see

I think we can define appropriate APIs and interfaces for each component so that particular components aren't tied to the implementation details of other components. For example, I don't think there's any reason why doing a GridSearch should care whether we are training a deep net using TF or a decision tree using an XGBoost model.

I don't necessarily want to focus on the HP tuning algorithms themselves; that's beyond my expertise. I'd rather focus on building the infrastructure that makes it easy to plug in new algorithms.
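A rough sketch of the kind of decoupling described above; all class and method names here are hypothetical illustrations, not an actual Kubeflow API:

```python
import abc
import random

class Trial:
    """One parameter assignment plus, once training finishes, its metric."""
    def __init__(self, params):
        self.params = params   # e.g. {"learning_rate": 0.01}
        self.metric = None     # filled in after the training job completes

class SuggestionAlgorithm(abc.ABC):
    """Algorithm plugin: knows nothing about TF, XGBoost, or Kubernetes."""

    @abc.abstractmethod
    def suggest(self, completed_trials, num_suggestions):
        """Propose the next parameter assignments to try."""

class RandomSearch(SuggestionAlgorithm):
    """Example plugin; grid search or Bayesian optimization would slot in
    behind the same interface without the trainer ever knowing."""
    def __init__(self, space, seed=0):
        self._space = space                 # {name: list of candidate values}
        self._rng = random.Random(seed)

    def suggest(self, completed_trials, num_suggestions):
        return [
            Trial({n: self._rng.choice(v) for n, v in self._space.items()})
            for _ in range(num_suggestions)
        ]
```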

jimexist commented 6 years ago

From the API's point of view, doing HP tuning is no different from firing up a bunch of parallelizable TfJobs (and of course the API won't care about their relationships, just that they are independent).
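In that spirit, a sweep can be expanded into a set of independent TfJob manifests up front. A sketch, where the image name and parameter space are made up, and the CRD `apiVersion` and `tfReplicaSpecs` layout are assumptions that depend on the installed operator version:

```python
import itertools

def tfjob_manifest(name, image, params):
    """Build one independent TFJob manifest. The apiVersion and the
    tfReplicaSpecs layout are assumptions; check your installed operator."""
    args = [f"--{k}={v}" for k, v in params.items()]
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": 1,
                    "template": {
                        "spec": {
                            "containers": [
                                {"name": "tensorflow", "image": image, "args": args}
                            ],
                            "restartPolicy": "Never",
                        }
                    },
                }
            }
        },
    }

space = {"learning_rate": [0.001, 0.01], "batch_size": [32, 64]}
combos = [dict(zip(space, v)) for v in itertools.product(*space.values())]
manifests = [
    tfjob_manifest(f"hp-trial-{i}", "my-train-image:latest", c)
    for i, c in enumerate(combos)
]
# Each manifest can now be submitted to the cluster independently,
# e.g. with `kubectl apply` or the Kubernetes Python client.
```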

bhack commented 6 years ago

Is this arriving in production? https://deepmind.com/blog/population-based-training-neural-networks/

ddutta commented 6 years ago

Would love to see hyper-param tuning added to Kubeflow ... maybe we don't need something as sophisticated as https://research.google.com/pubs/pub46180.html, but having something simple for starters might be good

gaocegege commented 6 years ago

@ddutta

Google Vizier is a good reference for us since it is applied at large scale. And there is an open source implementation in Jupyter Notebook: https://github.com/tobegit3hub/advisor; we could try it out to see whether it is what we want to implement.
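The core Vizier pattern is a small loop: create a study, repeatedly ask for suggestions, and report measurements back. A conceptual sketch, in which the `client` object and every method on it are hypothetical stand-ins, not advisor's actual API:

```python
def optimize(client, study_name, search_space, objective, budget=20):
    """Vizier-style study loop. `client` is a hypothetical suggestion
    service; `objective` runs one training job and returns its metric."""
    client.create_study(study_name, search_space, goal="MAXIMIZE")
    for _ in range(budget):
        trial = client.get_suggestion(study_name)   # ask the service
        metric = objective(**trial.params)          # run one training trial
        client.report_measurement(study_name, trial, metric)
    return client.best_trial(study_name)
```

The point of this shape is that the suggestion service never touches training code; any framework can sit behind `objective`.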

ddutta commented 6 years ago

@gaocegege Thx. We will try it out. We have a version which we could contribute/merge too.

YujiOshima commented 6 years ago

Hi @gaocegege @ddutta @Jimexist @wbuchwalter @bhack @jlewi. I have also been interested in parameter tuning systems, and I'm developing a Vizier clone that integrates with Kubernetes; the tuning system itself also runs on Kubernetes. It's mostly functional, and I use it internally for our team. It supports the grid, random, and Hyperband search algorithms. I hope to collaborate with the Kubeflow community!

The code is here. I would be very happy to get your comments and feedback.
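For reference, Hyperband (Li et al., 2016) layers successive-halving brackets over random sampling. A compact sketch, where `sample_params` and `run_trial` are hypothetical callbacks supplied by the caller:

```python
import math

def hyperband(sample_params, run_trial, max_resource=81, eta=3):
    """Hyperband sketch (Li et al., 2016). `sample_params()` draws one
    random configuration; `run_trial(params, resource)` trains with that
    budget (e.g. epochs) and returns a loss to minimize. Both callbacks
    are assumptions supplied by the caller."""
    s_max = int(math.log(max_resource, eta) + 1e-9)  # number of brackets - 1
    budget = (s_max + 1) * max_resource
    best = (float("inf"), None)                      # (loss, params)
    for s in range(s_max, -1, -1):                   # one bracket per s
        n = int(math.ceil(budget / max_resource * eta ** s / (s + 1)))
        r = max_resource * eta ** (-s)
        configs = [sample_params() for _ in range(n)]
        for i in range(s + 1):                       # successive halving
            n_i = int(n * eta ** (-i))
            r_i = r * eta ** i
            ranked = sorted(
                ((run_trial(c, r_i), c) for c in configs),
                key=lambda pair: pair[0],
            )
            if ranked and ranked[0][0] < best[0]:
                best = ranked[0]
            configs = [c for _, c in ranked[: max(1, n_i // eta)]]
    return best
```

Each `run_trial` call maps naturally onto one independent training job on the cluster, which is what makes the earlier point about independent TfJobs apply.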

gaocegege commented 6 years ago

@YujiOshima

Thanks for the information! I will take a look.

And we created a hyperparameter-tuning channel in Slack: https://kubeflow.slack.com/messages/C9ZLKR73L/. Comments welcome :tada:

/cc @DjangoPeng @ddysher

ddutta commented 6 years ago

This is the client library (to show the APIs) for the tool we built internally: https://github.com/CiscoAI/hyper-advisor-client.git

gaocegege commented 6 years ago

I am closing the issue since we have a new repo for hyperparameter tuning: https://github.com/kubeflow/hp-tuning

Thank you all :tada: