The purpose of this example is to showcase the benefits of the Kubeflow infrastructure in training a reinforcement learning agent.
Core tasks:
[ ] Describe a case study that communicates the business value of the example: who will care about this example, and why?
[x] Illustrate the config, submit, monitor, render workflow for single-node training
[ ] Prow test verifies model trains in notebook container
[ ] Illustration of a practice for building and pushing containers efficiently
[ ] Distributed training with TFJob operator (e.g. using @danijar's idea)
[ ] Illustration of simple hyperparameter tuning
[ ] Uses accelerators
Optionally:
[ ] Build a custom gym environment that captures a business problem of interest, e.g. reinforcement learning in the context of datacenter cooling, scheduling, hyperparameter tuning, etc.
[ ] Deploy the agent and custom environment, e.g. if the environment concerns Kubernetes scheduling, use it to schedule resources on a cluster and measure whether there was a benefit
/cc @nkashy1 @danijar @aronchick @jlewi