kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

[Proposal] Support ClusterSpec Propagation Feature in TF 1.14 #1141

Closed zhujl1991 closed 4 years ago

zhujl1991 commented 4 years ago

Goals

Since TensorFlow 1.14, TensorFlow has supported the ClusterSpec Propagation feature, which "allows TensorFlow workers to be booted independently of each other, and with no knowledge about others". This essentially allows us to add/remove workers on the fly. Specifically, it makes two features possible:

  1. Worker Failover: If a worker fails (e.g., OOM) or is evicted (e.g., not enough resources), the training continues. Later, once the failed worker restarts, it can rejoin the training job dynamically without interrupting the training process.
  2. Scale Workers Up/Down: During training, we can dynamically add or remove workers on the fly as needed. This is particularly helpful for online learning -- use more workers during peak time and fewer during off-peak time.
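As a rough illustration of what the sparse form could look like for a single, independently booted worker: in the sparse cluster spec, the worker job maps task indices to addresses instead of listing every worker, so each worker only needs its own address. The `sparseTFConfig` helper and the addresses below are hypothetical, not actual tf-operator code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sparseTFConfig builds a TF_CONFIG value in the sparse ClusterSpec form:
// the "worker" job maps a task index to an address, so this worker only
// needs to know its own address rather than the full, dense worker list.
func sparseTFConfig(index int, addr string, psAddrs []string) string {
	cfg := map[string]interface{}{
		"cluster": map[string]interface{}{
			// Sparse form: index -> address map instead of a dense list.
			"worker": map[string]string{fmt.Sprintf("%d", index): addr},
			"ps":     psAddrs,
		},
		"task": map[string]interface{}{"type": "worker", "index": index},
	}
	b, _ := json.Marshal(cfg)
	return string(b)
}

func main() {
	// Worker 3 boots knowing only its own address and the PS addresses.
	fmt.Println(sparseTFConfig(3, "tfjob-worker-3:2222", []string{"tfjob-ps-0:2222"}))
}
```

Because the spec is sparse, a restarted or newly added worker can construct its own TF_CONFIG without coordinating with the other workers first.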

The goal of this proposal is to allow tf-operator to support this.
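A TFJob manifest using the proposed flag might look roughly like the sketch below. The field name `allowDynamicWorker`, its placement, and the image are illustrative only; this proposal does not fix the final API shape:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: dynamic-workers-example
spec:
  # Proposed field; name and placement are illustrative, not final.
  allowDynamicWorker: true
  tfReplicaSpecs:
    Worker:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-training-image:latest   # placeholder image
```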

Current Issues

To support these new features, a few issues need to be solved:

  1. Support sparse ClusterSpec mentioned here.
  2. Support manually scaling up/down workers.
  3. Status update logic needs to be changed (e.g., failed workers are not supposed to result in training failure).

Implementation Details

The work can be divided into the following tasks:

  1. In TFJobSpec, add a boolean field AllowDynamicWorker.
  2. When AllowDynamicWorker == true, reconcile TFJobs on every reconciliation cycle. The change needs to be made here.
  3. When AllowDynamicWorker == true, use the sparse form of TF_CONFIG here.
  4. Handle the cases where the worker index is larger than replicas here. When AllowDynamicWorker == true, implement the scale-down logic, i.e., remove workers starting from the one with the largest index until the number of workers equals replicas, here. The same change needs to be made for services here and here.
  5. Change the status update logic here.
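The scale-down rule in item 4 can be sketched as a small helper: delete workers from the largest index downward until the count matches replicas. The `podsToDelete` function and its signature are illustrative, not the actual tf-operator implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// podsToDelete returns the worker indices to remove when scaling down:
// remove from the largest index downward until only `replicas` remain.
func podsToDelete(indices []int, replicas int) []int {
	// Sort a copy in descending order so the highest indices go first.
	sorted := append([]int(nil), indices...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))

	var out []int
	for _, idx := range sorted {
		if len(indices)-len(out) <= replicas {
			break // already at (or below) the desired replica count
		}
		out = append(out, idx)
	}
	return out
}

func main() {
	// 5 workers running, scaling down to 3: indices 4 and 3 are removed.
	fmt.Println(podsToDelete([]int{0, 1, 2, 3, 4}, 3))
}
```

Deleting from the highest index keeps the remaining worker indices contiguous from 0, which matches the dense part of the cluster spec after the scale-down.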

gaocegege commented 4 years ago

LGTM. It is a helpful feature.

/cc @richardsliu @johnugeorge

zhujl1991 commented 4 years ago

@gaocegege The first PR is here: https://github.com/kubeflow/tf-operator/pull/1142 . It looks like I'm not allowed to add you as a reviewer. Can you take a look when you get a chance? Thanks.

ChanYiLin commented 4 years ago

I like the feature, sounds good!

Is dynamic worker support currently available in all training modes, e.g., allreduce, parameter server, or even sync/async training?

I am also interested in implementing it, because my graduation thesis was exactly about implementing an autoscaling and ps/worker location-aware scheduling controller based on tf-operator, which had a lot of limitations at the time.

You can refer to my thesis http://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0007707605690577 and our implementation, called DRAGON: https://github.com/NTHU-LSALAB/DRAGON

So I would also like to know more about your implementation details, or maybe we can work on this together. Thanks 😊

zhujl1991 commented 4 years ago

@ChanYiLin Cool! I'll submit a PR, which has already been done internally, implementing this in a fairly naive way based on the 4th item in the Implementation Details. The PR already meets our needs for now. I think we can work on top of it together later to make the functionality more sophisticated.

johnugeorge commented 4 years ago

Interesting feature 👍

zhujl1991 commented 4 years ago

Closed in https://github.com/kubeflow/tf-operator/pull/1149