karthikv2k commented 5 years ago

Problem

Currently in Kubeflow, we have a controller per framework (e.g. TF-Job, and PyTorch-Operator) and to support a new framework, the message we are giving is that users have to write a new controller. This is a lot of friction for data scientists who most likely don’t know Go-Lang and K8s. Even if they do, getting a version of controller deployed in a corp cluster is not easy.

Proposed Solution

However, in reality users actually don’t have to write a new controller if they have a generic Gang scheduling API and in fact TF-Job controller exposes a restricted version of the API that works for almost all of the use cases. In fact, the Google AI Platform team implemented distributed PyTorch and XGBoost jobs using TF-Job API for the Google AI-Hub. So if we can create a controller for gang scheduling it will make it easy to add support for new frameworks.

Advantages

Less effort to support a new framework (users don’t need K8s or Go-Lang expertise) A better story for portability between Kubeflow and other platforms like Mesos. The same container can be used in other platforms without any changes.

Other infras that support some version of gang scheduling API

Google AI Platform training
Amazon Sagemaker
Mesos/Uber’s Michelangelo
YARN (not natively but with some application logic in Application Master, e.g. TonY)
Frameworks Support

From my understanding, distributed training for following frameworks can be implemented easily using just a generic gang scheduling.
TensorFlow
Horovod
PyTorch
XGBoost
Julia

LightGBM

Rough API Spec

Almost same as current tf-job spec but with more generic names and generalizing #worker groups.

apiVersion: kubeflow.org/v1beta1
kind: GangJob
metadata:
generateName: gangjob
namespace: kubeflow
spec:
replicaSpecs:
WorkerGroup1:
  replicas: 4
  restartPolicy: OnFailure
  template:
    spec:
      containers:
      - name: 
        image: 
        command:
WorkerGroup2:
  replicas: 3
  restartPolicy: OnFailure
  template:
    spec:
      containers:
      - name: 
        image: 
        Command:
.
.
.
WorkerGroupN:
  replicas: 1
  restartPolicy: OnFailure
  template:
    spec:
      containers:
      - name: 
        image: 
        command:

issue-label-bot[bot] commented 5 years ago

Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.86. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

karthikv2k commented 5 years ago

CC @richardsliu @abhi-g

richardsliu commented 5 years ago

/cc @k82cn /cc @gaocegege /cc @johnugeorge

johnugeorge commented 5 years ago

In this proposal, How is the distributed training environment setup for each framework? Eg: TF_CONFIG env in tensorflow(https://www.tensorflow.org/guide/distribute_strategy#setting_up_tf_config_environment_variable) and MASTER_ADDR, MASTER_PORT etc in pytorch (https://pytorch.org/tutorials/intermediate/dist_tuto.html#initialization-methods)

It looks similar to the common operator discussion that would support all frameworks that are described in the proposal.

k82cn commented 5 years ago

Some input here: Gang-scheduling/coscheduling is the requirements to the scheduler, so common operator defined the SchedulingSpec to communicate with kube-batch; for the other part, e.g. multiple pod template, it's more about controller instead of scheduling policy. Both of them are fundamental feature to k8s. please refer to https://github.com/kubernetes/kubernetes/issues/68357 , http://github.com/volcano-sh/volcano on what we're doing there :)

karthikv2k commented 5 years ago

In this proposal, How is the distributed training environment setup for each framework?

The user's code will be responsible for setting the right environment variables for the framework they are using. The gang scheduler/controller cab set a cluster spec in an env variable that is similar to TF_CONFIG. Taking a cluster spec and convert into framework specific env should be trivial.

It looks similar to the common operator discussion that would support all frameworks that are described in the proposal.

where can find "common operator discussion"? form the name it looks similar.

karthikv2k commented 5 years ago

Some input here: Gang-scheduling/coscheduling is the requirements to the scheduler, so common operator defined the SchedulingSpec to communicate with kube-batch; for the other part, e.g. multiple pod template, it's more about controller instead of scheduling policy. Both of them are fundamental feature to k8s. please refer to kubernetes/kubernetes#68357 , http://github.com/volcano-sh/volcano on what we're doing there :)

@k82cn https://github.com/volcano-sh/volcano/blob/master/docs/design/job-api.md describes everything that I need and even goes beyond that! However, I haven't got a clear idea on when to use volcano vs a Kubeflow job operator. Are these complimentary offerings?

k82cn commented 5 years ago

@k82cn https://github.com/volcano-sh/volcano/blob/master/docs/design/job-api.md describes everything that I need and even goes beyond that!

Very glad to hear that :)

However, I haven't got a clear idea on when to use volcano vs a Kubeflow job operator. Are these complimentary offerings?

Volcano is to enhance k8s's batch capability (based on kubernetes/kubernetes#68357 ); Kubeflow is easier for user to use ML frameworks :)

And we're going to work together on batch scheduling part :)

johnugeorge commented 5 years ago

@karthikv2k This is the issue tracking it https://github.com/kubeflow/tf-operator/issues/960

kubeflow / common

Proposal: using gang scheduling API for generic distributed training support in Kubeflow #37

Problem

Proposed Solution

Advantages

Other infras that support some version of gang scheduling API

Frameworks Support

Rough API Spec