kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
[SDK] Create Job From Docker API #1878

Open andreyvelich opened 1 year ago

andreyvelich commented 1 year ago

Previously, we added the create_job_from_func API: https://github.com/kubeflow/training-operator/pull/1659. This API is useful for users who want to quickly convert their training function into a Kubeflow distributed Training Job, but it is hard to use for large models, since all imports and code must be self-contained within the function.
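To illustrate the self-contained constraint, here is a minimal sketch. It assumes, purely for illustration, that the training function is shipped as source text and executed in a fresh namespace; `run_in_fresh_namespace` and both sample functions are hypothetical names and not part of the SDK.

```python
import textwrap

# Assumption: the training function travels as plain source text and runs
# in a namespace that contains nothing from the caller's module.
SELF_CONTAINED = textwrap.dedent('''
    def train(parameters):
        import math  # imported inside the function, so it ships with it
        return math.sqrt(parameters["x"])
''')

# Relies on a module-level import that is NOT part of the shipped text.
NOT_SELF_CONTAINED = textwrap.dedent('''
    def train(parameters):
        return math.sqrt(parameters["x"])
''')


def run_in_fresh_namespace(source, parameters):
    """Execute the function source in an empty namespace and call train()."""
    namespace = {}
    exec(source, namespace)
    return namespace["train"](parameters)


if __name__ == "__main__":
    print(run_in_fresh_namespace(SELF_CONTAINED, {"x": 9.0}))
    try:
        run_in_fresh_namespace(NOT_SELF_CONTAINED, {"x": 9.0})
    except NameError as err:
        print("fails without the import:", err)
```

With a large code base, moving every import and helper inside one function quickly becomes impractical, which is the motivation for an image-based API.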

Similar to KFP Containerized Python Components, we can introduce a new API called create_job_from_docker, which helps users convert their training code into a Kubeflow Training Job.

Initially, we can have the following signature:

def create_job_from_docker(
    self,
    name: str,
    namespace: Optional[str] = None,
    job_kind: Optional[str] = None,
    base_image: str = constants.PYTORCHJOB_BASE_IMAGE,
    command: Optional[str] = None,
    num_worker_replicas: Optional[int] = None,
):
    ...

This simply constructs a Training Job from the base image.
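As a rough sketch of what "constructing a Training Job from the base image" could look like, the helper below builds a PyTorchJob custom resource as a plain dict. The helper name `build_pytorchjob_manifest`, the placeholder image, and the `OnFailure` restart policy are assumptions for illustration, not the SDK's implementation.

```python
import shlex
from typing import Any, Dict, Optional

# Assumption: placeholder image standing in for constants.PYTORCHJOB_BASE_IMAGE.
DEFAULT_BASE_IMAGE = "example.com/pytorch-base:latest"


def build_pytorchjob_manifest(
    name: str,
    namespace: str = "default",
    base_image: str = DEFAULT_BASE_IMAGE,
    command: Optional[str] = None,
    num_worker_replicas: Optional[int] = None,
) -> Dict[str, Any]:
    """Return a PyTorchJob custom resource as a dict (hypothetical helper)."""
    # PyTorchJob expects the training container to be named "pytorch".
    container: Dict[str, Any] = {"name": "pytorch", "image": base_image}
    if command:
        # Split the shell-style command string into an argv list.
        container["command"] = shlex.split(command)

    def replica(count: int) -> Dict[str, Any]:
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {"spec": {"containers": [dict(container)]}},
        }

    # One master is always created; workers are optional.
    replica_specs = {"Master": replica(1)}
    if num_worker_replicas:
        replica_specs["Worker"] = replica(num_worker_replicas)

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"pytorchReplicaSpecs": replica_specs},
    }
```

A dict like this could then be submitted through the Kubernetes custom objects API; supporting other values of job_kind would mean swapping the kind and the replica spec key.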

In the future, we can introduce parameters such as target_image and packages_to_install, which would allow the SDK to build a Docker image on the fly using the Docker client. Users would need a running Docker daemon to use this.
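One possible first step for such a build is rendering a Dockerfile from the base image and the package list. The sketch below only produces the Dockerfile text and does not talk to a Docker daemon; `render_dockerfile`, the `/app` working directory, and the `entrypoint` parameter are hypothetical choices rather than anything proposed in this issue.

```python
import shlex
from typing import List, Optional


def render_dockerfile(
    base_image: str,
    packages_to_install: Optional[List[str]] = None,
    entrypoint: str = "train.py",
) -> str:
    """Render Dockerfile text that layers the user's code on top of base_image."""
    lines = [f"FROM {base_image}", "WORKDIR /app"]
    if packages_to_install:
        # Quote each requirement so specifiers like "torch>=2.0" are not
        # interpreted by the shell as a redirect.
        quoted = " ".join(shlex.quote(pkg) for pkg in packages_to_install)
        lines.append(f"RUN pip install --no-cache-dir {quoted}")
    lines.append("COPY . /app")
    lines.append(f'ENTRYPOINT ["python", "{entrypoint}"]')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile("example.com/pytorch-base:latest", ["transformers", "torch>=2.0"]))
```

The rendered text would be written next to the training code and handed to the Docker client for the actual build, tagged as target_image.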

Related: https://github.com/kubeflow/common/issues/66.

What do you think @kubeflow/wg-training-leads @tenzen-y @kuizhiqing @yaobaiwei @zw0610 @droctothorpe ?

terrytangyuan commented 1 year ago

+1 for ease of use, although I would avoid mentioning "docker", which is implementation-specific.

andreyvelich commented 1 year ago

Makes sense. Any suggestions, @terrytangyuan (e.g. create_job_from_image)?

terrytangyuan commented 1 year ago

What about create_job(func, img), which calls the underlying implementation?

andreyvelich commented 1 year ago

Makes sense. So we just provide users with one API called create_job, where they can set a Custom Resource, a function, or an image, and we process the request accordingly, right?
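A minimal sketch of how such a single entry point might dispatch on its arguments is shown below. The parameter names `job`, `train_func`, and `base_image` and the returned path labels are assumptions for illustration; the real signature was not settled in this thread.

```python
from typing import Any, Callable, Dict, Optional


def create_job(
    job: Optional[Dict[str, Any]] = None,
    train_func: Optional[Callable[..., Any]] = None,
    base_image: Optional[str] = None,
) -> str:
    """Pick the creation path for the given arguments and return its label."""
    if job is not None:
        # A full custom resource already describes everything.
        if train_func is not None or base_image is not None:
            raise ValueError("pass either a custom resource, or a function and/or image")
        return "custom_resource"
    if train_func is not None:
        # A function may be combined with an image to run it in.
        return "function"
    if base_image is not None:
        return "image"
    raise ValueError("one of job, train_func, or base_image is required")
```

In a real SDK each branch would call the corresponding internal implementation instead of returning a label, so the public surface stays at one method.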

terrytangyuan commented 1 year ago

Yep, exactly. This will avoid exploding the list of public APIs.

tenzen-y commented 1 year ago

It's a good idea. SGTM

> In the future, we can introduce parameters such as target_image and packages_to_install, which would allow the SDK to build a Docker image on the fly using the Docker client. Users would need a running Docker daemon to use this.

In future work, it might be good to add a parameter that controls whether the built image is pushed to the registry.
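As a sketch of that idea, a boolean flag could decide whether a push step follows the build. The helper below only plans the Docker CLI commands and does not run them; `plan_image_commands` and its `push` parameter are hypothetical names.

```python
from typing import List


def plan_image_commands(
    target_image: str,
    push: bool = False,
    context: str = ".",
) -> List[List[str]]:
    """Return the Docker CLI commands to run, in order, as argv lists."""
    commands = [["docker", "build", "--tag", target_image, context]]
    if push:
        # Only push when the caller explicitly opts in.
        commands.append(["docker", "push", target_image])
    return commands
```

Each argv list could be passed to subprocess.run(cmd, check=True); defaulting push to False keeps local experiments from publishing images by accident.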

johnugeorge commented 1 year ago

/cc @gaocegege

andreyvelich commented 1 year ago

/assign @andreyvelich

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y commented 11 months ago

/lifecycle frozen