kubernetes-sigs / jobset

JobSet: a k8s native API for distributed ML training and HPC workloads
https://jobset.sigs.k8s.io/
Apache License 2.0

JobSetTemplate API #573

Open ahg-g opened 4 months ago

ahg-g commented 4 months ago

What would you like to be added: A JobSetTemplate API similar to PodTemplate.

Why is this needed: APIs built on top of JobSet need to reference a JobSet spec. The common approach is to embed the JobSet spec inside the higher-level API, which makes it hard to validate. The other approach is to reference a template.
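
For comparison, the built-in PodTemplate that this request refers to is a standalone object that stores a pod template and nothing else. A minimal manifest looks like the following; the object name, labels, and container details are made up for illustration.

```yaml
# Built-in core/v1 PodTemplate: a namespaced object that only stores a
# pod template. Creating it does not create any Pods.
apiVersion: v1
kind: PodTemplate
metadata:
  name: worker-pod-template   # hypothetical name
template:                     # PodTemplateSpec sits at the top level, not under spec
  metadata:
    labels:
      app: worker             # hypothetical label
  spec:
    restartPolicy: Never
    containers:
      - name: worker
        image: busybox
        command: ["sleep", "100"]
```

A JobSetTemplate modeled on this would presumably carry a JobSet spec in the same way, but its exact shape is not defined in this issue.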

ahg-g commented 4 months ago

/feature

tenzen-y commented 4 months ago

/kind feature

googs1025 commented 3 months ago

Hello, I want to share some simple ideas, though I don’t know if they are what we need.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSetTemplate
metadata:
  name: my-jobset-template
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
    - name: workers
      replicas: 1
      template:
        spec:
          backoffLimit: 0
          completions: 2
          parallelism: 2
          template:
            spec:
              containers:
                - name: worker
                  image: bash:latest
                  command:
                    - bash
                    - -xc
                    - |
                      sleep 1000
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: my-jobset
spec:
  templateRef:
    name: my-jobset-template

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: paralleljobs
spec:
  replicatedJobs:
    - name: workers
      templateRef: my-jobset-template
    - name: driver
      templateRef: my-jobset-template
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSetTemplate
metadata:
  name: my-jobset-template
spec:
  replicas: 3
  template:
    spec:
      parallelism: 1
      completions: 1
      backoffLimit: 0
      template:
        spec:
          containers:
            - name: sleep
              image: busybox
              command:
                - sleep
              args:
                - 100s

If this approach is correct, perhaps we need another CR object and a controller to manage it. I'm sorry if I have misunderstood the request.

googs1025 commented 3 months ago

@ahg-g @danielvegamyhre @kannon92 Could you please check whether my understanding is correct? If so, I will take this on when I have time and write a KEP design document.

kannon92 commented 2 months ago

I’d look at how CronJob uses JobTemplates or even how JobSet uses a JobTemplate.

A user should create a JobSet without using the templates.

TrainJob could specify a template, and that template would be used to create a JobSet. I think that’s the flow.

Generally the templates are used if someone wants to compose the object.
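
As a reference point for the CronJob pattern mentioned above, here is a minimal batch/v1 CronJob. The Job to create is embedded under `spec.jobTemplate` rather than referenced by name, and the controller stamps out a new Job from it on each scheduled run. The object name, schedule, and container details are made up for illustration.

```yaml
# batch/v1 CronJob: the Job definition is embedded inline as jobTemplate.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-sleep         # hypothetical name
spec:
  schedule: "0 2 * * *"       # every day at 02:00
  jobTemplate:                # JobTemplateSpec: metadata plus a Job spec
    spec:
      backoffLimit: 0
      template:               # pod template used by each created Job
        spec:
          restartPolicy: Never
          containers:
            - name: sleep
              image: busybox
              command: ["sleep", "100"]
```

JobSet follows the same embedding pattern one level up: each entry in `replicatedJobs` carries its Job definition under `template`, as in the manifests earlier in this thread.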

googs1025 commented 2 months ago

Perhaps we can create a JobSetTemplateController to manage JobSetTemplate objects. JobSetTemplate is template metadata, and JobSet objects can reference it. But I'm not sure if this is a good design.

andreyvelich commented 2 months ago

According to this proposal: https://github.com/kubeflow/training-operator/pull/2171, we are planning to create TrainingRuntime and ClusterTrainingRuntime to represent blueprints for various ML training or HPC configurations. For LLM runtimes, we will support a list of different templates to fine-tune open-source foundation models.

Since we are using the JobSet API directly in the TrainingRuntime, I am wondering: do we still need JobSetTemplates?
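
To make the question concrete, embedding the JobSet API directly in a runtime object might look roughly like the sketch below. This is an illustration only: the `apiVersion` is a placeholder and the `template` field name is an assumption, neither taken from the linked proposal. Only the nested `replicatedJobs` structure mirrors the JobSet manifests earlier in this thread.

```yaml
# Hypothetical sketch: a higher-level object that embeds a JobSet spec inline
# instead of referencing a separate JobSetTemplate.
apiVersion: example.com/v1alpha1   # placeholder group/version (assumption)
kind: TrainingRuntime
metadata:
  name: example-runtime            # hypothetical name
spec:
  template:                        # assumed field holding the embedded JobSet spec
    spec:
      replicatedJobs:
        - name: workers
          replicas: 1
          template:
            spec:
              parallelism: 2
              completions: 2
              backoffLimit: 0
              template:
                spec:
                  containers:
                    - name: worker
                      image: busybox
                      command: ["sleep", "100"]
```

With this kind of embedding, the runtime object itself plays the role a JobSetTemplate would otherwise play, which is the overlap the question is pointing at.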

tenzen-y commented 2 months ago

As I understand it, @ahg-g mentioned that he wants to try supporting this JobSetTemplate feature regardless of Training Operator v2.

danielvegamyhre commented 2 months ago

Yes, we have another use case where JobSetTemplate would be useful - I can't elaborate much further right now since it isn't public yet, but there are definitely other use cases :)