kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0
3.62k stars 1.63k forks source link

fix(sdk): accelerator type setting in kfp #11373

Closed HumairAK closed 1 week ago

HumairAK commented 1 week ago

In this PR, new API fields were introduced to set accelerator resources via pipeline spec. Note that type and count were deprecated. However in the follow up implementation pr here we can see that these fields are removed entirely from the compilation step. This has basically broken setting accelerator types in KFP, because the driver is expecting type and count to be present, and these are thus not being set at all.

This PR re-introduces these fields, see test case for how the pipeline spec will look like. This is to allow for a proper deprecation period on the api, so we can adjust the driver changes accordingly.

To verify the changes you can use the following sample pipeline on the changes before/after:

from kfp import dsl
from kfp import compiler
@dsl.component(base_image="docker.io/python:3.9.17")
def empty_component():
    pass

@dsl.pipeline(name='pipeline-accel')
def pipeline_accel():
    task = empty_component()
    task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit("1")

compiler.Compiler().compile(
    pipeline_func=pipeline_accel,
    package_path=__file__.replace('.py', '-v2.yaml'))

After the changes this should compile to:

...
deploymentSpec:
  executors:
    exec-empty-component:
      container:
        args: ..omitted
        command: ..omitted
        image: ...
        resources:
          accelerator:
            count: '1'
            resourceCount: '1'
            resourceType: nvidia.com/gpu
            type: nvidia.com/gpu
...

The resulting executor pod should have:

resources:
        limits:
          nvidia.com/gpu: '1'
        requests:
          nvidia.com/gpu: '1'

Fixes #11374

HumairAK commented 1 week ago

Follow up to this would be to update the driver code, the golang proto changes snuck in here. These should have probably been added with the implementation pr for the accelerator changes. Which makes me think we really should be testing for proto changes and generated compiled diffs. We now just need to update driver code to utilize these fields instead of the deprecated ones.

If we are expecting no other downstream impacts, we can remove these deprecated fields for KFP 2.4 or 2.5

HumairAK commented 1 week ago

cc @chensun PTAL

this one I think is severe enough that it warrants a patch release for 2.10, being unable to use GPUs is no good

chensun commented 1 week ago

cc @chensun PTAL

this one I think is severe enough that it warrants a patch release for 2.10, being unable to use GPUs is no good

Sounds reasonable for a patch release. I'll target it by this week.

google-oss-prow[bot] commented 1 week ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chensun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[sdk/OWNERS](https://github.com/kubeflow/pipelines/blob/master/sdk/OWNERS)~~ [chensun] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
gregsheremeta commented 1 week ago

this one I think is severe enough that it warrants a patch release for 2.10, being unable to use GPUs is no good

It should be a hard rule that any time we accidentally make a breaking API change, it warrants a patch release to undo the breakage.

Of course, best if we never let breaking API changes sneak in :sweat_smile:

It would also be good if we have some announcement mechanism to let people know that SDK 2.10.0 is broken in this way. Perhaps we can add a note to the releases page? Shrug.

HumairAK commented 1 week ago

@gregsheremeta agree, FWIW I send out a note in the community slack, but a more permanent notice somewhere would be ideal. We can utilize the KFP discussions for this if we wanted to (we can pin those), or we can also Pin the issue itself