kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[bug] Pipelines generated from kfp 2.10 ignore accelerator #11374

Closed: strangiato closed this issue 1 week ago

strangiato commented 1 week ago

When executing or compiling a pipeline using the kfp 2.10 SDK with the following configuration:

task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit("1")

The pipeline server ignores the GPU option, and the step is scheduled without the GPU in its resource configuration.

This appears to be a breaking change introduced in 2.10.

### Environment

### Steps to reproduce

  1. Create a Python virtual environment


    python -m venv venv
    source venv/bin/activate
  2. Install kfp 2.10

    pip install kfp==2.10
  3. Create the following pipeline with the file name acc-test.py

    
    from kfp import dsl, compiler

    @dsl.component()
    def empty_component():
        pass

    @dsl.pipeline(name='pipeline-accel')
    def pipeline_accel():
        task = empty_component()
        task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit("1")

    if __name__ == "__main__":
        compiler.Compiler().compile(pipeline_accel, 'pipeline.yaml')


  4. Compile the pipeline

    python acc-test.py

  5. Upload the pipeline and trigger an execution.

The pods created for the step will not include the `nvidia.com/gpu` in the pod spec resources, and the pod will get scheduled on a non-gpu node.

### Expected result

The pod should include the resources definition for the GPUs and the pod should be scheduled on a GPU node.

    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1

### Materials and reference

It looks like the bug was introduced in:
https://github.com/kubeflow/pipelines/pull/11097

When compiling the pipeline with 2.10, it renders the following:

    resources:
      accelerator:
        resourceCount: '1'
        resourceType: nvidia.com/gpu

With an older version such as 2.9, it renders the following:

    resources:
      accelerator:
        count: '1'
        type: nvidia.com/gpu
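The two renderings differ only in key names, so the regression can be spotted offline, without deploying anything. The sketch below (plain stdlib; the dict fragments simply mirror the two renderings above) classifies which schema a compiled container-spec fragment uses:

```python
# Sketch: classify the accelerator block of a compiled container spec
# fragment by its key names (old kfp <= 2.9 schema vs. new kfp 2.10 schema).

def accelerator_schema(resources: dict) -> str:
    """Return which accelerator schema a 'resources' fragment uses."""
    accel = resources.get("accelerator", {})
    if "resourceType" in accel or "resourceCount" in accel:
        return "new (2.10)"
    if "type" in accel or "count" in accel:
        return "old (<= 2.9)"
    return "none"

# Fragments matching the two renderings shown above:
spec_new = {"accelerator": {"resourceCount": "1", "resourceType": "nvidia.com/gpu"}}
spec_old = {"accelerator": {"count": "1", "type": "nvidia.com/gpu"}}

print(accelerator_schema(spec_new))  # new (2.10)
print(accelerator_schema(spec_old))  # old (<= 2.9)
```

A server that only understands the old keys will silently drop the new-style block, which matches the observed behavior of pods scheduling without the GPU.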


Fix in progress:

https://github.com/kubeflow/pipelines/pull/11373

---

HumairAK commented 1 week ago

Thanks Trevor. This is now resolved, and we'll do a patch release on 2.10 to pull in these changes.

vanHavel commented 4 days ago

Could it be that the same issue also applies to CPU requests and limits? After updating to kfp SDK 2.10, we saw pods running with no resources against kfp server 2.3. In the generated pipeline spec, the keys for CPU/RAM requests and limits appear to have changed as well (probably in https://github.com/kubeflow/pipelines/pull/11097).
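For illustration, the same kind of check works for CPU/memory. The key names below are assumptions modeled on the accelerator rename described earlier (old unprefixed keys vs. new `resource`-prefixed keys), not verified against the spec; the sketch flags fragments that carry only new-style keys, which an older server would ignore:

```python
# Sketch: flag resource fragments that only use new-style keys, which an
# older kfp server would not recognize. Key names here are assumptions
# modeled on the accelerator rename reported in this issue.
OLD_KEYS = {"cpuLimit", "cpuRequest", "memoryLimit", "memoryRequest"}
NEW_KEYS = {"resourceCpuLimit", "resourceCpuRequest",
            "resourceMemoryLimit", "resourceMemoryRequest"}

def only_new_style(resources: dict) -> bool:
    """True if the fragment has new-style keys and no old-style fallbacks."""
    keys = set(resources)
    return bool(keys & NEW_KEYS) and not (keys & OLD_KEYS)

# A fragment as a 2.10 SDK might emit it (values are strings in the new schema):
frag = {"resourceCpuLimit": "2", "resourceMemoryLimit": "4G"}
print(only_new_style(frag))  # True: an old server would schedule with no resources
```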

gregsheremeta commented 3 days ago

> Could it be that the same issue applies also to CPU requests and limits?

Yep, good catch: https://github.com/kubeflow/pipelines/commit/83dcf1a60919f5bcc0c644c8fdff94ad686cad07

Any interest in submitting a fix? You can use https://github.com/kubeflow/pipelines/pull/11373/files as an example.

vanHavel commented 3 days ago

I took a brief look at this; unfortunately, it isn't as simple as bringing the old fields back. The datatype has also changed from numbers to strings, so one would likely need to restore the old validation/conversion logic as well. I'm afraid I won't have enough time to work on this, and since I lack some of the context on why these things were changed, I don't want to break something else downstream.

rimolive commented 2 days ago

I opened https://github.com/kubeflow/pipelines/issues/11390 as a follow-up for the CPU/memory requests/limits issue, and I'll see if I can work on a fix soon.

gregsheremeta commented 2 days ago

> I took a brief look at this

@vanHavel thanks for trying! I appreciate it :smile: