kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[feature] allow setting a default of execution caching disabled via a compiler CLI flag and env var #11092

Open DharmitD opened 1 month ago

DharmitD commented 1 month ago

Feature Area

/area backend /area sdk

What feature would you like to see?

Kubeflow Pipelines has a caching feature that allows users to avoid re-running pipeline components (steps in the pipeline) if the system detects that such a component has previously run and its outputs (artifacts) could be reused. The goal is to save time and computation.
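
To make the idea concrete, here is a minimal sketch of step-level caching in plain Python. It is not how KFP implements caching; `fingerprint`, `run_step`, and the in-memory `_cache` are hypothetical names chosen for illustration, and the key is assumed to depend only on the component name and its inputs.

```python
import hashlib
import json

# Hypothetical in-memory cache: fingerprint -> previously produced output.
_cache = {}


def fingerprint(component_name, inputs):
    """Derive a stable key from the component name and its inputs."""
    payload = json.dumps(
        {"component": component_name, "inputs": inputs}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(component_name, inputs, func, enable_cache=True):
    """Run func(**inputs), reusing an earlier output when caching is enabled.

    Returns a tuple of (output, cache_hit).
    """
    key = fingerprint(component_name, inputs)
    if enable_cache and key in _cache:
        return _cache[key], True
    output = func(**inputs)
    _cache[key] = output
    return output, False
```

Calling `run_step` twice with the same component name and inputs skips the second execution; passing `enable_cache=False` forces a re-run, which is the per-task switch the rest of this issue is about.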

Currently, the KFP compiler enables caching on every Component/Task unless the pipeline author calls

task.set_caching_options(False)

In other words:

from kfp import dsl

# create_dataset is assumed to be a component defined elsewhere.
@dsl.pipeline(name='iris-training-pipeline')
def my_pipeline():
   task_1 = create_dataset()
   task_2 = create_dataset()
   task_1.set_caching_options(False)
   # task 1 won’t enable caching, but task 2 will,
   # even though the author didn’t specify anything about task 2!

Having caching disabled by default is much more reasonable.

DSL Example

Caching is controlled on each individual pipeline Component / Task. Here is example KFP DSL code that disables caching for a single task:

@dsl.pipeline(name='iris-training-pipeline')
def my_pipeline():
   create_dataset_task = create_dataset()
   create_dataset_task.set_caching_options(False)  # this task won’t enable caching

Today, the KFP compiler enables caching on every Component/Task by default unless the pipeline author calls task.set_caching_options(False).

In other words:

@dsl.pipeline(name='iris-training-pipeline')
def my_pipeline():
   task_1 = create_dataset()
   task_2 = create_dataset()
   task_1.set_caching_options(False)
   # task 1 won’t try to use the cache, but task 2 will ...
   # even though the author didn’t specify anything about task 2!

When we are done with this feature, this will be true:

@dsl.pipeline(name='iris-training-pipeline')
def my_pipeline():
   task_1 = create_dataset()
   task_2 = create_dataset()
   task_3 = create_dataset()
   task_3.set_caching_options(True)
   # tasks 1 and 2 don’t try to use the cache. Task 3 does try to use the cache.
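
The difference between today's behavior and the proposed one can be sketched without KFP installed. The `Task` class below is a stand-in, not the real KFP task class, and `effective_caching` is a hypothetical helper; the only assumption is that an explicit `set_caching_options()` call always wins over whatever default the compiler applies.

```python
from typing import Optional


class Task:
    """Stand-in for a pipeline task; not the real KFP task class."""

    def __init__(self, name: str) -> None:
        self.name = name
        # None means the author never called set_caching_options().
        self.caching: Optional[bool] = None

    def set_caching_options(self, enable_caching: bool) -> "Task":
        self.caching = enable_caching
        return self


def effective_caching(tasks, default_enabled: bool) -> dict:
    """Map each task name to the caching value a compiler would emit."""
    return {
        t.name: default_enabled if t.caching is None else t.caching
        for t in tasks
    }


task_1, task_2, task_3 = Task("task_1"), Task("task_2"), Task("task_3")
task_3.set_caching_options(True)
tasks = [task_1, task_2, task_3]

today = effective_caching(tasks, default_enabled=True)
proposed = effective_caching(tasks, default_enabled=False)
```

With the default enabled, all three tasks end up cached; with the default disabled, only `task_3` does, because it opted in explicitly.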

What is the use case or pain point?

We need to fix the KFP compiler to stop enabling caching by default (by setting task.set_caching_options(True)) if the user didn’t ask for that. As described above, the effect of this behavior is that everything tries to use the cache by default, even though caching is disabled by default in the backend.

This might be a significant change, so we would like to discuss it with the KFP community and reach consensus before making changes. A related issue is here: https://github.com/kubeflow/pipelines/issues/10839



DharmitD commented 1 month ago

/assign @DharmitD

boarder7395 commented 1 month ago

I see the pain here, but my org expects caching to be the default, and requiring every component in a pipeline to enable it would be just as much of a pain as disabling it for each component. Alternative suggestion: allow the default to be set at the pipeline level?

@dsl.pipeline(name='iris-training-pipeline', caching=False)
def my_pipeline():
   task_1 = create_dataset()
   task_2 = create_dataset()
   task_3 = create_dataset()
   task_3.set_caching_options(True)
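
Note that the `caching=False` argument above is a proposal, not an existing parameter. One way such a pipeline-level default could interact with the other settings is a most-specific-wins lookup. This is only a sketch of that precedence under stated assumptions; `resolve_caching` and its parameters are hypothetical names.

```python
from typing import Optional


def resolve_caching(
    task_setting: Optional[bool],
    pipeline_default: Optional[bool],
    compiler_default: bool = True,
) -> bool:
    """Return the first explicit setting, from most to least specific.

    None means "not specified" at that level.
    """
    for setting in (task_setting, pipeline_default):
        if setting is not None:
            return setting
    return compiler_default
```

For the pipeline above, `resolve_caching(None, False)` gives the result for tasks 1 and 2, and `resolve_caching(True, False)` gives the result for task 3.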

gregsheremeta commented 2 weeks ago

Alternative suggestion: allow the default to be set at the pipeline level?

That's a good suggestion, and I think some day we'll get to implementing that. Ref: #10839

my org expects caching to be the default and requiring every component in a pipeline to enable it would be just as much of a pain as disabling it for each component

Yep, we brought this issue up at the August 14, 2024 KFP Community Meeting (agenda, recording), and that was the consensus there too. I suggested an additive change whereby a CLI flag or env var could set the default to disabled, and the meeting attendees were in favor of that. Hence #11142.
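
A rough sketch of what "additive" could mean here: the default stays enabled unless either the flag or the environment variable says otherwise. The flag name `--disable-caching-by-default`, the variable `EXAMPLE_DISABLE_CACHING_BY_DEFAULT`, and the function `caching_disabled_by_default` are placeholders invented for this example; this thread does not state the real names.

```python
import argparse
import os

# Hypothetical names; the thread does not specify the flag or the variable.
ENV_VAR = "EXAMPLE_DISABLE_CACHING_BY_DEFAULT"
_TRUTHY = {"1", "true", "t", "yes", "y", "on"}


def caching_disabled_by_default(argv=None, environ=None) -> bool:
    """Return True if the CLI flag or the env var disables caching by default."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(prog="example-compile")
    parser.add_argument(
        "--disable-caching-by-default",
        action="store_true",
        help="Compile tasks with caching off unless a task opts in.",
    )
    # Ignore unrelated arguments so the sketch can sit inside a larger CLI.
    args, _unknown = parser.parse_known_args(argv)
    env_value = environ.get(ENV_VAR, "").strip().lower()
    return args.disable_caching_by_default or env_value in _TRUTHY
```

Because neither setting is present by default, existing users who expect caching to be on see no change, while anyone who wants it off can opt in once per environment or per compile invocation.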

gregsheremeta commented 4 days ago

@DharmitD, per the last couple of comments, can you edit the title of this issue?

[feature] Update DSL to have default set to caching disabled -> [feature] allow setting a default of execution caching disabled via a compiler CLI flag and env var