kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[sdk] List of input paths is consumed as list of input values #6895

Closed skogsbrus closed 2 months ago

skogsbrus commented 2 years ago

Environment

kfp==1.6.4
kfp-pipeline-spec==0.1.10
kfp-server-api==1.7.0

Steps to reproduce

I'm not confident that this isn't a user error, but I haven't found any documentation that describes my use case where a component consumes a list of input paths.

  1. Create file outputs that are too large to pass as values.
  2. Pass them as inputs to a component that expects a list of input paths
  3. The components that created the files fail with the error: "This step is in Error state with this message: failed to save outputs: Request entity too large: limit is 3145728". According to https://github.com/kubeflow/pipelines/issues/3134, this suggests that the final component is trying to consume the outputs as values rather than as paths.
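
As a rough sanity check on the numbers in step 3 (a sketch, not part of the original report): the limit quoted in the error message works out to 3 MiB, while the file written by create_file_op in the source code below is 10 MiB, so each output is well over the limit.

```python
# Limit quoted in the error message, in bytes.
LIMIT_BYTES = 3145728

# Size of the file written by create_file_op:
# seek to offset (10 MiB - 1), then write a single byte.
FILE_BYTES = (1024 * 1024 * 10) - 1 + 1

MIB = 1024 * 1024
limit_mib = LIMIT_BYTES / MIB
file_mib = FILE_BYTES / MIB
exceeds_limit = FILE_BYTES > LIMIT_BYTES

print(f"limit = {limit_mib:g} MiB, file = {file_mib:g} MiB, exceeds limit: {exceeds_limit}")
```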

Expected result

The created paths should be passed as InputPaths, not as InputValues

Materials and Reference

Source code:

import kfp
from kfp import dsl
from kfp.components import InputPath, OutputPath
from typing import List

def create_component_from_func():
    def decorator(func):
        return kfp.components.create_component_from_func(func=func)
    return decorator

@create_component_from_func()
def list_of_input_paths_op(input_paths: List[InputPath]):
    for p in input_paths:
        print(f"Got input path {p}")

@create_component_from_func()
def create_file_op(output_path: OutputPath()):
    from pathlib import Path
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "wb") as out:
        out.seek((1024 * 1024 * 10) - 1)
        out.write(b'\0')

@dsl.pipeline(name="list-of-input-paths", description="Demonstrates a bug")
def pipeline():
    a = create_file_op()
    b = create_file_op()
    list_of_input_paths_op([a.output, b.output])

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(pipeline, "list_of_inputpaths.yml")

If I specify the type hint for list_of_input_paths_op as List[InputPath()] instead, I get an error:

TypeError: Parameters to generic types must be types. Got <kfp.components._python_op.InputPath object at 0x7fa54173caf0>.

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

JonChanGit commented 1 year ago

I'm having this problem too. Is there any update?

skogsbrus commented 1 year ago

If you are able to structure your code so that you pass a directory instead (containing the files you're interested in), I think that should work. Alternatively you can use mounted volumes to pass arbitrary files.

But AFAIK there's no fix for this specific issue.
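
A minimal sketch of the directory workaround, untested against a real cluster. The function names and the part-N.bin file naming are hypothetical. To keep it runnable without kfp installed, the paths are plain strings here; in kfp 1.x the producer's parameter would presumably be annotated with OutputPath() and the consumer's with InputPath(), with both functions wrapped by create_component_from_func, assuming the backend accepts a directory as an output artifact.

```python
from pathlib import Path


def create_files(output_dir_path: str, n_files: int = 3, size_bytes: int = 1024) -> None:
    """Producer body: write n_files files under one output directory.

    Assumption: in kfp 1.x, output_dir_path would be annotated as OutputPath().
    """
    out_dir = Path(output_dir_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i in range(n_files):
        # Same trick as create_file_op: seek, then write one byte.
        with open(out_dir / f"part-{i}.bin", "wb") as out:
            out.seek(size_bytes - 1)
            out.write(b"\0")


def list_input_files(input_dir_path: str) -> list:
    """Consumer body: return the sorted paths of the files in the input directory.

    Assumption: in kfp 1.x, input_dir_path would be annotated as InputPath().
    """
    paths = sorted(str(p) for p in Path(input_dir_path).iterdir() if p.is_file())
    for p in paths:
        print(f"Got input path {p}")
    return paths
```

The consumer then receives a single path regardless of how many files the producer wrote, which sidesteps the List[InputPath] annotation entirely.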

gpotti commented 1 year ago

Hello, I'm very new to Kubeflow and stumbled on this issue while looking for a solution to a similar problem. In my case, I have a modeling function that consumes a dataframe created by another function. It also needs to consume the output path from a previous function that handles the download, so that it can read the downloaded files. I was wondering if there is an option to pass multiple InputPaths to a container. As I mentioned, I'm very new to ML and MLOps, so if anything I've mentioned is unclear, I can try to explain again. Please let me know.

gpotti commented 1 year ago

Please ignore my question. I figured out that I can use InputPath twice in the same function and point the two parameters to different artifacts from previous containers.

skogsbrus commented 1 year ago

Yup, if you just have two paths that works. It doesn't work well for a large number of paths / dynamic number of paths though.
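
For illustration, a function with two separate path inputs might look like the sketch below. The function and parameter names are hypothetical, and the paths are plain strings so the example runs without kfp; in kfp 1.x each parameter would presumably be annotated with InputPath() and wired up with something like consume_two_inputs(a.output, b.output).

```python
from pathlib import Path


def consume_two_inputs(first_path: str, second_path: str) -> dict:
    """Read two independent input files and report their sizes in bytes.

    Assumption: in kfp 1.x, both parameters would be annotated as InputPath().
    """
    sizes = {}
    for name, p in (("first", first_path), ("second", second_path)):
        sizes[name] = Path(p).stat().st_size
        print(f"Got input path {p} ({sizes[name]} bytes)")
    return sizes
```

The number of inputs is fixed by the function signature, which is why this approach does not extend to a dynamic number of paths.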

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 months ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.