kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[backend] Artifacts are not accepted as pipeline Parameters #9100

Closed chriswu99aaa closed 1 year ago

chriswu99aaa commented 1 year ago

Environment

google-cloud-aiplatform==1.11.0 google-cloud-bigquery==1.0.20 google-cloud-pipeline-components==1.0.20 google-api-core==1.32.0

Deployed on Google Cloud Platform.

kfp 1.8.13 kfp-pipeline-spec 0.1.16 kfp-server-api 1.8.5

I am running KFP from PyCharm, not from the UI, so I have only pasted the packages related to KFP here.

Steps to reproduce

  1. I created a Python function-based component using the @component decorator, and used Input[Dataset] for the inputs. The full function signature looks like this:
    def training_auto_arima_forecasting(
        data_in_y_train: Input[Dataset],
        data_in_y_test: Input[Dataset],
        feature_processor: Input[Artifact],
        metrics: Output[Metrics],
        model: Output[Model]
    ):
  2. When I compiled the pipeline, I got the following error: TypeError: The pipeline argument "data_in_y_train" is viewed as an artifact due to its type "Dataset". And we currently do not support passing artifacts as pipeline inputs. Consider type annotating the argument with a primitive type, such as "str", "int", "float", "bool", "dict", and "list".
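For context, here is a minimal, self-contained sketch that triggers the same compiler error (assuming kfp==1.8.x with the v2 namespace; the component and pipeline names are illustrative):

from kfp.v2 import compiler, dsl
from kfp.v2.dsl import Dataset, Input, component

@component
def consume(data_in: Input[Dataset]):
    print(data_in.path)

# An artifact-typed parameter in the pipeline signature is what the v1
# compiler rejects; primitive types (str, int, float, bool, dict, list) pass.
@dsl.pipeline(name='repro')
def repro_pipeline(data_in: Input[Dataset]):
    consume(data_in=data_in)

# Raises: TypeError: The pipeline argument "data_in" is viewed as an artifact ...
compiler.Compiler().compile(pipeline_func=repro_pipeline, package_path='repro.json')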

Expected result

I expect the annotations to be correct according to the type specification given by the documentation for KFP v2.

Materials and Reference

This is the implementation of the pipeline:

def auto_arima_pipeline(project_id: str, data_set_id: str, view_name: str, model_name: str,
                        deploy_endpoint: str, metric_name: str, threshold: str, version: str):
    with dsl.ExitHandler(notify_email_task):
        dataset_ingestion = ingest_data(project_id=project_id, data_set_id=data_set_id, view_name=view_name)
        preprocessed_data = preprocessing(dataset_ingestion.outputs['data_out_train'])
        features_data = time_series_transformation(preprocessed_data.outputs['data_out'])

        model_build = training_auto_arima_forecasting(
            features_data.outputs['data_out_train'],
            features_data.outputs['data_out_test'],
            features_data.outputs['transformation_pipeline']
        )
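For completeness, the pipeline is compiled with the v2 compiler along these lines (a sketch matching the traceback below; the output path is illustrative):

from kfp.v2 import compiler

compiler.Compiler().compile(
    pipeline_func=auto_arima_pipeline,
    package_path='auto_arima_pipeline.json',
)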

This is the time series feature transformation component:

from typing import NamedTuple

from kfp.v2.dsl import Artifact, Dataset, Input, Output


def time_series_transformation(
        data_out_train: Output[Dataset],
        data_out_test: Output[Dataset],
        transformation_pipeline: Output[Artifact],
        data_path: Input[Dataset]
) -> NamedTuple("Outputs", [("column_names", list)]):
    # Imports live inside the function body, as lightweight components require.
    # temporal_train_test_split is assumed to come from sktime.
    import pickle

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sktime.forecasting.model_selection import temporal_train_test_split

    data = pd.read_csv(data_path.path + ".csv", index_col=False)

    y = data.copy()

    # Temporal data split
    y_train, y_test = temporal_train_test_split(y, test_size=36)

    ### Export data as artifacts ###

    pd.DataFrame(y_train).to_csv(data_out_train.path + ".csv")
    pd.DataFrame(y_test).to_csv(data_out_test.path + ".csv")

    full_processor = ColumnTransformer(
        transformers=[]
    )
    file_name = transformation_pipeline.path + ".pkl"
    with open(file_name, 'wb') as file:
        pickle.dump(full_processor, file)

    column_names = list(y.columns)
    return (column_names,)
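And a sketch of the read side in the training component, assuming the same ".csv"/".pkl" suffix convention used above (the body here is illustrative, not the actual implementation):

from kfp.v2.dsl import Artifact, Dataset, Input, Metrics, Model, Output

def training_auto_arima_forecasting(
        data_in_y_train: Input[Dataset],
        data_in_y_test: Input[Dataset],
        feature_processor: Input[Artifact],
        metrics: Output[Metrics],
        model: Output[Model]
):
    import pickle

    import pandas as pd

    # Read back the splits written by time_series_transformation.
    y_train = pd.read_csv(data_in_y_train.path + ".csv", index_col=0)
    y_test = pd.read_csv(data_in_y_test.path + ".csv", index_col=0)

    # Load the pickled ColumnTransformer.
    with open(feature_processor.path + ".pkl", 'rb') as file:
        full_processor = pickle.load(file)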

Impacted by this bug? Give it a 👍.

chriswu99aaa commented 1 year ago
Traceback (most recent call last):
  File "C:\Users\shuaimin.wu\GCP_AI\mlops-vertexai\gcp_ai_mlops_accelerator\utils\compile-pipeline.py", line 146, in <module>
    compile(args.pipeline_type, pipeline_filename)
  File "C:\Users\shuaimin.wu\GCP_AI\mlops-vertexai\gcp_ai_mlops_accelerator\utils\compile-pipeline.py", line 67, in compile
    compiler.Compiler().compile(
  File "C:\Users\shuaimin.wu\mlops\lib\site-packages\kfp\v2\compiler\compiler.py", line 1303, in compile
    pipeline_job = self._create_pipeline_v2(
  File "C:\Users\shuaimin.wu\mlops\lib\site-packages\kfp\v2\compiler\compiler.py", line 1213, in _create_pipeline_v2
    raise TypeError(
TypeError: The pipeline argument "data_in_y_train" is viewed as an artifact due to its type "Dataset". And we currently do not support passing artifacts as pipeline inputs. Consider type annotating the argument with a primitive type, such as "str", "int", "float", "bool", "dict", and "list".

This is the full error message.

zijianjoy commented 1 year ago

cc @connor-mccarthy

connor-mccarthy commented 1 year ago

@chriswu99aaa, the error message is correct for KFP SDK v1; we don't support specifying artifact inputs or outputs in the pipeline interface. I suggest migrating to KFP SDK v2 (docs on artifacts) if you would like to do this.
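For reference, this is roughly what the interface looks like in KFP SDK v2 (a sketch assuming kfp>=2.0; names are illustrative). The v2 compiler accepts artifact-typed parameters in the pipeline signature, though how such inputs are supplied at submission time depends on the backend:

from kfp import dsl
from kfp.dsl import Dataset, Input

@dsl.component
def consume(data_in: Input[Dataset]):
    print(data_in.path)

@dsl.pipeline(name='v2-artifact-input')
def v2_pipeline(data_in: Input[Dataset]):
    consume(data_in=data_in)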

Alternatively, consider using a KFP v1 importer component.
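A minimal sketch of that approach (assuming kfp==1.8.x with the v2 namespace; the URI parameter and names are placeholders): the artifact's URI is passed as a plain string parameter, and dsl.importer turns it into a Dataset artifact inside the pipeline:

from kfp.v2 import dsl
from kfp.v2.dsl import Dataset, Input, component

@component
def consume(data_in: Input[Dataset]):
    print(data_in.path)

@dsl.pipeline(name='importer-example')
def importer_pipeline(dataset_uri: str):
    # Import the external artifact inside the pipeline instead of
    # declaring it as an artifact-typed pipeline parameter.
    importer_task = dsl.importer(
        artifact_uri=dataset_uri,
        artifact_class=Dataset,
        reimport=False,
    )
    consume(data_in=importer_task.output)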