kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/

TypeError: to_dict() missing 1 required positional argument: 'self' #10500

Closed eash11 closed 4 months ago

eash11 commented 6 months ago

Python 3.9.13

Steps to reproduce

I have created the following functions and components to run a simple pretrained model for text summarization. The input is a CSV file containing rows of text, and I want to generate a summary for each row.

Step 1 :

import kfp.dsl
import pandas as pd

@kfp.dsl.component
def read_csv_file(file_path: str):
    return pd.read_csv(file_path)

Step 2 :

import kfp.dsl
from transformers import AutoTokenizer
import transformers
from pandas import DataFrame

model = "meta-llama/Llama-2-7b-chat-hf"

@kfp.dsl.component
def preprocess_text(df: DataFrame):
    # Tokenize the text data using AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model)
    df['encoded_text'] = df['text'].apply(lambda text: tokenizer.encode(text, max_length=512, truncation=True))
    return df

I still have the following steps left to implement:

  1. define the text-summarization function, which calls llama2-7b-chat-hf on the df returned by preprocess_text() above
  2. publish the results of the text-summarization step
  3. define the pipeline with the following sequence of actions:
    1. read_csv -> preprocess text using AutoTokenizer -> perform text summarization -> publish the results as CSV
  4. compile the pipeline and see the results (rough sketch below).
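For step 4, compilation would look roughly like this (a sketch assuming KFP v2; summarization_pipeline is a placeholder for the pipeline function I have not written yet):

import kfp.compiler

# Compile the pipeline function to a YAML package that can be
# uploaded to a Kubeflow Pipelines instance.
kfp.compiler.Compiler().compile(
    pipeline_func=summarization_pipeline,  # hypothetical, not defined in this issue
    package_path='summarization_pipeline.yaml')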

I am getting the following error while running the code for preprocess_text() itself in Step 2.

Following is the error log:


TypeError                                 Traceback (most recent call last)
c:\Users\eashwar_n\kubeflow_experiment.ipynb Cell 3 line 1
      4 from pandas import DataFrame
      6 model = "meta-llama/Llama-2-7b-chat-hf"
      9 @kfp.dsl.component
---> 10 def preprocess_text(df: DataFrame):
     11     # Tokenize the text data using AutoTokenizer
     12     tokenizer = AutoTokenizer.from_pretrained(model)
     13     df['encoded_text'] = df['text'].apply(lambda text: tokenizer.encode(text, max_length=512, truncation=True))

File d:\myenv\pythonProject1\venv\lib\site-packages\kfp\dsl\component_decorator.py:119, in component(func, base_image, target_image, packages_to_install, pip_index_urls, output_component_file, install_kfp_package, kfp_package_path)
    108 if func is None:
    109     return functools.partial(
    110         component,
    111         base_image=base_image,
   (...)
    116         install_kfp_package=install_kfp_package,
    117         kfp_package_path=kfp_package_path)
--> 119 return component_factory.create_component_from_func(
    120     func,
    121     base_image=base_image,
    122     target_image=target_image,
    123     packages_to_install=packages_to_install,
    124     pip_index_urls=pip_index_urls,
    125     output_component_file=output_component_file,
    126     install_kfp_package=install_kfp_package,
    127     kfp_package_path=kfp_package_path)

File d:\myenv\pythonProject1\venv\lib\site-packages\kfp\dsl\component_factory.py:556, in create_component_from_func(func, base_image, target_image, packages_to_install, pip_index_urls, output_component_file, install_kfp_package, kfp_package_path)
    552 else:
    553     command, args = _get_command_and_args_for_lightweight_component(
    554         func=func)
--> 556 component_spec = extract_component_interface(func)
    557 component_spec.implementation = structures.Implementation(
    558     container=structures.ContainerSpecImplementation(
    559         image=component_image,
    560         command=packages_to_install_command + command,
    561         args=args,
    562     ))
    564 module_path = pathlib.Path(inspect.getsourcefile(func))

File d:\myenv\pythonProject1\venv\lib\site-packages\kfp\dsl\component_factory.py:422, in extract_component_interface(func, containerized, description, name)
    419     return None
    421 signature = inspect.signature(func)
--> 422 name_to_input_spec, name_to_output_spec = get_name_to_specs(
    423     signature, containerized)
    424 original_docstring = inspect.getdoc(func)
    425 parsed_docstring = docstring_parser.parse(original_docstring)

File d:\myenv\pythonProject1\venv\lib\site-packages\kfp\dsl\component_factory.py:271, in get_name_to_specs(signature, containerized)
    265     name_to_output_specs[maybe_make_unique(
    266         name,
    267         list(name_to_output_specs))] = make_output_spec(annotation)
    269 # parameter type
    270 else:
--> 271     type_string = type_utils._annotation_to_type_struct(annotation)
    272     name_to_input_specs[maybe_make_unique(
    273         name, list(name_to_input_specs))] = make_input_spec(
    274             type_string, func_param)
    276 ### handle return annotations ###

File d:\myenv\pythonProject1\venv\lib\site-packages\kfp\dsl\types\type_utils.py:556, in _annotation_to_type_struct(annotation)
    554     return None
    555 if hasattr(annotation, 'to_dict'):
--> 556     annotation = annotation.to_dict()
    557 if isinstance(annotation, dict):
    558     return annotation

TypeError: to_dict() missing 1 required positional argument: 'self'
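The last frame above shows the mechanism: the bare DataFrame class is used as the parameter annotation, the class has a to_dict attribute, so kfp calls annotation.to_dict() on the class itself and self is never bound. A minimal reproduction of just that check:

from pandas import DataFrame

# kfp's type_utils._annotation_to_type_struct does roughly this with the
# annotation object, which here is the DataFrame class itself:
hasattr(DataFrame, 'to_dict')  # True, so it proceeds to call...
DataFrame.to_dict()  # TypeError: to_dict() missing 1 required positional argument: 'self'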

Expected result

The final CSV must contain the summarized text in a new column, one for each text record in the input CSV.


geier commented 5 months ago

You can't pass around DataFrames like this. Have a look at how to handle artifacts https://www.kubeflow.org/docs/components/pipelines/v2/data-types/artifacts/
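For example, a minimal sketch of an artifact-based version of the two components in this issue (assuming KFP v2; pandas/transformers are installed per component via packages_to_install, and the input_data/output_data parameter names are illustrative):

import kfp.dsl
from kfp.dsl import Dataset, Input, Output

@kfp.dsl.component(packages_to_install=['pandas'])
def read_csv_file(file_path: str, output_data: Output[Dataset]):
    import pandas as pd
    # Write the DataFrame to the artifact's path instead of returning it.
    pd.read_csv(file_path).to_csv(output_data.path, index=False)

@kfp.dsl.component(packages_to_install=['pandas', 'transformers'])
def preprocess_text(input_data: Input[Dataset], output_data: Output[Dataset]):
    import pandas as pd
    from transformers import AutoTokenizer

    model = 'meta-llama/Llama-2-7b-chat-hf'
    df = pd.read_csv(input_data.path)
    # Tokenize the text data using AutoTokenizer, as in the original snippet.
    tokenizer = AutoTokenizer.from_pretrained(model)
    df['encoded_text'] = df['text'].apply(
        lambda text: tokenizer.encode(text, max_length=512, truncation=True))
    df.to_csv(output_data.path, index=False)

@kfp.dsl.pipeline(name='text-summarization')
def summarization_pipeline(file_path: str):
    read_task = read_csv_file(file_path=file_path)
    # A Dataset artifact, not a DataFrame, flows between the steps.
    preprocess_text(input_data=read_task.outputs['output_data'])

Because the function signatures no longer use DataFrame as an annotation, the decorator has nothing to call to_dict() on, and the data moves between steps as a Dataset artifact rather than as an in-memory object.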

rimolive commented 4 months ago

/close

As @geier commented, this is not the correct way to pass pandas DataFrames in kfp.

google-oss-prow[bot] commented 4 months ago

@rimolive: Closing this issue.
