takikadiri / kedro-boot

A kedro plugin that streamlines the integration between Kedro projects and third-party applications, making it easier for you to develop end-to-end production-ready data science applications.
Apache License 2.0
34 stars 6 forks source link

Literal String Instead of List of Lists in Template_Params Causes Unexpected Behavior in CustomDataset Constructor #6

Open hugocool opened 9 months ago

hugocool commented 9 months ago

We are experiencing an issue where the literal string '[[ data_ids ]]' is being passed to the constructor of our dataset, instead of the expected list of lists. This problem occurs in the context of a FastAPI route with a Kedro session and a custom dataset configuration using a catalog entry.

Here are the details of the issue and our current workaround:

  1. FastAPI Route with Kedro Session:

    In our FastAPI application, we have a route that uses a Kedro session to run a task. The data_ids parameter, which is intended to be a list of integers, is being incorrectly passed as a string. Here's the relevant part of the route:

    # ... [previous code for route setup] ...
    
    @app.post("/api/v1/data", response_model=OutputData)
    async def get_output_data(input_data: InputData):
       data_ids = input_data.data_ids  # data_ids is expected to be a list of ints
    
       # ... [additional code and logging] ...
    
       output = session.run(
           name="web_api", 
           inputs={"input_data": input_data},
           template_params={"data_ids": data_ids},
       )
  2. Catalog Entry Issue and Workaround:

    The issue also manifests in our catalog entry configuration for a custom dataset:

    result:
     type: performance_optimisation_engine.extras.datasets.web_api_dataset.ResultDataset
     credentials: mysql
     data_ids: [[ data_ids ]]

    With this setup, the string '[[ data_ids ]]' is passed to the constructor instead of the list of lists. To address this, we modified the catalog entry to pass data_ids as a string explicitly:

    result:
     type: performance_optimisation_engine.extras.datasets.web_api_dataset.ResultDataset
     credentials: mysql
     data_ids: '[[ data_ids ]]'

    Then, in the constructor of our dataset, we convert this string back into a list of lists:

    s = str(self.data_ids).replace("'", '"')
    self.data_ids = json.loads(s)

While this workaround is functional, it's not ideal. We suspect this might be a bug or at least a feature that requires clarification in the documentation. Any assistance in resolving this issue more elegantly would be greatly appreciated.

Thank you for your time and consideration.

takikadiri commented 9 months ago

Thank you for raising this issue. This highlights an important point in the catalog's template rendering

Basically the template_params is used at each iteration/request for rendering any Jinja template that is present in your dataset's attributes and having [[ ]] as expression. We don't use the default Jinja block start/ending {{ }} because they are interpreted/rendered by kedro (ConfigLoader and TemplatedConfigLoader) at catalog creation. The new OmegaConfigLoader doesn't use Jinja

At iteration/request time the datasets are already initialized and materialized as python object, kedro-boot isn't really rendering a yaml catalog, it is instead alterating/rendering the attributes of an already initialized datasets. Concretely at request time you already have a result dataset/object with data_ids = [[ data_ids ]] that will be rendered before running the pipeline.

Since we're rendering the python object directly (we don't render a yaml), kedro-boot recursively render every string attributes (handle also strings that are imbriqued in a list or dict) and every Path attributes.

We have two problems here. Each one is handled with a line of your workaround

1. [[ expression ]] sometimes ignored by kedro-boot renderer In your use case something weird happened, your self.data_ids attribute was not a string nor a Path. It was initialized as a list in your dataset object, so kedro-boot ignore it. This is because in the loading of catalog.yaml the data_ids attribute was casted as list [ ... ]. So when you added a quote to the data_ids value, it was casted as string, and correctly interpreted by kedro-boot, but you was forced to fix the format with s = str(self.data_ids).replace("'", '"').

This is clearly an edge case of having [[ ]] as block start/ending of the Jinja expression. This lead to inconsistent behaviors when the template (dataset value) contain only the jinja expression dataset_attribute: [[ expression ]]. We should consider switching to another characters. I propose {[ expression ]}. What do you think ?

2. Casting a non-string templatized dataset attribute Jinja rendering always return a string. So in the case of a non-string attribute, you will always need to cast your expression. In your example you can cast it with self.data_ids = json.loads(s) or ast.literal_eval(s). I don't see this as a real problem as it's feel natural to cast somewhere a value that was being received as a text from the web server.

Let me know what you think