kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

`mode: "a"` in `pandas.CSVDataset` still overwrites the file #336

Open astrojuanlu opened 9 months ago

astrojuanlu commented 9 months ago

Description

As per title.

Context

Originally reported in https://linen-slack.kedro.org/t/15705930/hi-everyone-i-have-an-easy-question-slightly-smiling-face-wh#b644730d-0683-4426-8e40-4e7ef96d8cc7

Steps to Reproduce

Starting from the pandas-iris starter, I tweaked the pipeline like this:

from kedro.pipeline import Pipeline, node, pipeline

def add_data(df):
    # Take the last row and bump its index by one to simulate a newly arrived record.
    new_data = df.iloc[len(df) - 1:]
    new_data.index = [new_data.index[0] + 1]
    return new_data

def create_pipeline(**kwargs):
    return pipeline(
        [
            node(
                func=add_data,
                inputs=["example_iris_data"],
                outputs="new_data",
            ),
        ]
    )

And the catalog looks like this:

example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv

new_data:
  type: pandas.CSVDataSet
  filepath: data/03_primary/new_data.csv
  save_args:
    mode: a
    header: false

Expected Result

The new rows get appended to the file.

Actual Result

The file is silently overwritten every time.

noklam commented 9 months ago

https://github.com/kedro-org/kedro-plugins/issues/513

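The root cause is that we hardcode `mode="wb"` when writing the file, so the `mode` passed in `save_args` never reaches the file on disk. This is not handled consistently across datasets, so we need to review all of them at once.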
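To illustrate the effect described above, here is a minimal standard-library sketch. It is not the actual kedro-datasets code: `save_rows` and `line_count_after_two_saves` are hypothetical names, and the sketch assumes the save step first serializes into an in-memory buffer and then opens the target file with a fixed mode.

```python
import csv
import io
import os
import tempfile


def save_rows(path, rows, file_mode):
    """Hypothetical stand-in for a dataset's save step (not the real implementation)."""
    # Serialize into an in-memory buffer first. Any "append" option given to the
    # serializer would only affect this buffer, not the file on disk.
    text_buf = io.StringIO()
    csv.writer(text_buf, lineterminator="\n").writerows(rows)
    # The mode used here decides whether the existing file is truncated.
    with open(path, file_mode) as f:
        f.write(text_buf.getvalue().encode("utf-8"))


def line_count_after_two_saves(file_mode):
    """Save one row twice and count how many lines end up in the file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "new_data.csv")
        save_rows(path, [[150, 5.9, 3.0]], file_mode)
        save_rows(path, [[151, 5.9, 3.0]], file_mode)
        with open(path, "rb") as f:
            return len(f.read().splitlines())


overwritten = line_count_after_two_saves("wb")  # fixed write mode truncates each time
appended = line_count_after_two_saves("ab")     # append mode keeps earlier rows
print(overwritten, appended)
```

With `"wb"` only the row from the second save remains, which matches the silent overwrite reported in this issue; with `"ab"` both rows are kept.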

This is part of the reason why using generators is hard.
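A rough sketch of why this matters for generators, assuming a node that yields its output in chunks and a runner that saves each chunk as it arrives. The names `chunked_node`, `run_and_save` and `saved_rows` are hypothetical and only use the standard library; they are not Kedro APIs.

```python
import os
import tempfile


def chunked_node(rows, chunk_size):
    """Generator node: yields the output a few rows at a time."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def run_and_save(path, chunks, file_mode):
    """Hypothetical runner loop: one save call per yielded chunk."""
    for chunk in chunks:
        with open(path, file_mode) as f:
            f.write("".join(f"{row}\n" for row in chunk).encode("utf-8"))


def saved_rows(file_mode):
    """Run the generator node and return the rows found in the saved file."""
    rows = list(range(5))
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.csv")
        run_and_save(path, chunked_node(rows, 2), file_mode)
        with open(path, "rb") as f:
            return f.read().decode("utf-8").split()


last_chunk_only = saved_rows("wb")  # each save truncates, so only the final chunk survives
all_rows = saved_rows("ab")         # each save appends, so every chunk is kept
print(last_chunk_only, all_rows)
```

If every save truncates the file, a chunked output ends up containing only its last chunk, so a generator-based node needs a dataset whose save step really appends.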

emilio-gagliardi commented 9 months ago

Thank you astrojuanlu for looking into this!