`mode: "a"` in `pandas.CSVDataSet` still overwrites the file

astrojuanlu commented 9 months ago

Description

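Setting `mode: a` in the `save_args` of a `pandas.CSVDataSet` does not append to the file; the file is overwritten on every save.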

Context

Originally reported in https://linen-slack.kedro.org/t/15705930/hi-everyone-i-have-an-easy-question-slightly-smiling-face-wh#b644730d-0683-4426-8e40-4e7ef96d8cc7

Steps to Reproduce

Starting from a pandas-iris, I tweaked the pipeline like this:

from kedro.pipeline import Pipeline, node, pipeline

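def add_data(df):
    # Take the last row of the input and give it the next index value,
    # so that each run produces one "new" row to append.
    new_data = df.iloc[len(df) - 1:]
    new_data.index = [new_data.index[0] + 1]
    return new_data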

def create_pipeline(**kwargs):
    return pipeline(
        [
            node(
                func=add_data,
                inputs=["example_iris_data"],
                outputs="new_data",
            ),
        ]
    )

And the catalog looks like this:

example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv

new_data:
  type: pandas.CSVDataSet
  filepath: data/03_primary/new_data.csv
  save_args:
    mode: a
    header: false

Expected Result

The new rows get appended to the file.

Actual Result

The file is silently overwritten every time.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

Kedro version used (pip show kedro or kedro -V): 0.18.13
Kedro plugin and kedro plugin version used (pip show kedro-airflow):
Python version used (python -V): 3.11
Operating system and version: macOS Ventura

noklam commented 9 months ago

https://github.com/kedro-org/kedro-plugins/issues/513

The root cause is that we hardcoded mode="wb". This is not done consistently, so we need to review all the datasets at once.
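
To illustrate the failure mode described above, here is a minimal stdlib-only sketch. It is not the kedro-datasets code: `save_rows`, `read_lines`, and the `file_mode` parameter are hypothetical names, and the sketch assumes the save step first serialises into an in-memory buffer and then writes that buffer to disk with a mode chosen independently of `save_args`.

```python
import csv
import io
import os
import tempfile


def save_rows(rows, path, save_args, file_mode="wb"):
    """Hypothetical stand-in for a dataset save step (not kedro-datasets code)."""
    # Step 1: serialise into a fresh in-memory buffer. A "mode" entry in
    # save_args has nothing to append to here, so it is effectively ignored.
    buf = io.StringIO(newline="")
    csv.writer(buf, lineterminator="\n").writerows(rows)
    # Step 2: write the buffer to disk with a mode that does not come from
    # save_args. With "wb" the target file is truncated before writing.
    with open(path, file_mode) as f:
        f.write(buf.getvalue().encode("utf-8"))


def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "new_data.csv")

    # Two saves with mode "a" requested, but the file is opened with "wb".
    save_rows([[1, "a"]], path, {"mode": "a"})
    save_rows([[2, "b"]], path, {"mode": "a"})
    overwritten = read_lines(path)

    # Same call, but the file itself is opened in append mode.
    save_rows([[3, "c"]], path, {"mode": "a"}, file_mode="ab")
    appended = read_lines(path)

print(overwritten)
print(appended)
```

In this sketch only the mode used at the file-open step decides whether earlier rows survive; the `mode` passed in `save_args` never reaches it.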

This is part of the reason why using generators is hard.

emilio-gagliardi commented 9 months ago

Thank you astrojuanlu for looking into this!
