
[KED-3028] Consider giving a more useful error when MemoryDataSet serialisation fails #1318

Closed: merelcht closed this issue 2 years ago

merelcht commented 2 years ago

User was trying to pass a list of Spark DataFrames to a MemoryDataSet and received a confusing maximum recursion depth exceeded error.

From Slack thread:

Deepyaman: FYI issue occurs because of serialization; minimal example:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
dfs = [spark.createDataFrame([{'name': 'Alice', 'age': age}]) for age in range(3)]
from kedro.io import MemoryDataSet
data_set = MemoryDataSet(data=dfs)  # raises: maximum recursion depth exceeded

To avoid this, you have to set copy_mode explicitly, because _infer_copy_mode doesn't look inside the list:

data_set = MemoryDataSet(data=dfs, copy_mode='assign')

You can also reproduce the recursion error just by doing copy.deepcopy(dfs[0]).
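
For reference, a minimal sketch of hitting the underlying error directly (assuming the dfs list from the example above):

    import copy

    # Spark DataFrames cannot be deep-copied, so this raises
    # RecursionError ("maximum recursion depth exceeded").
    try:
        copy.deepcopy(dfs[0])
    except RecursionError as exc:
        print(type(exc).__name__, exc)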

In MemoryDataSet we already fix this issue implicitly when a single DataFrame is returned:

    # excerpt from _infer_copy_mode
    if pd and isinstance(data, pd.DataFrame) or np and isinstance(data, np.ndarray):
        copy_mode = "copy"
    elif type(data).__name__ == "DataFrame":  # e.g. a single Spark DataFrame
        copy_mode = "assign"
    else:
        copy_mode = "deepcopy"

However, because the data here is a list (an iterable), we hit the else condition and fall back to deepcopy.

We could get into a game where we start type-checking the items inside iterables, but that doesn't scale to all cases, as the sketch below illustrates.
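
Purely for illustration (this is not proposed Kedro code, and the helper names are made up), per-item inference might look like this; it already misses dicts, generators and nested containers:

    def _infer_item_copy_mode(item):
        # Minimal stand-in for the logic quoted above (pandas/numpy branches omitted).
        return "assign" if type(item).__name__ == "DataFrame" else "deepcopy"

    def _infer_container_copy_mode(data):
        # Hypothetical per-item check, covering flat lists/tuples/sets only.
        if isinstance(data, (list, tuple, set)) and data:
            modes = {_infer_item_copy_mode(item) for item in data}
            if "assign" in modes:  # e.g. any Spark DataFrame in the list
                return "assign"
        return _infer_item_copy_mode(data)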

One option is to catch the RuntimeError and suggest explicitly declaring the assign copy mode.
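
A rough sketch of that option (illustrative only; the helper name, exception choice and message are assumptions, not Kedro's actual implementation):

    import copy
    from typing import Any

    def _copy_with_mode(data: Any, copy_mode: str) -> Any:
        # Illustrative copy helper that catches the failure and points at copy_mode="assign".
        try:
            if copy_mode == "deepcopy":
                return copy.deepcopy(data)
            if copy_mode == "copy":
                return data.copy()
            if copy_mode == "assign":
                return data
            raise ValueError(f"Invalid copy mode: {copy_mode}")
        except RecursionError as exc:  # RecursionError is a subclass of RuntimeError
            raise RuntimeError(
                "Deep-copying the data failed. If it contains objects that cannot be "
                "deep-copied (e.g. Spark DataFrames inside a list), create the dataset "
                "with MemoryDataSet(data=..., copy_mode='assign')."
            ) from exc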

merelcht commented 2 years ago

Hi @deepyaman, how often have you seen this issue happen?

deepyaman commented 2 years ago

@MerelTheisenQB Sorry for missing this. TBH I've only ever seen it occur in this one instance that @miroslav-kral raised.

merelcht commented 2 years ago

Thanks @deepyaman! In that case I'll close this issue for now, and we can always re-open it if it happens more frequently.

datajoely commented 2 years ago

Sorry, I've seen this quite a lot! I would love to see a better error message suggesting the different options available.