kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Make Partitioned Dataset Lazy Saving example more robust #3052


cverluiseQB commented 1 year ago

Partitioned Dataset Lazy Saving

Problem

When using partitioned datasets with lazy saving in Kedro, following the current documentation example, the key–value mapping comes out wrong: the same value is saved for every key.

Expected Behavior

Without lazy saving (i.e. returning values directly instead of wrapping them in lambdas), everything works as expected, with keys and values correctly associated.

Example

**Example dataset**

```shell
! mkdir ../data/01_raw/bug
! parallel 'touch ../data/01_raw/bug/{}.csv' ::: a b c d e f g h i j
! parallel 'echo "{}" >> ../data/01_raw/bug/{}.csv' ::: a b c d e f g h i j
```

**Example catalog**

```yml
input_files:
  type: PartitionedDataset
  path: ../data/01_raw/bug/
  dataset: pandas.CSVDataSet
  filename_suffix: .csv
  overwrite: True

output_files:
  type: PartitionedDataset
  path: ../data/02_intermediate/bug/
  dataset: pandas.CSVDataSet
  filename_suffix: .csv
  overwrite: True
```

**Example node**

```py
def copy_paste(input_loader):
    df = input_loader()
    return df


def copy_paste_node(input_files):
    return {k: lambda: copy_paste(v) for k, v in input_files.items()}
```

**Run node**

```py
from kedro.io import DataCatalog

# Create a data catalog from the configuration
conf_catalog = yaml.safe_load(catalog)
data_catalog = DataCatalog.from_config(conf_catalog)

# data input
input_files = data_catalog.load("input_files")

# function call
output_files = copy_paste_node(input_files)

# data output
data_catalog.save("output_files", output_files)
```

**Observe bug**

```py
from pathlib import Path

filepaths = Path("../data/02_intermediate/bug/").glob("*.csv")
for filepath in filepaths:
    print(filepath.name, filepath.read_text(), sep=": ")

# a.csv: j -> Expected a.csv: a
# c.csv: j -> Expected c.csv: c
# b.csv: j -> Expected b.csv: b
# f.csv: j -> ...
# g.csv: j
# e.csv: j
# d.csv: j
# i.csv: j
# h.csv: j
# j.csv: j
```

Fix

Bind the loop variable as a default argument of the lambda, so each lambda captures the value it was created with. This works around Python's late-binding closure behavior.

In our example, this means the following:

```py
def copy_paste_node(input_files):
    return {k: lambda v=v: copy_paste(v) for k, v in input_files.items()}

# `lambda:` becomes `lambda v=v:`
```
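An equivalent way to avoid the late-binding pitfall, if you prefer not to use the default-argument idiom, is to freeze the value with `functools.partial`. This is a standard-library alternative, not something the Kedro docs currently show; the plain callables below stand in for Kedro's lazy loaders:

```python
from functools import partial


def copy_paste(input_loader):
    # Call the lazy loader to materialise the partition
    return input_loader()


def copy_paste_node(input_files):
    # partial(copy_paste, v) binds the current value of v at creation
    # time, so each key gets its own loader instead of sharing the last one.
    return {k: partial(copy_paste, v) for k, v in input_files.items()}


# Minimal check with plain callables standing in for Kedro loaders
fake_inputs = {name: (lambda name=name: name.upper()) for name in "abc"}
outputs = copy_paste_node(fake_inputs)
print({k: fn() for k, fn in outputs.items()})  # {'a': 'A', 'b': 'B', 'c': 'C'}
```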

Environment

Resources

Slack discussion 🤗

noklam commented 1 year ago

Thank you for raising this issue, it's very well written. Our team will have a look shortly.

noklam commented 1 year ago

From my understanding, this is not a Kedro problem; it's how lambda variable scoping works in Python. See this example:

```py
In [7]: iterable = [lambda: print(x) for x in range(4)]
   ...:
   ...: for i in iterable:
   ...:     i()
   ...:
   ...: print("Assign the variable to lambda scope")
   ...:
   ...: iterable = [lambda x=x: print(x) for x in range(4)]
   ...:
   ...: for i in iterable:
   ...:     i()
3
3
3
3
Assign the variable to lambda scope
0
1
2
3
```

A StackOverflow thread on Python's late-binding closures explains this in more detail.
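The distinction can also be checked programmatically in plain Python, independent of Kedro (a minimal sketch of the two behaviors above):

```python
# Late binding: each lambda looks up `x` when called, after the loop
# has finished, so they all see the final value.
late = [lambda: x for x in range(4)]
assert [f() for f in late] == [3, 3, 3, 3]

# Default-argument binding: `x=x` evaluates `x` at definition time,
# so each lambda keeps its own value.
bound = [lambda x=x: x for x in range(4)]
assert [f() for f in bound] == [0, 1, 2, 3]
```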

cverluiseQB commented 1 year ago

In that case, I think this is mainly a matter of documenting the right approach in the docs. WDYT?

noklam commented 1 year ago

I have marked this as a documentation effort.

I suggest that whoever picks up this ticket adds a note warning about lambda scoping, checks whether we can improve our docs example, and maybe adds this to a FAQ.

stichbury commented 1 year ago

This is something to tackle as part of #2941