cverluiseQB opened 1 year ago
Thank you for raising this issue, it's very well written. Our team will have a look shortly.
From my understanding, this is not a Kedro problem. It's how lambda variable scoping works in Python. See this example:
```py
In [7]: iterable = [lambda: print(x) for x in range(4)]
   ...:
   ...: for i in iterable:
   ...:     i()
   ...:
   ...: print("Assign the variable to lambda scope")
   ...:
   ...: iterable = [lambda x=x: print(x) for x in range(4)]
   ...:
   ...: for i in iterable:
   ...:     i()
   ...:
3
3
3
3
Assign the variable to lambda scope
0
1
2
3
```
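The `x=x` version is the default-argument trick. An alternative not used in this comment is `functools.partial`, which binds the argument at the moment the callable is created. A minimal sketch:

```py
from functools import partial

# partial(print, x) freezes the current value of x when each callable is
# created, so every callable prints its own number rather than the last one.
iterable = [partial(print, x) for x in range(4)]

for f in iterable:
    f()  # prints 0, 1, 2, 3
```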
There is a well-known StackOverflow thread about lambda closure scoping that explains this in more detail.
In that case, I think this is mainly a matter of documenting the right approach in the docs. wdyt?
I have marked this as a documentation effort.
I suggest that whoever picks up this ticket adds a note section warning about lambda scoping, and checks whether we can improve our docs example, maybe adding this to a FAQ.
This is something to tackle as part of #2941
Partitioned Dataset Lazy Saving
Problem
When using partitioned datasets with lazy saving in Kedro, following the current documentation example, the key-value mapping is broken: the same value is saved for every key.
Expected Behavior
Without lazy saving (i.e. returning the data directly instead of a lambda), everything works as expected, with keys and values correctly associated.
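As a sketch of what the working eager version looks like (the names mirror the node in the example below; `copy_paste_node_eager` is a hypothetical name):

```py
def copy_paste(input_loader):
    # input_loader is the per-partition load function that PartitionedDataset provides
    return input_loader()


def copy_paste_node_eager(input_files):
    # Each partition is loaded immediately, so no closures are created and
    # every key is paired with the data actually loaded for it.
    return {k: copy_paste(v) for k, v in input_files.items()}
```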
Example
**Example dataset**

```shell
! mkdir ../data/01_raw/bug
! parallel 'touch ../data/01_raw/bug/{}.csv' ::: a b c d e f g h i j
! parallel 'echo "{}" >> ../data/01_raw/bug/{}.csv' ::: a b c d e f g h i j
```

**Example catalog**

```yml
input_files:
  type: PartitionedDataset
  path: ../data/01_raw/bug/
  dataset: pandas.CSVDataSet
  filename_suffix: .csv
  overwrite: True

output_files:
  type: PartitionedDataset
  path: ../data/02_intermediate/bug/
  dataset: pandas.CSVDataSet
  filename_suffix: .csv
  overwrite: True
```

**Example node**

```py
def copy_paste(input_loader):
    df = input_loader()
    return df


def copy_paste_node(input_files):
    return {k: lambda: copy_paste(v) for k, v in input_files.items()}
```

**Run node**

```py
import yaml

from kedro.io import DataCatalog

# Create a data catalog from the configuration
# (catalog is the YAML string shown in the example catalog above)
conf_catalog = yaml.safe_load(catalog)
data_catalog = DataCatalog.from_config(conf_catalog)

# data input
input_files = data_catalog.load("input_files")

# function call
output_files = copy_paste_node(input_files)

# data output
data_catalog.save("output_files", output_files)
```

**Observe bug**

```py
from pathlib import Path

filepaths = Path("../data/02_intermediate/bug/").glob("*.csv")
for filepath in filepaths:
    print(filepath.name, filepath.read_text(), sep=": ")

# a.csv: j  -> Expected a.csv: a
# c.csv: j  -> Expected c.csv: c
# b.csv: j  -> Expected b.csv: b
# f.csv: j  -> ...
# g.csv: j
# e.csv: j
# d.csv: j
# i.csv: j
# h.csv: j
# j.csv: j
```

Fix
Bind the loop variable as a default argument of the lambda so that each lambda captures the value in scope when it is created; this addresses the lambda scoping issue.
In our example, this means binding `v` as a default argument of the lambda inside `copy_paste_node`.
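A minimal sketch of the corrected node, using the same names as in the example above:

```py
def copy_paste_node(input_files):
    # v=v evaluates v when each lambda is defined, giving every lambda its own
    # bound loader instead of a shared reference to the loop variable.
    return {k: lambda v=v: copy_paste(v) for k, v in input_files.items()}
```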
Environment
Resources
Slack discussion 🤗