Closed melvinkokxw closed 3 months ago
Here is a temporary fix: create all the catalog entries required before the pipeline starts running.
```python
from kedro.framework.hooks import hook_impl


class ResolveDatasetsHooks:
    @hook_impl
    def before_pipeline_run(self, pipeline, catalog):
        # Resolve every dataset the pipeline touches up front, so no two
        # threads try to register the same factory entry concurrently.
        data_sets = set()
        for node in pipeline.nodes:
            data_sets.update(node.outputs)
            data_sets.update(node.inputs)
        for ds in data_sets:
            catalog._get_dataset(ds)
```
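For context on why pre-registering helps: with `ThreadRunner`, several threads can hit the catalog's lazy check-then-register path for the same factory pattern at once. Below is a minimal, Kedro-free sketch of that pattern; the `Catalog` class here is a hypothetical stand-in, not Kedro's actual API:

```python
import threading


class AlreadyExistsError(Exception):
    pass


class Catalog:
    """Toy stand-in for a lazily populated catalog (not Kedro's API)."""

    def __init__(self):
        self._entries = {}

    def get(self, name):
        # Check-then-add without a lock: two threads can both pass the
        # membership test before either inserts, so the loser raises.
        if name not in self._entries:
            self._register(name)
        return self._entries[name]

    def _register(self, name):
        if name in self._entries:
            raise AlreadyExistsError(name)
        self._entries[name] = object()


catalog = Catalog()
errors = []


def worker():
    try:
        catalog.get("input_df")
    except AlreadyExistsError as exc:
        errors.append(exc)


# Resolving the entry up front (what the hook does) removes the race:
# every thread then finds the entry already present.
catalog.get("input_df")
threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(errors))  # 0
```

Without the up-front `catalog.get("input_df")` call, the membership test and the insert are not atomic, which is exactly the window the workaround closes.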
Hey @melvinkokxw, I love that you provided a clean script instead of a scaffold project; it's very easy for me to run this, appreciate your effort a lot ✨!
I suspect this is related to:
Can you try to change `{name}` -> `{abc}`? I tried changing the runner to `SequentialRunner`, which still fails, so maybe there is something wrong in the script. After I changed `{name}`, I got a different error message, and that may solve your problem already.
I managed to run this successfully, so it is more of a problem with your script. Was it copied from an old version of Kedro? Can you explain a little about what you are trying to do? Maybe that will give us more context to come up with a better solution.
There are a few problems:

- The `type` key is not indented properly in the catalog YAML.
- The script is missing an `if __name__ == '__main__'` guard, which is a Python requirement when dealing with multi-processing/threading.
- `input_df` doesn't exist, and you cannot use a default dataset with no data, so I monkeypatched `MemoryDataset._load` with a `lambda` function to bypass the MemoryDataset check.

```python
import yaml

from kedro.io import DataCatalog
from kedro.io.memory_dataset import MemoryDataset
from kedro.pipeline import Pipeline, node
from kedro.runner import ThreadRunner
from kedro.runner.parallel_runner import ParallelRunner
from kedro.runner.sequential_runner import SequentialRunner

if __name__ == "__main__":
    catalog_yml = """
"{name}":
  type: MemoryDataset
"""

    # Bypass the MemoryDataset check for the non-existent `input_df`.
    MemoryDataset._load = lambda x: print("lambda!")

    catalog = yaml.safe_load(catalog_yml)
    io = DataCatalog.from_config(catalog)

    def return_dataframe(input_df):
        return "return!"

    pipeline = Pipeline(
        [
            node(
                func=return_dataframe, inputs="input_df", outputs="output_df1", name="node1"
            ),
            node(
                func=return_dataframe, inputs="input_df", outputs="output_df2", name="node2"
            ),
        ]
    )

    runner = ThreadRunner()
    runner.run(pipeline, io)
```
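On the first point above (`type` indentation): in catalog YAML, a dataset factory's keys must be nested under the quoted pattern, or the pattern maps to nothing. A quick illustration with `yaml.safe_load` (requires PyYAML; the values are only illustrative):

```python
import yaml

# Correct: `type` is indented, so it becomes a key inside the "{name}" mapping.
good = yaml.safe_load('''
"{name}":
  type: MemoryDataset
''')
print(good)  # {'{name}': {'type': 'MemoryDataset'}}

# Incorrect: without indentation, "{name}" maps to null and `type`
# becomes an unrelated top-level key, which Kedro cannot use as a dataset.
bad = yaml.safe_load('''
"{name}":
type: MemoryDataset
''')
print(bad)  # {'{name}': None, 'type': 'MemoryDataset'}
```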
I am closing this issue due to inactivity. I tried to reproduce this last time and it worked as expected. Please reopen with a valid example.
Description

Using `ThreadRunner` with dataset factories leads to a `DatasetAlreadyExistsError`.

Context

I have a pipeline with two nodes that use the same input, and both inputs should be loaded using dataset factories. When using `ThreadRunner` with my pipeline, `kedro` throws a `DatasetAlreadyExistsError`.

Steps to Reproduce

Here is a minimal reproducible example:

Expected Result

Pipeline should run successfully with no errors.

Actual Result

Full error logs here:

```
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in
```

Your Environment

- Kedro version used (`pip show kedro` or `kedro -V`): 0.18.14, also reproducible on 0.19.3
- Python version used (`python -V`): 3.9.18