But ideally I would like to use the method context.run directly to benefit from functionality such as hooks.
Have you tried passing extra_params to the context? See https://kedro.readthedocs.io/en/latest/kedro.framework.context.KedroContext.html. This is what happens if you use kedro run --params ...
Just tried with the extra_params argument:
context = load_context(path_to_project, extra_params={"my_param": "param_from_extra"})
context.run(pipeline_name="my_pipeline")
Output: "param_from_extra" => It works :)
That solves the problem for parameters, but how could I do that for a DataFrame? I can pass a DataFrame as extra_params, but the original value in parameters.yml can't be a DataFrame oO
I'm not 100% sure I've understood your use case, but I believe adding it to the catalog with replace=True is the correct approach. If you just want this on run, you can do the replacements/additions in the before_pipeline_run hook (I do this in https://github.com/deepyaman/kedro-accelerator/blob/develop/src/kedro_accelerator/plugins/__init__.py myself). Would that satisfy your use case? If I'm misunderstanding your need, please let me know.
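A minimal sketch of that suggestion, assuming Kedro's pluggy-style hook API (the class name RuntimeOverrideHooks and the override mapping are illustrative, not taken from kedro-accelerator):

from kedro.framework.hooks import hook_impl
from kedro.io import MemoryDataSet


class RuntimeOverrideHooks:
    """Inject runtime data by replacing catalog entries before the run."""

    def __init__(self, overrides):
        # overrides: mapping of dataset name -> in-memory object to inject
        self._overrides = overrides

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        for name, data in self._overrides.items():
            # replace=True swaps out an entry already defined in catalog.yml
            catalog.add(name, MemoryDataSet(data=data), replace=True)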
Thanks for the reply, your solution works well to run pipeline in batch mode.
In my use-case I would like to run a Kedro pipeline just like I execute a Python function, using dynamic arguments/datasets (not known a priori, obtained at runtime).
I managed to do that using the Code API; however, I have the following questions:
I solved it by replacing the dataset in the io attribute using catalog.add:
from kedro.framework.context import load_context
from kedro.io import MemoryDataSet
from kedro.runner import SequentialRunner

context = load_context(path_to_project)
dataset = MemoryDataSet(data="param_from_code")
my_catalog = context.io
my_catalog.add("params:my_param", dataset, replace=True)
print(my_catalog.load("params:my_param"))
# Run the pipeline with the patched catalog
my_pipeline = ...
SequentialRunner().run(my_pipeline, my_catalog)
- If I use SequentialRunner().run(my_pipeline, my_catalog) then I don't have access to hooks.
- If I use context.run(pipeline_name) I do have hooks, but I only manage to load the static datasets (the ones defined in the yaml files). See Case 1 in my original post.

I hope this makes my use-case clearer.
For datasets, can you use this solution + the before_pipeline_run hook to replace catalog entries?
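If it helps, a hedged sketch of wiring such a hook into the project (Kedro 0.16.x registers hooks on the project context in run.py; RuntimeOverrideHooks is the illustrative class sketched above, and the project metadata is made up):

from kedro.framework.context import KedroContext


class ProjectContext(KedroContext):
    project_name = "my-project"    # illustrative
    project_version = "0.16.5"     # illustrative
    # Hooks registered here fire on every context.run() call
    hooks = (RuntimeOverrideHooks({"params:my_param": "param_from_code"}),)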
Also, as a disclaimer, I'm not a core Kedro team member and may be missing a simple solution somebody more familiar with the framework has. :)
Thanks for your ideas, I'm moving forward step by step ^^
My goal is to have a pipeline that works both in batch and online mode. On top of that, I want my pipeline to run all the hooks. Ex: I set some hook attributes in the before_pipeline_run hook and reuse those attributes in the after_node_run hook; if before_pipeline_run is not executed, my pipeline fails.
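A minimal sketch of that stateful-hook pattern, assuming the pluggy-based hook API (the class and attribute names are illustrative, not from the thread):

from kedro.framework.hooks import hook_impl


class StatefulHooks:
    """Illustrative: state created in before_pipeline_run is required later."""

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # Attribute set once at the start of the run...
        self._run_label = run_params.get("pipeline_name") or "default"

    @hook_impl
    def after_node_run(self, node, inputs, outputs):
        # ...and reused after every node; this raises AttributeError if
        # before_pipeline_run was never executed.
        print(f"[{self._run_label}] finished node: {node.name}")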
My solution basically mimics what context.run does, but uses a newly defined DataCatalog that takes datasets from both 1) function arguments and 2) datasets defined in the catalog.yml files. This is fine, but being able to use context.run directly would make my code more resilient to Kedro updates (and less hacky).
from kedro.framework.hooks import get_hook_manager
from kedro.io import DataCatalog, MemoryDataSet
from kedro.runner import ParallelRunner, SequentialRunner


def run_pipeline(pipeline_name, context, runner="sequential", arg_inputs=None):
    arg_inputs = arg_inputs or {}  # guard: allow calls without runtime inputs
    catalog = DataCatalog()
    yaml_catalog = context.io
    pipeline = context._get_pipeline(name=pipeline_name)
    hook_manager = get_hook_manager()
    pipeline_inputs = pipeline.inputs()
    if runner == "parallel":
        runner = ParallelRunner()
    else:
        runner = SequentialRunner()
    # Resolve every pipeline input either from the runtime arguments or from
    # the datasets declared in the yaml catalog.
    for input_name in pipeline_inputs:
        copy_mode = None
        if input_name in arg_inputs:
            input_data = arg_inputs[input_name]
            # SQLAlchemy engines can't be (deep)copied; "assign" mode stores
            # the object by reference instead.
            if type(input_data).__name__ == "Engine":
                copy_mode = "assign"
        elif input_name in yaml_catalog.list():
            input_data = yaml_catalog.load(input_name)
        else:
            raise AttributeError(f"Missing input {input_name}")
        dataset = MemoryDataSet(data=input_data, copy_mode=copy_mode)
        catalog.add(input_name, dataset, replace=True)
    run_params = {
        "run_id": None,
        "env": context.env,
        "kedro_version": context.project_version,
        "tags": None,
        "from_nodes": None,
        "to_nodes": None,
        "node_names": None,
        "from_inputs": None,
        "load_versions": None,
        "pipeline_name": pipeline_name,
        "extra_params": context._extra_params,
    }
    hook_manager.hook.before_pipeline_run(
        run_params=run_params, pipeline=pipeline, catalog=catalog
    )
    try:
        run_result = runner.run(pipeline, catalog)
    except Exception as exc:
        hook_manager.hook.on_pipeline_error(  # pylint: disable=no-member
            error=exc,
            run_params=run_params,
            pipeline=pipeline,
            catalog=catalog,
        )
        raise exc
    hook_manager.hook.after_pipeline_run(  # pylint: disable=no-member
        run_params=run_params,
        run_result=run_result,
        pipeline=pipeline,
        catalog=catalog,
    )
    return run_result
I created a new catalog rather than replacing datasets in the context.io catalog because I needed my output datasets to be free outputs in order to be returned by the run_pipeline function.
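A hypothetical usage of the helper above (the dataset name "input_features" and the DataFrame contents are made up): inject a runtime DataFrame and collect the free outputs returned by the runner.

import pandas as pd
from kedro.framework.context import load_context

context = load_context(path_to_project)  # path_to_project as in the snippets above
runtime_df = pd.DataFrame({"x": [1, 2, 3]})  # illustrative runtime data
outputs = run_pipeline(
    "my_pipeline",
    context,
    runner="sequential",
    arg_inputs={"input_features": runtime_df},  # "input_features" is illustrative
)
print(outputs)  # free outputs of the pipeline, as returned by the runner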
On a different topic, the copy_mode = "assign" is a trick to pass SQLAlchemy engine objects to Kedro pipelines. However, it doesn't seem to work with the ParallelRunner; is there any recommended way to do that?
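For context, a minimal sketch of the trick (the SQLite URL is illustrative): MemoryDataSet falls back to deep-copying arbitrary objects, which fails for an engine, while copy_mode="assign" stores and returns the same object by reference.

from kedro.io import MemoryDataSet
from sqlalchemy import create_engine

engine = create_engine("sqlite:///example.db")  # illustrative connection URL
# The default copy mode would try to (deep)copy the Engine and fail;
# "assign" hands back the very same object on load.
dataset = MemoryDataSet(data=engine, copy_mode="assign")
assert dataset.load() is engine

If the ParallelRunner failure is because it ships datasets to worker processes and an Engine is not picklable, creating the engine inside each node may sidestep it, but that is an assumption on my part, not a confirmed answer from the thread.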
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello!
Description
I have a parameter in a yaml file: my_param: param_from_yml
Case 1: Trying to replace a dataset of the context.io catalog:
Output: "param_from_yml". The replacement did not occur even though the logs say otherwise.
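The Case 1 snippet itself did not survive extraction; a hedged reconstruction from the surrounding description and the Case 2 snippet (not necessarily the author's exact code):

from kedro.framework.context import load_context
from kedro.io import MemoryDataSet

context = load_context(path_to_project)
dataset = MemoryDataSet(data="param_from_code")
# context.io appears to rebuild the catalog on each access, so this add
# lands on a throwaway object and context.run still reads "param_from_yml"
# from the yaml config.
context.io.add("params:my_param", dataset, replace=True)
context.run(pipeline_name="my_pipeline")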
Case 2: Trying to replace a dataset in a "copy" of (a reference to?) the context.io catalog:
Output: "param_from_code"
Context
I want to use the Code API to run a pipeline that combines: 1) datasets/parameters obtained dynamically at runtime and 2) datasets defined in the catalog.yml files.
I can do that using the code from "Case 2" above as follows:
But ideally I would like to use the method context.run directly to benefit from functionality such as hooks. For that purpose I need parameters/datasets values to be modified in context.io. Is there a way to do that?