run_after_node does not check the result after node

noklam commented 3 years ago

I have a function

def train_test_split():
    return train_x, train_y, val_x, val_y

When I use run_after_node=True, it will complain about "No such file or directory". Which means the expectation suite is actually executed before the file is output. It only works if the file exist before I do kedro run. Which mean it would be always checking the old version of data instead of the latest pipeline run.

I would assume expection suites can check against these variable directly through callbacks, instead of reading them again from file.

noklam commented 3 years ago

I dig into the source code and have more information now.

So Kedro Hook would actually have direct access of the inputs/outputs. We can learn about it from the NodeSpec

def after_node_run(node, catalog, inputs, outputs, is_async, run_id):
    ...

The bug is caused by wrong exeuction order.

Node exeuction
after_node_run() callback is triggered
after_node_run() try to access dataset from DataCatalog # It should have just access the variable, or it should only read after the file is updated
Kedro write data to files according to DataCatalog.

This can confirm by runner.py

def _run_node_sequential(node: Node, catalog: DataCatalog, run_id: str = None) -> Node:
    try:
        outputs = node.run(inputs)
    except Exception as error:
        hook_manager.hook.on_node_error(  # pylint: disable=no-member
            error=error,
            node=node,
            catalog=catalog,
            inputs=inputs,
            is_async=is_async,
            run_id=run_id,
        )
        raise error
    hook_manager.hook.after_node_run(  # pylint: disable=no-member
        node=node,
        catalog=catalog,
        inputs=inputs,
        outputs=outputs,
        is_async=is_async,
        run_id=run_id,
    )

    for name, data in outputs.items():
        catalog.save(name, data)
    return node

Kedro Hook after_node_run() is designed to be execute before writing out the file, I think it should just read from the callback argument outputs instead of reading from DataCatalog in this case.

This probably will not impact before_node_run, but only after_node_run

@tamsanh

tamsanh commented 3 years ago

Great work! Let me read through the pull request.

tamsanh commented 3 years ago

Excellent work. Let me know if the problem still exists.

tamsanh commented 3 years ago

I've published it under version 0.2.7

noklam commented 3 years ago

Thank you, the problem is resolved now.

tamsanh / kedro-great

run_after_node does not check the result after node #10