tamsanh / kedro-great

The easiest way to integrate Kedro and Great Expectations
MIT License
52 stars 14 forks source link

run_after_node does not check the result after node #10

Closed noklam closed 3 years ago

noklam commented 3 years ago

I have a function

def train_test_split():
    return train_x, train_y, val_x, val_y

When I use run_after_node=True, it will complain about "No such file or directory". Which means the expectation suite is actually executed before the file is output. It only works if the file exist before I do kedro run. Which mean it would be always checking the old version of data instead of the latest pipeline run.

I would assume expection suites can check against these variable directly through callbacks, instead of reading them again from file.

noklam commented 3 years ago

I dig into the source code and have more information now.

So Kedro Hook would actually have direct access of the inputs/outputs. We can learn about it from the NodeSpec

def after_node_run(node, catalog, inputs, outputs, is_async, run_id):
    ...

The bug is caused by wrong exeuction order.

  1. Node exeuction
  2. after_node_run() callback is triggered
  3. after_node_run() try to access dataset from DataCatalog # It should have just access the variable, or it should only read after the file is updated
  4. Kedro write data to files according to DataCatalog.

This can confirm by runner.py

def _run_node_sequential(node: Node, catalog: DataCatalog, run_id: str = None) -> Node:
    try:
        outputs = node.run(inputs)
    except Exception as error:
        hook_manager.hook.on_node_error(  # pylint: disable=no-member
            error=error,
            node=node,
            catalog=catalog,
            inputs=inputs,
            is_async=is_async,
            run_id=run_id,
        )
        raise error
    hook_manager.hook.after_node_run(  # pylint: disable=no-member
        node=node,
        catalog=catalog,
        inputs=inputs,
        outputs=outputs,
        is_async=is_async,
        run_id=run_id,
    )

    for name, data in outputs.items():
        catalog.save(name, data)
    return node

Kedro Hook after_node_run() is designed to be execute before writing out the file, I think it should just read from the callback argument outputs instead of reading from DataCatalog in this case.

This probably will not impact before_node_run, but only after_node_run

@tamsanh

tamsanh commented 3 years ago

Great work! Let me read through the pull request.

tamsanh commented 3 years ago

Excellent work. Let me know if the problem still exists.

tamsanh commented 3 years ago

I've published it under version 0.2.7

noklam commented 3 years ago

Thank you, the problem is resolved now.