pydiverse / pydiverse.pipedag

A data pipeline orchestration library for rapid iterative development with automatic cache invalidation, allowing users to focus on writing their tasks in pandas, polars, sqlalchemy, ibis, and the like.
https://pydiversepipedag.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

bug(): Complex tasks using imperative materialization break caching #203

Open DominikZuercherQC opened 2 weeks ago

DominikZuercherQC commented 2 weeks ago

@windiana42 My expectation is that a task using imperative materialization should remain cache-valid as long as its inputs did not change. In complex cases this can break.

This PR serves as a minimal example: test_materialize.py::test_imperative_minimal_example

windiana42 commented 1 week ago

I can confirm the expectation. I need to debug to find the reason.

windiana42 commented 1 week ago

The fix will take a bit longer because I am first implementing some groundwork for finding similar caching issues. The cause already seems to be found: the order of Table.assumed_dependencies differs between runs:

    [<Table '_temp_res2_m3qtk77lrs6vpirivro5_0001' (stage)>, <Table '_temp_res2_m3qtk77lrs6vpirivro5_0000' (stage)>]

vs.

    [<Table '_temp_res2_m3qtk77lrs6vpirivro5_0000' (stage)>, <Table '_temp_res2_m3qtk77lrs6vpirivro5_0001' (stage)>]

The problem does not always surface, which means that hashing that is not stable across runs is probably to blame.
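
To illustrate why the order of a dependency list matters for cache validity, here is a minimal sketch. It is not pipedag's actual implementation: `cache_key` and `stable_cache_key` are hypothetical helpers, and the table names are shortened stand-ins for the ones shown above. It assumes the cache key is derived by hashing the dependency names in list order. One known source of run-dependent ordering in Python is that the built-in `hash()` of strings is salted per process unless `PYTHONHASHSEED` is fixed, so iterating over a set of strings can yield a different order in each run.

```python
import hashlib


def cache_key(dependency_names):
    """Hypothetical order-sensitive key: hashes names in the order given."""
    joined = "|".join(dependency_names)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def stable_cache_key(dependency_names):
    """Hypothetical order-insensitive key: sorts names before hashing."""
    return cache_key(sorted(dependency_names))


# Same two dependencies, observed in a different order in two runs
# (shortened stand-ins for the table names in the issue).
deps_run_a = ["_temp_res2_0001", "_temp_res2_0000"]
deps_run_b = ["_temp_res2_0000", "_temp_res2_0001"]

# The order-sensitive key changes although the inputs are identical,
# which a cache would interpret as "inputs changed".
keys_differ = cache_key(deps_run_a) != cache_key(deps_run_b)

# Sorting first yields the same key regardless of the observed order.
stable_keys_match = stable_cache_key(deps_run_a) == stable_cache_key(deps_run_b)

print(keys_differ, stable_keys_match)
```

Under these assumptions, sorting (or otherwise canonicalizing) the dependency list before hashing would keep the key identical across runs. Whether that is the right fix here depends on how the dependencies are actually collected and hashed.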