pydiverse / pydiverse.pipedag

A data pipeline orchestration library for rapid iterative development with automatic cache invalidation allowing users to focus writing their tasks in pandas, polars, sqlalchemy, ibis, and alike.
https://pydiversepipedag.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
19 stars 3 forks source link

Retry of producing a stage output #167

Open windiana42 opened 6 months ago

windiana42 commented 6 months ago

We currently can only retrieve results from cache for tables that managed to get through a stage commit (i.e. schema swapping procedure). However, in case of really large tables, it might be nice to just continue getting a stage output produced where it stopped in a previous crash. We do cache invalidation on table level. So it would be possible to track that. The tricky aspect would be that we basically need to validate that everything in the transaction schema is cache valid up to one point and then continue from there.

There is a workaround for this problem. One can wrap every table with its own nested Stage. This actually achieves pretty much the desired result as described above. However, in the end, it would be nice if all tables were still located within one schema. The other downside of the workaround is that in fact the schema swapping of the surrounding Stage is made useless since each table has a separate stage commit and overwrites its cache one by one.