pydiverse / pydiverse.pipedag

A data pipeline orchestration library for rapid iterative development with automatic cache invalidation, allowing users to focus on writing their tasks in pandas, polars, sqlalchemy, ibis, and the like.
https://pydiversepipedag.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Allow running tasks on active schema (Big DANGER mode disclaimer) #192

Open windiana42 opened 5 months ago

windiana42 commented 5 months ago

Schema swapping sometimes causes a lot of friction. Suppose a stage has produced several large tables and we only want to replace one of them. Unless the stage is 100% cache valid, even the cache-valid tables have to be copied to the temp schema. #167 already addresses a related problem: currently, work done in the temp schema is lost if the stage code needs to be adjusted before the schemas can be swapped.

The idea in this issue is that a flow run could be configured to work directly in the active schema. For example, flow.run([task], in_active=True) could produce a table next to the target table and then swap the tables by renaming (similar to the *__copy technique already implemented). This feature should emit a big warning, since the transactional properties of the stage will no longer hold. @input_stage_versions tasks will also have trouble coping with this run mode.
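
To make the rename-based swap concrete, here is a minimal sketch using plain SQLite from the Python standard library. It is not pipedag code and does not use pipedag's API: the helper name `swap_table_by_rename` and the `__new` / `__old` suffixes are made up for illustration, and the sketch assumes the target table already exists.

```python
import sqlite3
import warnings


def swap_table_by_rename(conn, target, build_sql):
    """Hypothetical helper: build `<target>__new` beside `target`, then swap by renaming."""
    warnings.warn(
        "Working directly in the active schema: "
        "stage-level transactional guarantees do not apply.",
        stacklevel=2,
    )
    new, old = f"{target}__new", f"{target}__old"

    # Build the replacement table next to the live one.
    conn.execute(f'DROP TABLE IF EXISTS "{new}"')
    conn.execute(f'CREATE TABLE "{new}" AS {build_sql}')

    # Swap by renaming, then drop the previous version.
    conn.execute(f'DROP TABLE IF EXISTS "{old}"')
    conn.execute(f'ALTER TABLE "{target}" RENAME TO "{old}"')
    conn.execute(f'ALTER TABLE "{new}" RENAME TO "{target}"')
    conn.execute(f'DROP TABLE "{old}"')
    conn.commit()


# Usage: replace only `result`, leaving other tables in the schema untouched.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (x INTEGER)")
conn.executemany("INSERT INTO src VALUES (?)", [(1,), (2,), (3,)])
conn.execute("CREATE TABLE result AS SELECT x FROM src")

swap_table_by_rename(conn, "result", "SELECT x * 10 AS x FROM src")
rows = conn.execute("SELECT x FROM result ORDER BY x").fetchall()
```

Only the one table is rebuilt, so nothing else in the schema needs to be copied. The cost is the one named above: a reader that queries between the two renames may briefly find no `result` table, which is why such a mode would need the warning.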