pydiverse / pydiverse.pipedag

A data pipeline orchestration library for rapid iterative development with automatic cache invalidation, allowing users to focus on writing their tasks in pandas, polars, sqlalchemy, ibis, and the like.
https://pydiversepipedag.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Mixed per-user and team-shared pipeline runs #199

Open windiana42 opened 3 months ago

windiana42 commented 3 months ago

Sometimes a pipeline uses very large inputs in its first stage, which makes it slow to run and costly in disk space. It would still be nice if trying out new code on single tasks or stages were fast. Pipedag already supports running just single tasks or stages. When running a single task, a user can already work in the temporary schema without ever triggering a schema swap. However, sometimes it would be nice to also commit a stage "per-user" and then run tasks whose inputs are a mixture of per-user and team-shared inputs.

This issue is about implementing a mixed per-user/team-shared mode. In this mode, inputs to the subgraph being run would be fetched from the per-user version where they exist and from the team-shared version otherwise. Temporary schemas and committed stage schemas should always reside per-user, so it is mostly dematerialization that would have to be adapted.
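To make the intended lookup order concrete, here is a minimal sketch of the fallback rule. It does not use pipedag's API: `InstanceStore` and `resolve_input` are hypothetical names invented for illustration, and plain dictionaries stand in for committed stage schemas.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceStore:
    """Hypothetical stand-in for the committed tables of one pipeline instance."""

    name: str
    tables: dict = field(default_factory=dict)  # (stage, table) -> data

    def has(self, stage, table):
        return (stage, table) in self.tables

    def read(self, stage, table):
        return self.tables[(stage, table)]


def resolve_input(stage, table, per_user, team_shared):
    """Return (source name, data), preferring the per-user version of an input.

    The team-shared store is only read from, never written to.
    """
    for store in (per_user, team_shared):
        if store.has(stage, table):
            return store.name, store.read(stage, table)
    raise KeyError(f"{stage}.{table} exists in neither instance")


# The user has re-committed only the "clean" stage; "raw" exists only team-shared.
team = InstanceStore("team", {("raw", "orders"): [1, 2, 3], ("clean", "orders"): [1, 2]})
user = InstanceStore("user", {("clean", "orders"): [1]})

raw_source, raw_data = resolve_input("raw", "orders", user, team)
clean_source, clean_data = resolve_input("clean", "orders", user, team)
```

In this sketch the large `raw` input comes from the team-shared instance while `clean` comes from the user's own commit, which is the mixture described above. All outputs would still be written per-user.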

Options:

  1. An advanced version of this idea could even run cache-invalidation checks against the team-shared instance, provided some protection mechanism prevents overwriting data there.
  2. This issue could interact with #167: information could be updated table by table in the per-user temp schema over multiple runs.
  3. It is even conceivable to allow mixed execution across two arbitrary pipeline instance configurations. The dominant use will probably still be per-user / team-shared instances of the same instance_id.
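One possible shape for the protection mechanism in option 1 is sketched below. This is only an illustration under stated assumptions, not pipedag's implementation: `ProtectedStore`, `ReadOnlyError`, and `run_task` are hypothetical names, and a plain string stands in for whatever cache key the real cache-invalidation check would compare.

```python
class ReadOnlyError(RuntimeError):
    """Raised when code tries to modify the protected (team-shared) store."""


class ProtectedStore:
    """Hypothetical read-only view of a team-shared instance."""

    def __init__(self, tables, cache_keys):
        self._tables = dict(tables)          # table name -> data
        self._cache_keys = dict(cache_keys)  # table name -> key recorded at commit time

    def is_cache_valid(self, table, expected_key):
        # Pure read: compare the recorded key with the key the current code yields.
        return self._cache_keys.get(table) == expected_key

    def read(self, table):
        return self._tables[table]

    def write(self, table, data, cache_key):
        # Writes are rejected unconditionally, so shared data cannot be overwritten.
        raise ReadOnlyError(f"refusing to write {table!r} to the team-shared instance")


def run_task(table, expected_key, compute, shared, user_tables):
    """Reuse the shared result if its cache key matches, else recompute per-user."""
    if shared.is_cache_valid(table, expected_key):
        return shared.read(table)
    result = compute()
    user_tables[table] = result  # results always land in the per-user store
    return result


shared = ProtectedStore({"orders": [1, 2]}, {"orders": "abc"})
user_tables = {}

reused = run_task("orders", "abc", lambda: [9], shared, user_tables)
recomputed = run_task("orders", "xyz", lambda: [9], shared, user_tables)
```

Here a cache hit on the team-shared instance avoids recomputation, while a cache miss produces a result that is stored only per-user. The shared data is left untouched in both cases.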