A data pipeline orchestration library for rapid iterative development with automatic cache invalidation, allowing users to focus on writing their tasks in pandas, polars, SQLAlchemy, ibis, and the like.
In pipedag, all tasks can communicate via the database. However, due to slow JDBC/ODBC drivers or communication overhead, this may be slow. Thus, two dataframe-based tasks can also communicate by handing over output dataframes directly as inputs to the following tasks. This allows the next task to start in parallel with writing the outputs to the database. The subsequent stage commit, however, should wait until all input frames have been persisted. For pandas 2.0 and polars, it may even be possible to hand over the backing Arrow data without any copy.
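The handover idea above can be sketched as follows. This is a minimal illustration using plain Python threads, not pipedag's actual API; the task and table names are hypothetical, and the database write is a stand-in for a slow JDBC/ODBC transfer:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def write_to_database(df: pd.DataFrame, table: str) -> str:
    # stand-in for a slow database write, e.g. df.to_sql(table, engine)
    return f"persisted {len(df)} rows to {table}"


def task_a() -> pd.DataFrame:
    return pd.DataFrame({"x": [1, 2, 3]})


def task_b(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(y=df["x"] * 2)


with ThreadPoolExecutor() as pool:
    out_a = task_a()
    # start persisting the output in the background ...
    write_future = pool.submit(write_to_database, out_a, "stage.task_a")
    # ... while the next task runs immediately on the in-memory dataframe
    out_b = task_b(out_a)
    # the stage commit must block until all input frames have been persisted
    write_future.result()
```

The key property is that `task_b` never waits for the database round trip; only the commit barrier at the end does.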
Questions:
[ ] how can conflicts be avoided if two tasks read the same dataframe and may modify their input? (if a useful operating mode requires cooperation from the user, it should be disabled by default)
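The conflict in question can be demonstrated with plain pandas (no pipedag involved): when two downstream tasks receive the same dataframe object and one mutates it in place, the other silently sees the modified data. Handing each task its own copy avoids this at the cost of the zero-copy benefit:

```python
import pandas as pd

shared = pd.DataFrame({"x": [1, 2, 3]})


def task_mutating(df: pd.DataFrame) -> pd.DataFrame:
    df["x"] = df["x"] * 10  # in-place modification of the shared input
    return df


def task_reading(df: pd.DataFrame) -> int:
    return int(df["x"].sum())


task_mutating(shared)
# task_reading now sees 10, 20, 30 instead of the original 1, 2, 3
assert task_reading(shared) == 60

# safe default: give each consumer its own copy unless the user opts in
safe = pd.DataFrame({"x": [1, 2, 3]})
assert task_reading(safe.copy(deep=True)) == 6
```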
Features:
[ ] hand over dataframes and start the next task in parallel with the transfer to the database
[ ] ensure zero-copy handover of outputs to the next task's inputs for Arrow-backed pandas 2.0 / polars
[ ] make persistence of intermediate tables to the database optional
=> interacts with the idea of storing Apache Arrow dataframes in a shared-memory process that acts as a cache layer in front of the persistent database (and might take care of parallel write-back)