pydiverse / pydiverse.pipedag

A data pipeline orchestration library for rapid iterative development with automatic cache invalidation, allowing users to focus on writing their tasks in pandas, polars, sqlalchemy, ibis, and the like.
https://pydiversepipedag.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Forward dataframes between tasks in parallel to SQL writeback #76

Open windiana42 opened 1 year ago

windiana42 commented 1 year ago

In pipedag, all tasks can communicate via the database. However, slow JDBC/ODBC drivers or general communication overhead can make this slow. Two dataframe-based tasks can therefore also communicate directly, with the output dataframes of one task handed over as inputs to the following tasks. This allows the next task to start in parallel with writing the outputs to the database. The subsequent stage commit, however, should wait until all input dataframes have been persisted. For pandas 2.0 and polars, it may even be possible to hand over the backing Arrow data without any copy.
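
To make the idea concrete, here is a minimal sketch of the control flow, not pipedag's actual implementation or API. It uses only the Python standard library: a list of tuples stands in for a dataframe, SQLite stands in for the database, and the names `write_back`, `task_a`, `task_b`, and `run_stage` are hypothetical. The output of the first task is passed to the second task in memory while the write-back runs in a background thread, and the "stage commit" point blocks until every write has finished.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical stand-in for a dataframe: a list of (id, value) rows.
Rows = list[tuple[int, int]]


def write_back(db_path: str, table: str, rows: Rows) -> int:
    """Persist rows to the database (the slow path in the issue)."""
    con = sqlite3.connect(db_path)  # each worker thread opens its own connection
    try:
        con.execute(f"CREATE TABLE {table} (id INTEGER, value INTEGER)")
        con.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
        con.commit()
    finally:
        con.close()
    return len(rows)


def task_a() -> Rows:
    return [(i, i * 10) for i in range(5)]


def task_b(rows: Rows) -> Rows:
    return [(i, v + 1) for i, v in rows]


def run_stage(db_path: str) -> Rows:
    pending = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        out_a = task_a()
        # Start persisting task_a's output in the background ...
        pending.append(pool.submit(write_back, db_path, "a", out_a))
        # ... while task_b receives the same object directly, without
        # reading it back from the database.
        out_b = task_b(out_a)
        pending.append(pool.submit(write_back, db_path, "b", out_b))
        # "Stage commit": wait until all frames are persisted and
        # re-raise any write error before the stage counts as done.
        wait(pending)
        for future in pending:
            future.result()
    return out_b
```

Calling `run_stage(os.path.join(tempfile.mkdtemp(), "stage.db"))` returns the second task's output and leaves both tables `a` and `b` persisted. A real implementation would additionally have to make sure that a forwarded dataframe is not mutated by the downstream task before its write-back completes.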

Questions:

Features:

windiana42 commented 1 year ago

=> This interacts with the idea of storing Apache Arrow dataframes in a shared-memory process that acts as a cache layer in front of the persistent database (and might also take care of the parallel write-back).