pydiverse / pydiverse.pipedag

A data pipeline orchestration library for rapid iterative development with automatic cache invalidation, allowing users to focus on writing their tasks in pandas, polars, sqlalchemy, ibis, and the like.
https://pydiversepipedag.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Store categorical variables with automatic shadow table in relational DB #74

Open windiana42 opened 1 year ago

windiana42 commented 1 year ago

Categorical columns in pandas could be materialized to and dematerialized from relational database tables if a second table (a shadow table) stores the mapping from categorical IDs to the actual strings.
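To make the idea concrete, here is a minimal sketch in plain pandas, not pipedag's actual API. The helper names `split_categoricals` and `restore_categoricals` and the `id` / `category` column layout of the mapping tables are assumptions made for illustration only.

```python
import pandas as pd


def split_categoricals(df):
    """Replace each categorical column by its integer codes.

    Returns the code-only frame plus one mapping frame per categorical
    column (hypothetical layout: columns "id" and "category").
    """
    codes_df = df.copy()
    mappings = {}
    for col in df.columns:
        if isinstance(df[col].dtype, pd.CategoricalDtype):
            cat = df[col].cat
            mappings[col] = pd.DataFrame(
                {
                    "id": list(range(len(cat.categories))),
                    "category": list(cat.categories),
                }
            )
            # pandas uses the code -1 for missing values
            codes_df[col] = cat.codes
    return codes_df, mappings


def restore_categoricals(codes_df, mappings):
    """Rebuild categorical columns from codes and mapping frames."""
    out = codes_df.copy()
    for col, mapping in mappings.items():
        categories = mapping.sort_values("id")["category"].to_list()
        out[col] = pd.Categorical.from_codes(
            codes_df[col].to_numpy(), categories=categories
        )
    return out


df = pd.DataFrame(
    {"color": pd.Categorical(["red", "blue", None, "red"]), "n": [1, 2, 3, 4]}
)
codes_df, mappings = split_categoricals(df)
restored = restore_categoricals(codes_df, mappings)
```

In this sketch the `ordered` flag of a categorical is not stored, so a real implementation would need to persist it alongside the mapping.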

It would even be possible to provide functions for programmatically generated SQL that automate the resolution with one join per categorical column.
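One way such a function could look with SQLAlchemy Core is sketched below. This is only an illustration: `resolve_categoricals`, the table names `items` and `items__color`, and the shadow table columns `id` and `category` are hypothetical and not part of pipedag.

```python
import sqlalchemy as sa


def resolve_categoricals(table, shadows):
    """Build a SELECT that replaces code columns by their strings.

    `shadows` maps a column name of `table` to its shadow table
    (assumed columns: "id" and "category"). Each categorical column
    adds exactly one LEFT OUTER JOIN, so rows with missing codes survive.
    """
    from_clause = table
    columns = []
    for col in table.columns:
        shadow = shadows.get(col.name)
        if shadow is None:
            columns.append(col)
            continue
        from_clause = from_clause.outerjoin(shadow, col == shadow.c.id)
        columns.append(shadow.c.category.label(col.name))
    return sa.select(*columns).select_from(from_clause)


metadata = sa.MetaData()
color_shadow = sa.Table(
    "items__color",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("category", sa.String),
)
items = sa.Table(
    "items",
    metadata,
    sa.Column("n", sa.Integer),
    sa.Column("color", sa.Integer),  # categorical code, NULL if missing
)

engine = sa.create_engine("sqlite://")  # in-memory database for the demo
metadata.create_all(engine)
query = resolve_categoricals(items, {"color": color_shadow}).order_by(items.c.n)

with engine.begin() as conn:
    conn.execute(
        sa.insert(color_shadow),
        [{"id": 0, "category": "blue"}, {"id": 1, "category": "red"}],
    )
    conn.execute(
        sa.insert(items),
        [{"n": 1, "color": 1}, {"n": 2, "color": 0}, {"n": 3, "color": None}],
    )
    rows = [tuple(r) for r in conn.execute(query)]
```

If several categorical columns shared one shadow table, each join would need its own alias (`shadow.alias()`); the sketch assumes one shadow table per column.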

windiana42 commented 1 year ago

=> This might interact with the idea of storing row numbers so that two dataframe tasks receive their data in the same order.