pydiverse / pydiverse.pipedag

A data pipeline orchestration library for rapid iterative development with automatic cache invalidation, allowing users to focus on writing their tasks in pandas, polars, sqlalchemy, ibis, and the like.
https://pydiversepipedag.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Simplify debugging of materialization write-back #116

Open windiana42 opened 10 months ago

windiana42 commented 10 months ago

At the end of a task, the final materialization may raise errors that originate in the database system and are hard to foresee from pipedag user code. Debugging such problems would be significantly easier if the user could trigger materialization directly during interactive debugging. It should also be possible to automatically overwrite tables, so that materialization can be called repeatedly even after one of the interactive test calls has succeeded.
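
To illustrate the "call it repeatedly" part, here is a minimal sketch of an overwrite-style write. It is not pipedag's API: `debug_materialize` is a hypothetical helper, and the stdlib `sqlite3` module stands in for whatever database backend is actually configured.

```python
import sqlite3


def debug_materialize(conn, table, columns, rows):
    """Hypothetical helper: write rows to `table`, replacing any earlier copy."""
    quoted_cols = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    with conn:  # commits on success, rolls back if any statement raises
        # Dropping first is what makes repeated calls safe.
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
        conn.execute(f'CREATE TABLE "{table}" ({quoted_cols})')
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', rows
        )
    # Return the row count so the caller can sanity-check the write.
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


conn = sqlite3.connect(":memory:")
rows = [(1, "a"), (2, "b")]
debug_materialize(conn, "debug_out", ["id", "name"], rows)
# A second call does not fail with "table already exists"; it overwrites.
debug_materialize(conn, "debug_out", ["id", "name"], rows)
```

In an interactive session the user could then tweak the data and call the helper again without cleaning up by hand in between.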

Databases may limit row length or have trouble with one specific combination of types without reporting a useful error message. Binary search is therefore often the way to narrow down a problem. For dataframe writing, we could even automate this binary search for the column that breaks the materialization.
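
A rough sketch of what the automated binary search could look like. The names are hypothetical: `find_breaking_column` is not an existing pipedag function, and `try_write` is a stand-in for "materialize only these columns and raise on failure". It assumes a single column causes the failure on its own.

```python
from typing import Callable, Optional, Sequence


def find_breaking_column(
    columns: Sequence[str],
    try_write: Callable[[Sequence[str]], None],
) -> Optional[str]:
    """Return one column whose presence makes try_write fail, or None."""

    def fails(subset: Sequence[str]) -> bool:
        try:
            try_write(subset)
            return False
        except Exception:
            return True

    candidates = list(columns)
    if not candidates or not fails(candidates):
        return None  # writing everything works, nothing to find
    while len(candidates) > 1:
        mid = len(candidates) // 2
        left = candidates[:mid]
        # If the left half fails on its own, the culprit is in there;
        # otherwise it is assumed to be in the right half.
        candidates = left if fails(left) else candidates[mid:]
    return candidates[0]


# Hypothetical stand-in for a database write with an unhelpful error.
def fake_write(subset: Sequence[str]) -> None:
    if "payload" in subset:
        raise ValueError("opaque database error")


print(find_breaking_column(["id", "name", "payload", "created_at"], fake_write))
# prints "payload"
```

This takes about log2(number of columns) test writes. If the failure only shows up for a combination of columns (e.g. a row-length limit), both halves can succeed on their own and the result is not meaningful, so that case would need separate handling.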

windiana42 commented 10 months ago

Another idea for this manual debugging materialization is optional checking of explicit dtypes, e.g. for String(n) columns: check that every string to be written to the column has length <= n.
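
A minimal sketch of such a check in plain Python. `check_string_lengths` and its argument layout (column name mapped to values, column name mapped to declared n) are assumptions for illustration, not part of pipedag.

```python
from typing import List, Mapping, Optional, Sequence, Tuple


def check_string_lengths(
    data: Mapping[str, Sequence[Optional[str]]],
    max_lengths: Mapping[str, int],
) -> List[Tuple[str, int, int]]:
    """Return (column, row index, actual length) for every value longer than n."""
    violations = []
    for column, limit in max_lengths.items():
        for row, value in enumerate(data[column]):
            # None stands for NULL and is never too long.
            if value is not None and len(value) > limit:
                violations.append((column, row, len(value)))
    return violations


data = {
    "code": ["AB", "ABCD", None],
    "name": ["alice", "bob", "christopher"],
}
# Assumed declarations: code is String(3), name is String(8).
print(check_string_lengths(data, {"code": 3, "name": 8}))
# prints [('code', 1, 4), ('name', 2, 11)]
```

Note that `len` counts Unicode code points, while some databases count bytes for string columns, so a real check might need to compare encoded lengths depending on the backend.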