splitgraph / seafowl

Analytical database for data-driven Web applications 🪶
https://seafowl.io
Apache License 2.0

Figure out what to do with `table_column` catalog table and bulk schema loading in general #475

Open gruuya opened 7 months ago

gruuya commented 7 months ago

Currently we only use our `Schema` for the `to_column_names_types` call when persisting the columns to the `table_column` metadata table. It should therefore be possible to remove that `Schema` altogether and rely on the underlying `arrow_schema` call directly (though the conversion could be extracted into a separate function).
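As a minimal sketch, assuming the custom `Schema` is dropped, the extracted helper could look something like this (the function name and the string encoding of the types are placeholders, not the actual storage format):

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema, SchemaRef};

// Hypothetical standalone helper replacing `Schema::to_column_names_types`:
// derive the (column name, type) pairs straight from the Arrow schema, so
// the custom `Schema` wrapper becomes unnecessary.
fn to_column_names_types(schema: &SchemaRef) -> Vec<(String, String)> {
    schema
        .fields()
        .iter()
        .map(|f| (f.name().clone(), f.data_type().to_string()))
        .collect()
}

fn main() {
    let schema: SchemaRef = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, true),
    ]));
    assert_eq!(
        to_column_names_types(&schema),
        vec![
            ("id".to_string(), "Int64".to_string()),
            ("name".to_string(), "Utf8".to_string()),
        ]
    );
}
```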

On a more general level, we also currently don't read anything back from our `table_column` catalog table. When fetching the schema for a given table, whether for `information_schema.columns` or via a `TableProvider::schema` call somewhere in the code (which is also what DataFusion uses internally to answer `information_schema.columns` queries), we always rely on the Delta table's schema, which is ultimately reconstructed from the logs. `information_schema.columns` in particular will pose a problem at some point, see here: https://github.com/splitgraph/seafowl/blob/40b1158a90121422e66acbc66e4d536f6081b6d7/src/catalog.rs#L285-L293

The solution I outlined in that comment essentially amounts to adding the ability to bulk-load Delta table schemas (which would involve changes in delta-rs and probably DataFusion). A potentially better solution is to thinly wrap the Delta table inside our own table provider, serve our own (bulk-loaded) catalog info from `TableProvider::schema`, and only resolve `TableProvider::scan` through the wrapped Delta table. The main drawback is the potential for mismatches and double-tracking of schemas (in our catalog and in the Delta logs), though that might not be too bad.
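A rough sketch of that wrapper, with a hypothetical `SeafowlTable` struct; the trait signatures below match an older DataFusion release and may need adjusting, and the delegated scan relies on `deltalake`'s own `TableProvider` impl:

```rust
use std::any::Any;
use std::sync::Arc;

use async_trait::async_trait;
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::datasource::{TableProvider, TableType};
use datafusion::error::Result;
use datafusion::execution::context::SessionState;
use datafusion::logical_expr::Expr;
use datafusion::physical_plan::ExecutionPlan;
use deltalake::DeltaTable;

/// Hypothetical wrapper: reports the schema from our (bulk-loaded) catalog,
/// but delegates actual scans to the underlying Delta table.
struct SeafowlTable {
    /// Schema as persisted in the `table_column` catalog table.
    catalog_schema: SchemaRef,
    /// The wrapped Delta table, only consulted at scan time.
    inner: DeltaTable,
}

#[async_trait]
impl TableProvider for SeafowlTable {
    fn as_any(&self) -> &dyn Any {
        self
    }

    // Cheap: answered from the catalog, so `information_schema.columns`
    // doesn't have to replay every table's Delta log just to list columns.
    fn schema(&self) -> SchemaRef {
        self.catalog_schema.clone()
    }

    fn table_type(&self) -> TableType {
        TableType::Base
    }

    // Scans still go through the Delta table itself, so the log is only
    // consulted when data is actually read.
    async fn scan(
        &self,
        state: &SessionState,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        self.inner.scan(state, projection, filters, limit).await
    }
}
```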

There's also the minor matter of format: currently we store the fields using the unofficial Arrow JSON representation, while our storage layer has its own schema/field types. There's also a chance we'll want to introduce our own field format (to facilitate better compatibility with Postgres?), in which case wrapping the Delta table would make even more sense.

gruuya commented 7 months ago

I've also come to realize that the unofficial JSON representation is probably not robust/forward-compatible enough, and we should probably just migrate to `serde::Serialize`/`Deserialize` for `Schema`/`Field`. Note that the two formats are not equivalent: https://github.com/apache/arrow-rs/issues/2876
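For illustration, a minimal round-trip via the serde derives (feature-gated in arrow-rs; since the wire format differs from the legacy Arrow JSON, existing `table_column` rows would need a migration):

```rust
use arrow::datatypes::{DataType, Field, Schema};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, true),
    ]);

    // Serialize for persisting into `table_column`...
    let stored = serde_json::to_string(&schema)?;
    // ...and round-trip it back when loading the catalog.
    let loaded: Schema = serde_json::from_str(&stored)?;
    assert_eq!(schema, loaded);
    Ok(())
}
```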