Open gruuya opened 7 months ago
I've also come to realize that the unofficial JSON representation is probably not robust or forward-compatible enough, and we should probably just migrate to `serde::Serialize`/`Deserialize` for the `Schema`/`Field`, which is not equivalent: https://github.com/apache/arrow-rs/issues/2876

Currently we're not really using our `Schema` for anything but the `to_column_names_types` call when persisting the columns to the `table_column` metadata table. So it's possible to remove that `Schema` altogether and just use the underlying `arrow_schema` call (though that could be extracted to a separate function).

On a more general level, we also currently don't use anything from our `table_column` catalog table. When fetching a schema for a given table, such as in `information_schema.columns` or when calling `TableProvider::schema` somewhere in code (which is what DataFusion uses for `information_schema.columns` queries internally as well), we always rely on the Delta table's schema, which is ultimately reconstructed from the logs. `information_schema.columns` in particular will pose a problem at some point, see here: https://github.com/splitgraph/seafowl/blob/40b1158a90121422e66acbc66e4d536f6081b6d7/src/catalog.rs#L285-L293

The solution I outlined in that comment amounts to adding an ability for bulk-loading Delta table schemas (which would involve changes in delta-rs and probably datafusion). A potentially better solution is for us to thinly wrap the Delta table inside our own table and then use our own (bulk-loaded) catalog info in `TableProvider::schema`, and only resolve `TableProvider::scan`s using the wrapped Delta table. The main drawback there is the potential mismatch between, and double tracking of, schemas (in our catalog and in the Delta logs), which might not be that bad.

There's also a minor matter of format: currently we store the fields using the unofficial Arrow JSON representation, while our storage layer has its own schema/field types. There's also a possibility we'll want to introduce our own field format (to facilitate better compatibility with Postgres?), so wrapping the Delta table in that case would make even more sense.
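
To make the wrapping idea more concrete, here is a minimal, std-only sketch. Everything in it is a simplified stand-in rather than real Seafowl, DataFusion or delta-rs code: the `TableProvider` trait below is a hypothetical two-method analogue of DataFusion's trait (the real one is async and returns Arrow types), `Field`/`Schema` are toy types, and `bulk_load_schemas` assumes the `table_column` catalog rows can be read as `(table, column, type)` tuples.

```rust
use std::collections::HashMap;

/// Toy field type; a stand-in for an Arrow field or a custom field format.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

type Schema = Vec<Field>;

/// Hypothetical, heavily simplified analogue of DataFusion's `TableProvider`.
trait TableProvider {
    fn schema(&self) -> Schema;
    fn scan(&self) -> Vec<Vec<String>>;
}

/// Stand-in for a Delta table whose schema is reconstructed from its log.
struct DeltaTable {
    log_schema: Schema,
    rows: Vec<Vec<String>>,
}

impl TableProvider for DeltaTable {
    fn schema(&self) -> Schema {
        // In the real setting this requires the table's log to be loaded.
        self.log_schema.clone()
    }

    fn scan(&self) -> Vec<Vec<String>> {
        self.rows.clone()
    }
}

/// Thin wrapper: the schema comes from the catalog, scans go to the Delta table.
struct WrappedTable {
    catalog_schema: Schema,
    inner: DeltaTable,
}

impl TableProvider for WrappedTable {
    fn schema(&self) -> Schema {
        // Answered from bulk-loaded catalog info; the Delta log is not consulted.
        self.catalog_schema.clone()
    }

    fn scan(&self) -> Vec<Vec<String>> {
        // Only scans are resolved through the wrapped Delta table.
        self.inner.scan()
    }
}

/// Groups `(table, column, type)` catalog rows into per-table schemas in one pass,
/// preserving the column order of the input rows.
fn bulk_load_schemas(rows: &[(&str, &str, &str)]) -> HashMap<String, Schema> {
    let mut schemas: HashMap<String, Schema> = HashMap::new();
    for (table, column, data_type) in rows {
        schemas.entry(table.to_string()).or_default().push(Field {
            name: column.to_string(),
            data_type: data_type.to_string(),
        });
    }
    schemas
}
```

In this sketch `WrappedTable::schema` never touches `inner`, which is the point of the approach: listing columns for many tables only needs the single catalog pass in `bulk_load_schemas`. It also makes the drawback visible, since nothing here keeps `catalog_schema` and `log_schema` in sync.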