Open wjones127 opened 5 months ago
What is the current status for this? I'm interested in helping out with #2100, but wondering if you need any assistance with this first?
I haven't had the bandwidth to do this, but it is high on our priority list since it blocks auto-compaction and such. You are welcome to take it on if you'd like.
I'll take a look and see how I go. Did you envision this metadata being stored as protobuf or JSON?
I would expect a field in the Manifest message: https://github.com/lancedb/lance/blob/main/protos/table.proto
We'd like to store configuration options for the table. For example, when we support auto-compaction, we need a place in the table to store compaction parameters. Another example from an existing issue (#1206): if DynamoDB is configured as the commit lock, we'd like to record that as a table config so it can be enforced.
Comparison with existing metadata
There are two existing metadata fields: schema metadata and version metadata. Table configuration differs from schema metadata in that it's meant for configuration about the table, not metadata about the data itself. It should not be propagated to the Arrow record batches read off of the table.
It differs from version metadata in that it is meant to be propagated to future versions, whereas version metadata is specific to a particular version.
Configuration format
The simplest thing is to have a key value format. The keys and values can be UTF-8 strings.
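To make the key-value proposal concrete, here is a minimal sketch of a string-to-string table config. The function name `validate_config` and the example keys are hypothetical and only for illustration; nothing here is an existing Lance API.

```python
def validate_config(config):
    """Return a copy of `config` after checking every key and value is a UTF-8 encodable str."""
    for key, value in config.items():
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError(f"config entries must be strings, got {key!r}: {value!r}")
        # str.encode raises UnicodeEncodeError for text that is not valid UTF-8
        # (for example, lone surrogates).
        key.encode("utf-8")
        value.encode("utf-8")
    return dict(config)


# Hypothetical keys: non-string settings are stored as their string form.
config = validate_config({"example.owner": "data-team", "example.max_rows": str(1024)})
```

Keeping both sides as strings means a value such as a number or boolean has to be parsed by whoever reads the key, which keeps the stored format simple at the cost of pushing typing onto the reader.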
Reserved keys
Lance will reserve some keys for its own use as configuration options. These will be keys that start with `lance:`.
Propagation
When a transaction adds, updates, or removes a metadata field, that change must be propagated by future transactions. This includes concurrent transactions. To make this possible, we need to change the conflict resolution code paths to propagate these changes. Writers who cannot do this shouldn't be allowed to write to tables with these keys filled in.
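One way to picture the propagation requirement is as a rebase: a committing transaction applies its own config changes on top of the latest committed config, not on top of the version it originally read. The sketch below is an illustration under assumptions. `ConfigUpdate`, `apply_update`, and the example keys are hypothetical names, and letting the later committer win when two transactions touch the same key is an assumption the text above does not settle.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConfigUpdate:
    """Config changes carried by one transaction (hypothetical shape)."""
    upserts: dict = field(default_factory=dict)  # keys to add or overwrite
    deletes: frozenset = frozenset()             # keys to remove


def apply_update(latest_config, update):
    """Rebase `update` onto the latest committed config and return the new config."""
    new_config = {k: v for k, v in latest_config.items() if k not in update.deletes}
    new_config.update(update.upserts)
    return new_config


# Both transactions start from version 1.
v1 = {"example.owner": "data-team", "example.tmp": "x"}
txn_a = ConfigUpdate(upserts={"example.region": "us-east"})
txn_b = ConfigUpdate(deletes=frozenset({"example.tmp"}))

# A commits first; B is then applied to A's result rather than to v1,
# so A's addition survives B's commit.
v2 = apply_update(v1, txn_a)
v3 = apply_update(v2, txn_b)
```

A transaction that makes no config changes carries an empty `ConfigUpdate`, so applying it to the latest config simply carries every existing key forward.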
Therefore, we'll need to introduce a new writer feature flag for this. Once some metadata is added, that flag will be active in the table. There is already a mechanism preventing older writers from writing to a dataset with an unknown flag enabled.
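The flag check can be sketched as a bitmask comparison. The bit value `FLAG_TABLE_CONFIG` and both function names below are hypothetical, and the sketch assumes feature flags are stored as an integer bitmask; it only illustrates the gating idea, not the actual Lance code.

```python
FLAG_TABLE_CONFIG = 1 << 3  # hypothetical bit position for the new writer flag


def can_write(manifest_writer_flags, supported_flags):
    """A writer may commit only if it understands every flag set in the manifest."""
    return (manifest_writer_flags & ~supported_flags) == 0


def flags_after_commit(current_flags, config):
    """Turn the flag on once the table holds any config entries."""
    return current_flags | FLAG_TABLE_CONFIG if config else current_flags


old_writer = 0b0111                     # predates the new flag
new_writer = 0b0111 | FLAG_TABLE_CONFIG

flags = flags_after_commit(0b0001, {"example.owner": "data-team"})
```

With these values, `can_write(flags, old_writer)` is false and `can_write(flags, new_writer)` is true. Whether the flag should be cleared again once the last key is removed is left open here.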