lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0
3.78k stars, 207 forks

Add a table metadata field to table format #2200

Open wjones127 opened 5 months ago

wjones127 commented 5 months ago

We'd like to store configuration options for the table. For example, when we support auto-compaction, we need a place in the table to store compaction parameters. Another example from an existing issue (#1206): if DynamoDB is configured as the commit lock, we'd like to record that as a table config so it can be enforced.

Comparison with existing metadata

There are two existing metadata fields: schema metadata and version metadata. The new field differs from schema metadata in that it holds configuration for the table, not metadata about the data itself. It should not be propagated to the Arrow record batches read from the table.

It also differs from version metadata: table metadata is meant to be propagated to future versions, whereas version metadata is specific to a particular version.

Configuration format

The simplest option is a key-value format, where both keys and values are UTF-8 strings.

Reserved keys

Lance will reserve some keys for its own use as configuration options. These will be keys that start with lance:.
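To make the key-value format and the reserved prefix concrete, here is a minimal Python sketch of how user-supplied config updates might be validated. The function name `validate_user_config` and the exact error behavior are assumptions for illustration, not part of Lance's API; only the string-to-string format and the `lance:` prefix come from the proposal.

```python
RESERVED_PREFIX = "lance:"  # prefix named in the proposal above


def validate_user_config(updates: dict) -> None:
    """Check a batch of user-supplied table config updates.

    Hypothetical helper: keys and values must be strings, and user code
    may not write keys that Lance reserves for itself.
    """
    for key, value in updates.items():
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError(f"config entries must be str -> str, got {key!r}: {value!r}")
        if key.startswith(RESERVED_PREFIX):
            raise ValueError(f"key {key!r} uses the reserved prefix {RESERVED_PREFIX!r}")


# An ordinary user key passes; a reserved key is rejected.
validate_user_config({"team": "search"})
try:
    validate_user_config({"lance:some_option": "on"})  # hypothetical reserved key
except ValueError as err:
    print(err)
```

Whether user writes to reserved keys should be rejected outright or merely discouraged is a design choice the proposal leaves open; the sketch picks rejection.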

Propagation

When a transaction adds, updates, or removes a metadata field, that change must be propagated by future transactions. This includes concurrent transactions. To make this possible, we need to change the conflict resolution code paths to propagate these changes. Writers who cannot do this shouldn't be allowed to write to tables with these keys filled in.
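The propagation rule can be sketched roughly as follows, assuming a transaction records its config changes as a set of upserts and deletes. `ConfigChange`, `apply_change`, and `next_config` are hypothetical names for illustration, not Lance's actual transaction types. The point is that a writer always builds the next version's config from the latest committed config, including when it rebases after a concurrent commit.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ConfigChange:
    """Config edits carried by one transaction (hypothetical structure)."""
    upserts: dict = field(default_factory=dict)
    deletes: set = field(default_factory=set)


def apply_change(config: dict, change: ConfigChange) -> dict:
    """Return a new config with deletes removed and then upserts applied."""
    new_config = {k: v for k, v in config.items() if k not in change.deletes}
    new_config.update(change.upserts)
    return new_config


def next_config(latest_config: dict, change: Optional[ConfigChange] = None) -> dict:
    """Config for the next version: carry the latest forward, then apply any change."""
    if change is None:
        return dict(latest_config)
    return apply_change(latest_config, change)


# Two writers both start from version 5.
v5 = {"a": "1"}

# Writer A commits version 6 with a config change.
v6 = next_config(v5, ConfigChange(upserts={"b": "2"}))

# Writer B only appends data. On rebase it must carry v6's config forward,
# not the stale v5 config it originally read.
v7 = next_config(v6)
assert v7 == {"a": "1", "b": "2"}
```

A writer that copies the config from the version it originally read, rather than from the latest version, would silently drop writer A's change. That is the failure mode the conflict resolution changes need to rule out.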

Therefore, we'll need to introduce a new writer feature flag for this. Once some metadata is added, that flag will be active in the table. There is already a mechanism preventing older writers from writing to a dataset with an unknown flag enabled.
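As an illustration of how such a flag would gate writers, here is a small Python sketch that treats writer feature flags as a bitmask. The bit position and the names `FLAG_TABLE_CONFIG`, `can_write`, and `flags_after_commit` are assumptions for this example; the real flag value and checks live in Lance's format code.

```python
FLAG_TABLE_CONFIG = 1 << 3  # hypothetical bit for the new writer feature flag

OLD_WRITER_FLAGS = 0b0111  # flags an older writer understands (illustrative)
NEW_WRITER_FLAGS = OLD_WRITER_FLAGS | FLAG_TABLE_CONFIG


def can_write(table_flags: int, supported_flags: int) -> bool:
    """A writer may commit only if it understands every flag set on the table."""
    return (table_flags & ~supported_flags) == 0


def flags_after_commit(table_flags: int, config: dict) -> int:
    """Turn the flag on once the table carries any config entries."""
    if config:
        return table_flags | FLAG_TABLE_CONFIG
    return table_flags


# Before any config is stored, both old and new writers can commit.
flags = 0b0001
assert can_write(flags, OLD_WRITER_FLAGS)

# Once config is added the flag becomes active and old writers are locked out.
flags = flags_after_commit(flags, {"a": "1"})
assert not can_write(flags, OLD_WRITER_FLAGS)
assert can_write(flags, NEW_WRITER_FLAGS)
```

This sketch never clears the flag. Whether it should be cleared when all config entries are later removed is not specified in the proposal.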

dsgibbons commented 1 month ago

What is the current status for this? I'm interested in helping out with #2100, but wondering if you need any assistance with this first?

wjones127 commented 4 weeks ago

I haven't had the bandwidth to do this, but it is high on our priority list since it blocks auto-compaction and such. You are welcome to take it on if you'd like.

dsgibbons commented 4 weeks ago

I'll take a look and see how I go. Did you envision this metadata being stored as protobuf or JSON?

wjones127 commented 4 weeks ago

I would expect a field in the Manifest message: https://github.com/lancedb/lance/blob/main/protos/table.proto