Add support for dictionary encoded fields to the v2 reader/writer

lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, with more integrations coming..

https://lancedb.github.io/lance/

Apache License 2.0

3.73k stars 202 forks source link

Add support for dictionary encoded fields to the v2 reader/writer #2347

Open westonpace opened 3 months ago

westonpace commented 3 months ago

Dictionary encoded fields would presumably keep their dictionary encoding when writing (though this will technically be at the leisure of the decoder picker which will be an extension point)

On reading, dictionary encoding will be treated a little differently than previous file formats.

A user can ask for a field to be read in as a dictionary and the decoder will dictionary encode it even if it wasn't originally in that encoding.
A user can ask for a field in plan encoding and the decoder will dictionary decode it even if it was originally dictionary encoded.
We will need to come up with some kind of mechanism for the "default projection"
- Should we only return dictionary encoded form if the field was written in that form? Or do we try and detect, at read time, that dictionary decoding would be applicable (I think not but it is a possibility)
- Should we have a "prefer_simple_encodings" option which never reads into dictionary, even if the encoded form is dictionary?

Input: ["abc", "def", "abc", "ghi", "def"]

Output:

Dictionary: ["abc", "def", "ghi"]
Indices: [0, 1, 0, 2, 1]

Tasks:

[x] Detect if dictionary encoding is appropriate (e.g. some kind of "unique values" approximation)
[x] Encode task to actually encode the data and write out dictionary encoded buffers
- [ ] Should this be array encoder or a field encoder?
- [ ] Should the dictionary go be a page buffer or a column metadata buffer? (probably want column metadata to start with)
- [ ] Can even do something exotic like put the entire column dictionary into the last page
[x] Decode task
- [ ] If we assume dictionary is in column metadata then decode dictionary as part of opening file
- [ ] Can schedule easily
- [ ] Decode is fairly straightforward

Note: when benchmarking, should see significant difference between initial reads and consecutive reads where metadata is cached.

wjones127 commented 3 months ago

In Arrow Parquet implementations, the solution for the mapping was to serialize the preferred Arrow schema as a metadata field.

westonpace commented 3 months ago

In Arrow Parquet implementations, the solution for the mapping was to serialize the preferred Arrow schema as a metadata field.

That's how the default projection works today. When writing the file we save the schema given by the user. When reading, we use that schema if no projection is given.

I think we want to get to a place, in writing, where dictionary encoding is automatic though. So we will detect if a column would benefit from dictionary encoding and then use it automatically (parquet does this too I believe). So then what do we do on read? Especially if all of the pages for a column ended up being eligible for dictionary encoding. We could read those back into a dictionary encoded field even though the user never asked for it.

This is probably more important for RLE where we could potentially benefit by keeping the batches in RLE form and running compute on the RLE form of the batches. Dictionary encoding is less useful since most compute functions do not have dictionary encoded kernels.

niyue commented 3 months ago

Should we have a "prefer_simple_encodings" option which never reads into dictionary, even if the encoded form is dictionary

This may be useful in some cases, for example, some frameworks like Gandiva, is not capable of handling dictionary encoded arrays, which means decoding dictionary encoded array into simple encoding array is required when working with these frameworks.

niyue commented 2 months ago

I'm wondering if the current design proposal allows for reading partial and minimal dictionary tables to minimize I/O. In the example above, if only rows [0, 2] are requested, can we read just the 0-th entry ['abc'] from the dictionary instead of loading the entire dictionary buffer from the disk? Thanks.

westonpace commented 2 months ago

It could maybe be done if you were willing to make two back to back I/O requests (one to get indices then one to get dictionary value). This may or may not be faster (would probably depend on the size of the dictionary and the number of rows requested)

I believe @raunaks13 is going to be looking at some dictionary encoding this week.

For our (LanceDb's) use cases I expect that the dictionaries will be small enough to fit into the column metadata buffers (and read and parsed when the file is opened) so that we can cache them. Then we can selectively read the targeted indices.

niyue commented 1 month ago

@westonpace In the lance v2 format, we currently support plain encoding, dictionary encoding, and general compression. At present, plain encoding is the default option. General compression can be enabled via an environment variable, while dictionary encoding is automatically enabled based on field cardinality calculations.

With additional encodings like FSST in development, do we have any plans to allow users to explicitly specify the encoding method for each field, so that API callers can have more control over this behavior? Thank you.

westonpace commented 1 month ago

@niyue

Users can control this with the FieldEncodingStrategy trait today but that has no python bindings and is overkill.

There is some evolution of the core encoding strategy. Dictionary is applied when it seems the data has low cardinality. FSST will possibly have a similar criteria (attempt FSST and only use the compressed if it is smaller).

Other encodings, like general compression, are more of a tradeoff (compute vs IO). A "packed struct" encoding is currently in development as well (https://github.com/lancedb/lance/pull/2593) This encoding will never be applied automatically and must be specifically requested by the user. I have currently suggested that we use field metadata to make the decision. I think field metadata is probably a good solution.

From a LanceDB perspective (not just lance file), I imagine users will define expected column behaviors in the "table schema" (e.g. in the pydantic binding). In other words, if a column is regularly used in filters then they should mark it "indexed" in the table schema and LanceDB will create a secondary index. Similarly, if a nested field is expected to always be accessed in its entirety, then it should be marked "packed" in the table schema and LanceDB will set the field metadata appropriately to use the packed struct encoding.

niyue commented 1 month ago

@westonpace

I have currently suggested that we use field metadata to make the decision

Do you think such field metadata can be used to control whether other encodings (not just packed struct) should be applied? For example, dictionary encoding, FSST, or general compression could all be used for encoding a string field's data. Users may have some domain knowledge (e.g., cardinality, max length) that helps them understand the best encoding for such a field. It would be great if they could somehow control this behavior explicitly using their domain knowledge.

Or do you expect this encoding choice to be opaque to users, with the system internally deciding the encoding for a field, potentially using multiple mixed encodings for the same field (if data is partitioned), similar to btrblocks? Thanks.

westonpace commented 1 month ago

Yes, I expect there will be cases where users will have advanced knowledge of their data and want specific encodings. Some examples:

The "packed struct" I mentioned earlier will be chosen sometimes based on the expected query patterns.
Users will sometimes have a dictionary array pre-calculated

I think field metadata is the best spot for users to specify these things. I think of these less as "encoding hints" and more as "extension types"

Or do you expect this encoding choice to be opaque to users, with the system internally deciding the encoding for a field, potentially using multiple mixed encodings for the same field (if data is partitioned), similar to btrblocks?

Yes, this is also true. I don't think these concepts are exclusive.

For example, a user has a pre-calculated dictionary, and so they supply the input as a dictionary array. Lance might still decide to use a different encoding for a particular page/chunk if it's appropriate. For example, a chunk might be constant and so Lance uses a "constant encoding" or "RLE encoding" instead of the regular dictionary encoding for that chunk.

Also, by default, Lance will give back the data in the format the user provided it on write. However, it was designed to allow for the read-encoding to be specified as well. For example, a user could store a column with dictionary encoding and then specify they want plain encoding when reading. Lance should dictionary decode on read. These APIs do not exist yet in Python though.

Long term I think some query engines may evolve to handle multiple encodings for a column. In other words, the schema is not fixed for a given stream of data and might change from batch to batch (the logical "types" are consistent but the encodings vary).

niyue commented 1 month ago

Thanks so much for the explanation.

I think of these less as "encoding hints" and more as "extension types"

Could you elaborate more on this part? Do you mean that if we support using field metadata to specify these settings, it will be hints like "a row-major accessed struct" or "low cardinality field" instead of "packed struct" or "dictionary encoded field"?