lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0
3.99k stars 230 forks

Allow scalar indices to be used in distinct queries #2719

Open wjones127 opened 3 months ago

wjones127 commented 3 months ago

Right now, the recommended way to get the distinct values of a column is to use DuckDB:

rows = duckdb.query("SELECT distinct <my column name> FROM lance_dataset")

However, if we have scalar indices on that column, we could execute the query much more quickly. We likely couldn't do that through the DuckDB integration, but we could do it within a DataFusion query easily.
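To illustrate why an index could help here, below is a toy sketch in plain Python. It is not Lance's index implementation: the helper names (`build_scalar_index`, `distinct_via_scan`, `distinct_via_index`) and the dictionary layout are assumptions made up for this example. The point is only that an index keyed by value already contains the distinct values, so a distinct query can read the keys instead of scanning every row.

```python
from collections import defaultdict


def build_scalar_index(column):
    """Toy index (hypothetical layout): map each value to the row ids holding it."""
    index = defaultdict(list)
    for row_id, value in enumerate(column):
        index[value].append(row_id)
    return dict(index)


def distinct_via_scan(column):
    """Baseline: touch every row and collect the values seen."""
    seen = set()
    for value in column:
        seen.add(value)
    return sorted(seen)


def distinct_via_index(index):
    """With an index, the distinct values are just its keys; no rows are read."""
    return sorted(index)


column = ["b", "a", "b", "c", "a", "b"]
index = build_scalar_index(column)

scan_result = distinct_via_scan(column)
index_result = distinct_via_index(index)
print(scan_result, index_result)
```

The scan does work proportional to the number of rows, while the index path does work proportional to the number of distinct values, which is where the speedup would come from on low-cardinality columns.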

wjones127 commented 3 months ago

To implement this, first we would need to expose an API like this:

import lance

dataset = lance.dataset("test")
lance.sql("SELECT * FROM table", table=dataset).to_table()

That should be straightforward since we already have a table provider.

Then, we would need: