lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0
3.53k stars, 183 forks

Expose DataFusion extensions (plan nodes, plan optimizer rules) #1782

Open wjones127 opened 5 months ago

wjones127 commented 5 months ago

The code in dataset/scanner.rs has gotten extremely complicated, to the point where it is hard to test. Before we make any improvements, we need to refactor it so that it is easier to test and extend.

In addition, outside codebases may wish to extend Lance's capabilities by modifying or composing plans. For example, in LanceDB, we'll want to add a separate WAL that needs to be queried during KNN queries and scans.
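To make the WAL example concrete, here is a minimal standalone sketch of what "composing plans" could look like. It is only a toy model: `Plan`, `Sources`, `with_wal`, and `execute` are hypothetical names invented for illustration and are not part of Lance, LanceDB, or DataFusion. The idea is that an outside codebase rewrites a KNN plan so that rows from the WAL are searched as well, then merges both result sets with a final top-k.

```rust
// Toy model only: none of these types exist in Lance or DataFusion.

/// (row id, squared L2 distance to the query vector)
type Row = (u64, f32);

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// KNN search against the base dataset.
    DatasetKnn { k: usize },
    /// KNN search over rows that only exist in the write-ahead log.
    WalKnn { k: usize },
    /// Concatenate the outputs of all child plans.
    Union(Vec<Plan>),
    /// Keep the k rows with the smallest distance.
    TopK { k: usize, input: Box<Plan> },
}

/// In-memory stand-ins for the two data sources.
struct Sources {
    dataset: Vec<(u64, Vec<f32>)>,
    wal: Vec<(u64, Vec<f32>)>,
}

/// Hypothetical extension hook: rewrite a dataset KNN so the WAL is searched too.
/// Any other plan is returned unchanged.
fn with_wal(plan: Plan) -> Plan {
    match plan {
        Plan::DatasetKnn { k } => Plan::TopK {
            k,
            input: Box::new(Plan::Union(vec![
                Plan::DatasetKnn { k },
                Plan::WalKnn { k },
            ])),
        },
        other => other,
    }
}

fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Brute-force KNN over one source, sorted by ascending distance.
fn knn(rows: &[(u64, Vec<f32>)], query: &[f32], k: usize) -> Vec<Row> {
    let mut out: Vec<Row> = rows.iter().map(|(id, v)| (*id, l2_sq(v, query))).collect();
    out.sort_by(|a, b| a.1.total_cmp(&b.1));
    out.truncate(k);
    out
}

/// Evaluate a plan tree against the in-memory sources.
fn execute(plan: &Plan, src: &Sources, query: &[f32]) -> Vec<Row> {
    match plan {
        Plan::DatasetKnn { k } => knn(&src.dataset, query, *k),
        Plan::WalKnn { k } => knn(&src.wal, query, *k),
        Plan::Union(children) => children
            .iter()
            .flat_map(|child| execute(child, src, query))
            .collect(),
        Plan::TopK { k, input } => {
            let mut rows = execute(input, src, query);
            rows.sort_by(|a, b| a.1.total_cmp(&b.1));
            rows.truncate(*k);
            rows
        }
    }
}
```

Because the rewrite is a plain function from plan to plan, it can be unit tested on the plan tree alone, without running a query. A real version would presumably be expressed as DataFusion plan nodes and optimizer rules rather than a closed enum.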

Tasks

jayzhan211 commented 3 weeks ago

I'm interested in improving DataFusion extensibility for Lance.

wjones127 commented 3 weeks ago

I think a good approach to this would be starting to design some logical plan nodes implementing UserDefinedLogicalNode.

From there, we can add a create_plan_v2() method that creates a logical plan instead of an ExecutionPlan. Later, we can add a physical planner and other pieces. This would keep the existing planning intact until we are ready to switch things over.
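As a rough illustration of that two-stage split, here is a standalone sketch. It deliberately does not use DataFusion's real `UserDefinedLogicalNode` or `ExecutionPlan` traits, whose signatures vary between versions. `LogicalNode`, `PhysicalNode`, `ScanRequest`, and `plan_physical` are hypothetical names, and this `create_plan_v2` is only a stand-in for the proposed method, not its actual signature.

```rust
// Toy model of "build a logical plan first, lower it to a physical plan later".
// These enums are assumptions for illustration, not Lance or DataFusion types.

#[derive(Debug, Clone, PartialEq)]
enum LogicalNode {
    Scan { columns: Vec<String> },
    Knn { column: String, k: usize, input: Box<LogicalNode> },
}

#[derive(Debug, Clone, PartialEq)]
enum PhysicalNode {
    FullScan { columns: Vec<String> },
    /// Uses a vector index on `column`; the index lookup replaces the scan below it.
    IndexedKnn { column: String, k: usize },
    /// Brute-force KNN over whatever `input` produces.
    FlatKnn { column: String, k: usize, input: Box<PhysicalNode> },
}

/// Hypothetical description of what the caller asked the scanner to do.
struct ScanRequest {
    columns: Vec<String>,
    /// Optional nearest-neighbor search: (vector column, k).
    nearest: Option<(String, usize)>,
}

/// Stage 1: describe *what* to compute, with no execution details.
fn create_plan_v2(req: &ScanRequest) -> LogicalNode {
    let scan = LogicalNode::Scan { columns: req.columns.clone() };
    match &req.nearest {
        Some((column, k)) => LogicalNode::Knn {
            column: column.clone(),
            k: *k,
            input: Box::new(scan),
        },
        None => scan,
    }
}

/// Stage 2: decide *how* to compute it, given which columns have a vector index.
fn plan_physical(node: &LogicalNode, indexed: &[&str]) -> PhysicalNode {
    match node {
        LogicalNode::Scan { columns } => PhysicalNode::FullScan { columns: columns.clone() },
        LogicalNode::Knn { column, k, input } => {
            if indexed.contains(&column.as_str()) {
                PhysicalNode::IndexedKnn { column: column.clone(), k: *k }
            } else {
                PhysicalNode::FlatKnn {
                    column: column.clone(),
                    k: *k,
                    input: Box::new(plan_physical(input, indexed)),
                }
            }
        }
    }
}
```

The appeal of the split is that each stage can be tested by comparing plan trees: the logical plan depends only on the request, while the index-versus-brute-force decision is isolated in the lowering step.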

> I'm interested in improving DataFusion extensibility for Lance.

Is there a particular goal you have in mind that you want to work towards?

jayzhan211 commented 3 weeks ago

> Is there a particular goal you have in mind that you want to work towards?

None yet. I would like to understand how Lance is built on DataFusion and which areas could be improved.

> I think a good approach to this would be starting to design some logical plan nodes implementing UserDefinedLogicalNode.

I could probably start with this: create a LogicalPlan for KNN search.

wjones127 commented 3 weeks ago

@jayzhan211 If you want a smaller issue to get started with, this might be a better one: https://github.com/lancedb/lance/issues/1927 It will also have a more immediate payoff.