pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Database-style indexes for efficient filtering? #7335

Open gwerbin opened 1 year ago

gwerbin commented 1 year ago

Problem description

Apologies if this is flatly out of scope or otherwise based on a misunderstanding of how Polars works.

I am working on an application where I will need to perform repeated range lookups on a Datetime column in a fairly large data frame, and I am thinking about using Polars for this project instead of Pandas.

I am concerned that these repeated range lookups could get very expensive when accumulated over the total running time of the application. Traditional relational databases like PostgreSQL offer several index types that can be used to avoid full table scans. And Pandas' "index-as-row-labels" system (at least in theory) can implement internal optimizations for these kinds of operations.

Understanding of course that Polars has chosen not to adopt Pandas' system, is there any plan to support some kind of external index system for Polars data frames? A hypothetical Python API could look something like this:

from datetime import datetime as DateTime
import polars as pl

data: pl.DataFrame = ...  # big data frame
t1 = DateTime(2022, 1, 1)
t2 = DateTime(2022, 3, 1)

result1 = data.filter(
    (pl.col('start_time') >= t1) & (pl.col('start_time') < t2)
)

start_time_index = pl.build_index(data, ['start_time'], method='btree')
result2 = data.filter(
    (start_time_index >= t1) & (start_time_index < t2)
)

I could see this also being helpful when working with geospatial data.

Of course, we should not try to optimize heavily without having measured/profiled things first. And a system like this would inherit all of the problems associated with database indexes (they need to be kept up to date, they don't help with low-cardinality columns, etc). But I'd be surprised if repeated full scans couldn't end up as a bottleneck in certain applications.

Is this something that would be considered in-scope for the Polars project? Or is this a niche enough use case that most users aren't expected to need it, and if needed would be expected to craft or bring in their own indexes (e.g. find a suitable B-tree or R-tree library in the host language)?

Edit: in the particular example above, it would make sense to sort this data by start_time. Would .sort('start_time').rechunk() be helpful for this? I imagine you could also cook something up with .groupby_dynamic or even .partition_by depending on the exact usage.

zundertj commented 1 year ago

Polars can already leverage the fact that data is sorted; see for instance this excellent blog post: https://braaannigan.github.io/software/2022/09/13/polars-with-set-sorted.html

MiguelAngelTorres commented 4 months ago

The blog post is great when filtering on a single column. Is there a way to make it work with multiple columns? For example, when using Polars to execute "<" and ">" operations to build a decision tree, you might want to query any column efficiently.

cmdlineluser commented 4 months ago

@MiguelAngelTorres I think that may fall under the "self non-equi join" category?

I don't think anything currently exists to help with that.