pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`selectors` should support slicing columns #15963

Open samukweku opened 6 months ago

samukweku commented 6 months ago

Description

Hi team. I would like to suggest adding a slice method to the selectors module, so that users can select a contiguous range of columns by name:

import polars as pl
import polars.selectors as cs

data = {'City': ['Houston', 'Austin', 'Hoover'],
 'State': ['Texas', 'Texas', 'Alabama'],
 'Name': ['Aria', 'Penelope', 'Niko'],
 'Mango': [4, 10, 90],
 'Orange': [10, 8, 14],
 'Watermelon': [40, 99, 43],
 'Gin': [16, 200, 34],
 'Vodka': [20, 33, 18]}

df = pl.DataFrame(data)

df

┌─────────┬─────────┬──────────┬───────┬────────┬────────────┬─────┬───────┐
│ City    ┆ State   ┆ Name     ┆ Mango ┆ Orange ┆ Watermelon ┆ Gin ┆ Vodka │
│ ---     ┆ ---     ┆ ---      ┆ ---   ┆ ---    ┆ ---        ┆ --- ┆ ---   │
│ str     ┆ str     ┆ str      ┆ i64   ┆ i64    ┆ i64        ┆ i64 ┆ i64   │
╞═════════╪═════════╪══════════╪═══════╪════════╪════════════╪═════╪═══════╡
│ Houston ┆ Texas   ┆ Aria     ┆ 4     ┆ 10     ┆ 40         ┆ 16  ┆ 20    │
│ Austin  ┆ Texas   ┆ Penelope ┆ 10    ┆ 8      ┆ 99         ┆ 200 ┆ 33    │
│ Hoover  ┆ Alabama ┆ Niko     ┆ 90    ┆ 14     ┆ 43         ┆ 34  ┆ 18    │
└─────────┴─────────┴──────────┴───────┴────────┴────────────┴─────┴───────┘

The slicing syntax could be:

df.select(cs.slice('Mango','Vodka')) # alternative - df.select(cs['Mango':'Vodka'])
shape: (3, 5)
┌───────┬────────┬────────────┬─────┬───────┐
│ Mango ┆ Orange ┆ Watermelon ┆ Gin ┆ Vodka │
│ ---   ┆ ---    ┆ ---        ┆ --- ┆ ---   │
│ i64   ┆ i64    ┆ i64        ┆ i64 ┆ i64   │
╞═══════╪════════╪════════════╪═════╪═══════╡
│ 4     ┆ 10     ┆ 40         ┆ 16  ┆ 20    │
│ 10    ┆ 8      ┆ 99         ┆ 200 ┆ 33    │
│ 90    ┆ 14     ┆ 43         ┆ 34  ┆ 18    │
└───────┴────────┴────────────┴─────┴───────┘

aut0clave commented 6 months ago

If you know what fields you want, why do you need a selector? Why not use a simple .select("Mango","Vodka")? Or the existing cs.by_name("Mango","Vodka")?

cmdlineluser commented 6 months ago

@aut0clave They want to extract the "range of columns" Mango .. Vodka

I believe first/last are the only selectors that are "positional"

>>> cs.first().meta.serialize()
'{"Nth":0}'

There is no .nth() selector, but it would be easy to add:

>>> df.select( pl.Expr.deserialize( io.StringIO("""{"Nth":3}""") ) )
shape: (3, 1)
┌───────┐
│ Mango │
│ ---   │
│ i64   │
╞═══════╡
│ 4     │
│ 10    │
│ 90    │
└───────┘

nth -> column name mapping is done here:

https://github.com/pola-rs/polars/blob/4b23768a7e0b50e39a0c5df8e33321e9b94b6387/crates/polars-plan/src/logical_plan/expr_expansion.rs#L67

From what I can tell, there is nothing that goes the other way, i.e. column name -> nth - which I think would be needed in order to support this at the selector level?

samukweku commented 6 months ago

@cmdlineluser I'd assume there was a way to get the positions of the column names (maybe grab the positions via list.index from Python and pass them to the Rust end). I don't know much about the internal implementation, but I'm happy to learn. I'd also suggest, if the team feels this is a worthwhile addition, that the slicing be limited to column names only (numeric positions should not be supported).
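To make the list.index idea concrete, here is a minimal pure-Python sketch of the name -> position lookup on an already-known list of column names. The helper name `label_slice` is hypothetical and is not part of Polars; it only works when the column order is known up front, which holds for an eager DataFrame.

```python
from typing import List


def label_slice(columns: List[str], start: str, stop: str) -> List[str]:
    """Return the column names from `start` to `stop`, both inclusive.

    Hypothetical helper (not a Polars API). list.index raises ValueError
    if either label is missing from `columns`.
    """
    i = columns.index(start)
    j = columns.index(stop)
    if i > j:
        raise ValueError(f"{start!r} comes after {stop!r} in the column order")
    return columns[i : j + 1]


columns = ["City", "State", "Name", "Mango", "Orange", "Watermelon", "Gin", "Vodka"]
picked = label_slice(columns, "Mango", "Vodka")
print(picked)
```

On an eager frame the result could then be passed straight to `df.select(label_slice(df.columns, "Mango", "Vodka"))`.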

alexander-beedie commented 6 months ago

> @cmdlineluser I'd assume there was a way to get the positions of the column names (maybe grab the positions via list.index from Python and pass them to the Rust end).

FYI: until we actually evaluate a lazy query plan, we may not know the positions of all of the columns (e.g. when expanding a struct, or evaluating earlier selectors). Consequently we can't precompute the positions and pass them down, because the answer is only known at the lower level (selectors are dynamic, evaluating internally at the point they are invoked) ;)
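A toy illustration of that point, in plain Python rather than Polars internals: positions computed against the original column list go stale once an earlier step changes the columns, whereas a selector that resolves labels at the point of use does not. The names `label_range` and `Selector` are hypothetical and only model the idea.

```python
from typing import Callable, List

Schema = List[str]
Selector = Callable[[Schema], List[str]]


def label_range(start: str, stop: str) -> Selector:
    """Hypothetical deferred selector: positions are looked up only
    when the selector is applied to a concrete schema."""

    def resolve(schema: Schema) -> List[str]:
        i, j = schema.index(start), schema.index(stop)
        return schema[i : j + 1]

    return resolve


original = ["City", "State", "Name", "Mango", "Orange", "Watermelon", "Gin", "Vodka"]

# Positions precomputed against the original schema: (3, 7).
precomputed = (original.index("Mango"), original.index("Vodka"))

# An earlier step in the query drops two columns, shifting every position.
later = [c for c in original if c not in ("City", "State")]

stale = later[precomputed[0] : precomputed[1] + 1]  # uses outdated positions
fresh = label_range("Mango", "Vodka")(later)        # resolved at point of use

print(stale)
print(fresh)
```

The precomputed positions silently select the wrong columns, which is why the lookup has to happen where the schema is actually known.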

Offering index-based selection doesn't seem like a bad idea (we currently only support selection by name/dtype and the special cases of first/last, as noted by @cmdlineluser), but would need some internal additions to be possible 🤔

samukweku commented 6 months ago

@cmdlineluser so something like cs.by_position, cs.by_range?

cmdlineluser commented 6 months ago

@alexander-beedie is the person to ask. (they created selectors :-D)

alexander-beedie commented 6 months ago

> @cmdlineluser so something like cs.by_position, cs.by_range?

Probably cs.by_index, which would take one or more index values, a range, or a slice (as range/slice can be directly expanded into a list of indexes, so internally we just need to handle that). Does need additional low-level support though.
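The "expand into a list of indexes" step can be sketched in plain Python as follows. This is only an illustration of the idea, not the actual Polars implementation; `expand_indices` and `IndexSpec` are hypothetical names, and the handling of negative indices is an assumption.

```python
from typing import List, Union

IndexSpec = Union[int, range, slice]


def expand_indices(n_columns: int, *specs: IndexSpec) -> List[int]:
    """Flatten ints, ranges and slices into one list of column positions.

    Hypothetical helper. Slices are clipped to `n_columns` via
    slice.indices; negative ints count from the end (an assumption).
    """
    out: List[int] = []
    for spec in specs:
        if isinstance(spec, slice):
            out.extend(range(*spec.indices(n_columns)))
        elif isinstance(spec, range):
            out.extend(spec)
        else:
            out.append(spec if spec >= 0 else n_columns + spec)
    return out


positions = expand_indices(8, 0, range(3, 5), slice(5, None))
print(positions)
```

Once everything is a flat list of integers, the lower level only needs to support a single "select these positions" operation.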

alexander-beedie commented 4 months ago

FYI, forgot to update this issue, but we do now have a new cs.by_index selector which can take indices and ranges, which gets you some of the way there: https://github.com/pola-rs/polars/pull/16217

samukweku commented 4 months ago

Thanks @alexander-beedie. Looks good. Safe to assume that slicing with labels may be implemented at a future date?

alexander-beedie commented 4 months ago

> Thanks @alexander-beedie. Looks good. Safe to assume that slicing with labels may be implemented at a future date?

Probably, but no timeline; the 1.0 (and a few quick point releases to address any related issues) has priority at the moment. And I'm on vacation for the next two weeks ;)