ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

feat(selectors): preserve order when using the `c` selector #10013

Open cpcloud opened 2 weeks ago

cpcloud commented 2 weeks ago

Right now, no matter what kind of selector is being used, the expanded columns are in the order of the table they are matched against.

For selectors that have no obvious column order when they are specified, which I believe is most of them, this makes sense.

I'm wondering whether we should try to preserve order when the c selector is in the mix.

Here's an example of what currently happens:

In [1]: from ibis.interactive import *

In [2]: t = ibis.memtable({"x": ['a'], "y": ['b']})

In [3]: t
Out[3]:
┏━━━━━━━━┳━━━━━━━━┓
┃ x      ┃ y      ┃
┡━━━━━━━━╇━━━━━━━━┩
│ string │ string │
├────────┼────────┤
│ a      │ b      │
└────────┴────────┘

In [4]: t.select(s.c('y', 'x'))
Out[4]:
┏━━━━━━━━┳━━━━━━━━┓
┃ x      ┃ y      ┃
┡━━━━━━━━╇━━━━━━━━┩
│ string │ string │
├────────┼────────┤
│ a      │ b      │
└────────┴────────┘

This came up while writing TPC-DS query 18, where I wanted to use across with the c selector in the agg method. I specified the columns in a different order from the input table, so the output order wasn't what I expected.
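
To make the two behaviors concrete, here is a rough pure-Python model of the semantics only. It is not how ibis expands selectors internally, and `expand_table_order` and `expand_selector_order` are made-up names for illustration:

```python
def expand_table_order(table_columns, selected):
    """Model of the current behavior: matches come back in table order."""
    wanted = set(selected)
    return [col for col in table_columns if col in wanted]


def expand_selector_order(table_columns, selected):
    """Model of the proposed behavior: matches come back in the order given."""
    available = set(table_columns)
    return [col for col in selected if col in available]


columns = ["x", "y"]  # column order of the memtable above

print(expand_table_order(columns, ["y", "x"]))     # ['x', 'y']
print(expand_selector_order(columns, ["y", "x"]))  # ['y', 'x']
```

The first function reproduces the output shown in Out[4]; the second is the order I expected to get.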

Let's avoid discussing implementation until we decide we want to do this.

gforsyth commented 2 weeks ago

I can understand wanting the built-in relocate functionality as a nice shorthand.

Would we attempt to also allow this style of ordering when the c selector is combined with others?

>>> t.select(s.c("body_mass_g") | s.numeric())
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┓
┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ year  ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━┩
│ float64        │ float64       │ int64             │ int64       │ int64 │
├────────────────┼───────────────┼───────────────────┼─────────────┼───────┤
...

Should that bump body_mass_g to the front?

cpcloud commented 2 weeks ago

I'm going to see what dplyr does.

cpcloud commented 2 weeks ago

It looks like dplyr takes an eager approach and short-circuits the selector match, so the output order follows the first matching selector:

setup

> library(dplyr)
> t <- as_tibble(data.frame(x=c('a'), y=c('b')))
> t
# A tibble: 1 × 2
  x     y
  <chr> <chr>
1 a     b

c(y, x)

> t |> select(c(y, x))
# A tibble: 1 × 2
  y     x
  <chr> <chr>
1 b     a

This output matches the selector order.

!where(is.numeric)

!where(is.numeric) matches both columns first, so the table's column order is preserved.

> t |> select(!where(is.numeric) | c(y, x))
# A tibble: 1 × 2
  x     y
  <chr> <chr>
1 a     b
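
One way to model the behavior observed above, as a sketch in plain Python rather than anything taken from dplyr or ibis: treat each selector as an ordered list of matches and let a union keep the first occurrence of each column. `union_first_match` is a hypothetical helper name.

```python
def union_first_match(*expansions):
    """Union of ordered column lists, keeping each column's first occurrence."""
    seen = set()
    ordered = []
    for expansion in expansions:
        for col in expansion:
            if col not in seen:
                seen.add(col)
                ordered.append(col)
    return ordered


# Assumed expansions for the two-column tibble above.
non_numeric = ["x", "y"]  # !where(is.numeric), in table order
explicit = ["y", "x"]     # c(y, x), in the order written

print(union_first_match(non_numeric, explicit))  # ['x', 'y']

# The penguins question, under the same model.
numeric = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "year"]
print(union_first_match(["body_mass_g"], numeric))
# ['body_mass_g', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'year']
```

Under this model, s.c("body_mass_g") | s.numeric() would bump body_mass_g to the front, while putting the numeric selector first would leave the table order unchanged.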

gforsyth commented 1 week ago

I think @jcrist had some thoughts on this one