ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.35k stars, 600 forks

feat: `topk` table expression #9540

Open ianmcook opened 4 months ago

ianmcook commented 4 months ago

Is your feature request related to a problem?

As described in a StackOverflow question, filtering a table to return the row(s) with the largest value(s) in each group feels harder in Ibis than in pandas. I wonder if Ibis could add some syntactic sugar to make this easier.

Describe the solution you'd like

dplyr has a function top_n() that makes this simpler syntactically:

```r
df <- data.frame(
  country = c('India', 'India', 'India', 'United States', 'United States', 'United States', 'China', 'China', 'China'),
  city = c('Bangalore', 'Delhi', 'Mumbai', 'Los Angeles', 'New York', 'Chicago', 'Shanghai', 'Guangzhou', 'Beijing'),
  population = c(8443675, 11034555, 12442373, 3820914, 8258035, 2664452, 24281400, 13858700, 19164000)
)

library(dplyr)

df |> group_by(country) |> top_n(1, wt = population)
```

I wonder if we could add something like that in Ibis? Ibis already has a topk function, but it's a vector function, not a table function. Maybe Ibis could add a topk table function that translates into an operation like this?

What version of ibis are you running?

9.1.0

What backend(s) are you using, if any?

DuckDB

deepyaman commented 4 months ago

We did explore exposing a shorthand previously (see https://github.com/ibis-project/ibis/issues/8574), but decided to just document a workaround until there was a request from the community. The documented solution (see https://ibis-project.org/tutorials/ibis-for-sql-users.html#top-k-operations) is quite similar to what you shared on the StackOverflow link. Agree that it's much more verbose than pandas.

This looks like something we should be able to implement a convenience wrapper for, though.

jitingxu1 commented 3 months ago

I could work on this one, if we want to have it. @cpcloud @jcrist

cpcloud commented 3 months ago

@jitingxu1 Can you describe the approach you're thinking about a bit?

jitingxu1 commented 3 months ago

I just went through the history but haven't had the chance to fully think it through yet. I've got a lot on my plate this week, so I might need to step back from this for now. If it's still available later, I can take a look then. @cpcloud