feat: `topk` table expression

ianmcook commented 4 months ago

Is your feature request related to a problem?

As described on Stack Overflow, filtering a table to return the row(s) with the largest value(s) in each group feels harder in Ibis than in pandas. I wonder if Ibis could add some syntactic sugar to make this easier.

Describe the solution you'd like

dplyr has a function `top_n()` that makes this simpler syntactically:

```r
df <- data.frame(
  country = c('India', 'India', 'India', 'United States', 'United States', 'United States', 'China', 'China', 'China'),
  city = c('Bangalore', 'Delhi', 'Mumbai', 'Los Angeles', 'New York', 'Chicago', 'Shanghai', 'Guangzhou', 'Beijing'),
  population = c(8443675, 11034555, 12442373, 3820914, 8258035, 2664452, 24281400, 13858700, 19164000)
)

library(dplyr)

df |> group_by(country) |> top_n(1, wt = population)
```

I wonder if we could add something like that in Ibis? Ibis already has a topk function, but it's a vector function, not a table function. Maybe Ibis could add a topk table function that translates into an operation like this?

What version of ibis are you running?

9.1.0

What backend(s) are you using, if any?

DuckDB

Code of Conduct

[X] I agree to follow this project's Code of Conduct

deepyaman commented 4 months ago

We did explore exposing a shorthand previously (see https://github.com/ibis-project/ibis/issues/8574), but decided to just document a workaround until there was a request from the community. The documented solution (see https://ibis-project.org/tutorials/ibis-for-sql-users.html#top-k-operations) is quite similar to what you shared in the Stack Overflow link. I agree that it's much more verbose than pandas.

This looks like something we should be able to implement a convenience wrapper for, though.

jitingxu1 commented 3 months ago

I could work on this one, if we want to have it. @cpcloud @jcrist

cpcloud commented 3 months ago

@jitingxu1 Can you briefly describe the approach you have in mind?

jitingxu1 commented 3 months ago

I just went through the history but haven't had the chance to fully think it through yet. I've got a lot on my plate this week, so I might need to step back from this for now. If it's still available later, I can take a look then. @cpcloud
