h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Update groupby-juliadf.jl #212

Closed: monopolynomial closed this 3 years ago

monopolynomial commented 3 years ago

This is similar to the pandas code for the task, and it's faster than the current implementation.

bkamins commented 3 years ago

If we allow splitting the operation into two steps, then this should be faster (and have the correct output structure, since the proposed one has two extra columns):

select!(combine(groupby(x, :id3), :v1 => maximum∘skipmissing => :v1, :v2 => minimum∘skipmissing => :v2),
        :id3, [:v1, :v2] => ((v1, v2) -> v1 - v2) => :range_v1_v2)

However, the question is whether we want to allow this, since other solutions would probably also benefit from a similar change.

jangorecki commented 3 years ago

Yes, for data.table that would be a big improvement. The goal of question 7 in groupby is to stress a complex expression evaluated by group, so decomposing it into simple expressions is not desirable. pandas, dask and polars (fyi @ritchie46) are currently using simple expressions; that should be amended wherever possible. Thanks for bringing that up. I think we can close this PR, and I will file an issue about adjusting the mentioned solutions.
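To make the distinction concrete, here is a minimal sketch in plain Python, assuming question 7 computes max(v1) - min(v2) per id3 with missing values skipped. The toy rows and the function names are hypothetical and are not taken from any benchmark script. It shows only the shape of the two approaches, not their relative speed in any of the benchmarked tools.

```python
from collections import defaultdict

# Hypothetical toy rows (id3, v1, v2); None stands in for a missing value.
rows = [
    ("a", 5, 1),
    ("a", 3, None),
    ("b", None, 4),
    ("b", 7, 2),
    ("b", 9, 6),
]


def group_rows(rows):
    """Collect the non-missing v1 and v2 values for each id3."""
    groups = defaultdict(lambda: ([], []))
    for id3, v1, v2 in rows:
        if v1 is not None:
            groups[id3][0].append(v1)
        if v2 is not None:
            groups[id3][1].append(v2)
    return groups


def range_single_expression(rows):
    """One complex expression evaluated per group: max(v1) - min(v2)."""
    return {id3: max(v1) - min(v2) for id3, (v1, v2) in group_rows(rows).items()}


def range_two_steps(rows):
    """Decomposed variant: simple aggregates first, then a subtraction."""
    groups = group_rows(rows)
    max_v1 = {id3: max(v1) for id3, (v1, _) in groups.items()}
    min_v2 = {id3: min(v2) for id3, (_, v2) in groups.items()}
    return {id3: max_v1[id3] - min_v2[id3] for id3 in groups}
```

Both functions return the same result, so the choice between them affects only what the benchmark measures: the first keeps the arithmetic inside the per-group expression, while the second reduces the grouped work to two simple aggregates.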