rafaqz / DimensionalData.jl

Named dimensions and indexing for julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License

Easiest way to combine after `groupby` #874

Open tiemvanderdeure opened 23 hours ago

tiemvanderdeure commented 23 hours ago

I think groupby already has some very nice functionality, but I tend to get stuck on the combining part of it.

Typically when I use this I have some 3D spatiotemporal data, and I want to summarize over the time dimension but leave the X and Y dimensions as they are. So I do something like

using DimensionalData, Dates, Statistics

mydata = rand(X(1:100), Y(1:100), Ti(Date(2000,1,1):Day(1):Date(2005,1,1)))
groupeddata = groupby(mydata, Ti => yearmonth)
groupedmean = mean.(groupeddata; dims = Ti)

At this point I want to get rid of the nested structure and get a 3D DimArray with X-Y-Ti dimensions again. In this case, this works:

cat(groupedmean...; dims = Ti)

But if I have too many groups to splat, I need some kind of `reduce`, and it gets more complicated than I feel it has to be.
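For many groups, one splat-free alternative is a pairwise `reduce` over the same `cat`. This is only a sketch (repeating the setup above for completeness), and it assumes DimensionalData's `cat` accepts two `DimArray`s with `dims = Ti`:

```julia
using DimensionalData, Dates, Statistics

mydata = rand(X(1:100), Y(1:100), Ti(Date(2000,1,1):Day(1):Date(2005,1,1)))
groupedmean = mean.(groupby(mydata, Ti => yearmonth); dims = Ti)

# Pairwise concatenation avoids splatting thousands of arguments,
# at the cost of O(n^2) copying for n groups.
combined = reduce((a, b) -> cat(a, b; dims = Ti), groupedmean)
```

This sidesteps the splat limit but is still quadratic in the number of groups, which is part of why a dedicated `combine` would be nicer.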

Another attempt is to include the X and Y dimensions in the groupby, like so:

groupeddata = groupby(mydata, Ti => yearmonth, X => identity, Y => identity)
mean.(groupeddata)

But for some reason this is a lot slower than the first example (~0.5 s, about 10 times slower).

I think this is a common use case, but none of the examples in the documentation show what to do here.

Can we agree on a "proper" way to do this, and add an example to the documentation?

asinghvi17 commented 23 hours ago

The "traditional" way to do this is DataAPI.combine, as in DataFrames. Could we override that in DD? That could forward to reduce internally, and also leave the door open for combine-time functions like https://dataframes.juliadata.org/stable/man/split_apply_combine/#:~:text=julia%3E%20combine(iris_gdf%2C%20nrow%2C%20%3APetalLength%20%3D%3E%20mean%20%3D%3E%20%3Amean)

tiemvanderdeure commented 22 hours ago

There is also Rasters.combine. Maybe we could implement something similar that combines any DimArray of DimArrays with identical dimensions.

tiemvanderdeure commented 22 hours ago

But then I think that if we can do this without `combine`, that would be even better. I think the `X => identity` version is slow because all the child arrays get singleton dimensions, and 1×1×n arrays are slower to sum over than n-length vectors.
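That guess is easy to probe in plain Julia, independent of DimensionalData: reducing a 1×1×n array with `dims` goes through the generic dimensional-reduction machinery, while reducing an n-vector takes the plain fast path. A minimal check (timings are machine-dependent, so only the values are compared here):

```julia
using Statistics

v = rand(10_000)
A = reshape(v, 1, 1, :)       # singleton leading dims, like the grouped slices

m_vec = mean(v)               # plain reduction over a vector
m_arr = mean(A; dims = 3)[1]  # dimensional reduction over 1×1×n, then unwrap

# Both give the same value; timing them (e.g. @time, or BenchmarkTools' @btime)
# is what exposes the difference between the two code paths.
```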

Would it be possible to make the behaviour such that, if a dimension is provided without a function, we slice over that dimension? Then this would work and be fast, and there would be no need to combine at all:

groupeddata = groupby(mydata, Ti => yearmonth, X, Y)
mean.(groupeddata)

rafaqz commented 21 hours ago

I have a pretty solid plan to define a new combine method:

https://github.com/rafaqz/DimensionalData.jl/issues/865#issuecomment-2503684558

It will also accept lazy GroupBy objects.

combine(mean, GroupBy(A, args...; kw...))

Or

combine(mean, groupby(A, args...; kw...))

Just need the time to do it, and there are a lot of unmerged Rasters PRs to finish first

(I'm trying to avoid hacks, so broadcast should just do what it always does over an array of arrays. combine lets us define our own clean semantics and also handle laziness)
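To make the intended shape concrete, here is a hypothetical sketch of what such a `combine` could lower to. `combine_sketch` and the pairwise `cat` are illustrative assumptions only, not the planned implementation from #865:

```julia
using DimensionalData, Statistics

# Hypothetical sketch: apply the reducer within each group, then
# reassemble the per-group results along the grouped dimension.
function combine_sketch(f, grouped; dims = Ti)
    reduced = [f(g; dims = dims) for g in grouped]  # one slice per group
    # A real implementation would also rebuild the lookup of `dims`
    # from the group keys (e.g. the yearmonth values), and stay lazy.
    reduce((a, b) -> cat(a, b; dims = dims), reduced)
end

# e.g. combine_sketch(mean, groupby(A, Ti => yearmonth))
```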

tiemvanderdeure commented 20 hours ago

I wasn't thinking about any broadcast hacks; I was just thinking of defaulting a bare dimension to the `dim => identity` pair in the constructor, and improving the performance of big `DimGroupByArray`s. I looked into it a bit and the flame graph is all blue, but I think the `OpaqueArray` and `DimSlices` setup slows down iteration, which is why `mean` was much slower than expected here.

But a good combine would also be a great solution.

rafaqz commented 20 hours ago

Right probably the opaque array just needs to forward a few methods to parent.

The reason to have it is that nested dim arrays create a lot of problems, but an AbstractBasicDimArray doesn't have to have a parent.