Open tiemvanderdeure opened 23 hours ago
The "traditional" way to do this is DataAPI.combine, as in DataFrames. Could we override that in DD? That could forward to reduce internally, and also leave the door open for combine-time functions like https://dataframes.juliadata.org/stable/man/split_apply_combine/#:~:text=julia%3E%20combine(iris_gdf%2C%20nrow%2C%20%3APetalLength%20%3D%3E%20mean%20%3D%3E%20%3Amean)
There is also Rasters.combine. Maybe we could implement something similar that combines any DimArray of DimArrays with identical dimensions.
But if we can do this without combine, that would be even better. I think X => identity is slow because all the child arrays get singleton dimensions, and 1x1xn arrays are slower to sum over than n-length vectors.
Would it be possible to change the behaviour so that if a Dimension is provided without a function, we slice over that dimension? Then this would work and be fast, and there would be no need to combine at all:
groupeddata = groupby(mydata, Ti => yearmonth, X, Y)
mean.(groupeddata)
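For comparison, here is a minimal sketch of the spelling that exists today, where X and Y are grouped with an explicit identity function. It assumes a recent DimensionalData.jl that exports groupby. The array A, its size, and the use of month rather than yearmonth are made up for illustration. The bare-dimension form above is a proposal and is not used here.

```julia
using DimensionalData, Dates, Statistics

# Hypothetical data: one year of daily values on a 4x3 grid.
times = DateTime(2000, 1, 1):Day(1):DateTime(2000, 12, 31)
A = rand(X(1:4), Y(1:3), Ti(times))

# Group time by month and keep every X and Y value as its own group.
# Each child array then has singleton X and Y dimensions.
groups = groupby(A, Ti => month, X => identity, Y => identity)

# Broadcasting mean over the groups gives one scalar per (month, x, y) cell,
# so the result is already a flat DimArray and nothing has to be combined.
monthly = mean.(groups)
```

This is the variant reported as slow in this thread, so it is a baseline to compare against rather than a recommendation.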
I have a pretty solid plan to define a new combine method:
https://github.com/rafaqz/DimensionalData.jl/issues/865#issuecomment-2503684558
It will also accept lazy GroupBy objects.
combine(mean, GroupBy(A, args...; kw...))
Or
combine(mean, groupby(A, args...; kw...))
Just need the time to do it, and there are a lot of unmerged Rasters PRs to finish first.
(I'm trying to avoid hacks, so broadcast should just do what it always does over an array of arrays. combine lets us define our own clean semantics and also handle laziness.)
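To make the intended semantics concrete, here is a rough sketch on plain arrays. combine_sketch is a hypothetical name, not the planned DimensionalData method. It assumes each group is an ordinary array and that f accepts a dims keyword, as mean does.

```julia
using Statistics

# Hypothetical stand-in for the planned combine: reduce every group along
# `dims` with `f`, then concatenate the reduced pieces along the same axis.
function combine_sketch(f, groups; dims::Int)
    reduced = (f(g; dims=dims) for g in groups)
    # Pairwise reduce instead of splatting, so the call does not depend on
    # passing every group as a separate argument.
    return reduce((a, b) -> cat(a, b; dims=dims), reduced)
end

# Two "time groups" on a 2x2 grid, with 3 and 2 time steps.
g1 = reshape(collect(1.0:12.0), 2, 2, 3)
g2 = reshape(collect(1.0:8.0), 2, 2, 2)

out = combine_sketch(mean, [g1, g2]; dims=3)
size(out)  # (2, 2, 2): the grid is kept, one slice per group
```

The pairwise cat copies intermediate results repeatedly, so a real implementation would more likely preallocate the output and fill it group by group.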
I wasn't thinking about doing any broadcast hacks. I was just thinking of defaulting a bare dimension to the dim => identity pair in the constructor, and improving the performance of big DimGroupByArrays. I looked into it a bit and the flame graph is all blue, but I think the OpaqueArray and DimSlices setup slows down the iteration, which is why mean was much slower than expected here.
But a good combine would also be a great solution.
Right, probably the opaque array just needs to forward a few methods to parent.
The reason to have it is that nested dim arrays create a lot of problems, but an AbstractBasicDimArray doesn't have to have a parent.
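As an illustration of what forwarding to parent can look like, here is a sketch with a hypothetical wrapper type. It is not DimensionalData's OpaqueArray, and which methods would actually need forwarding there is an assumption. The point is only that iterate and sum can hand off to the parent instead of going through the generic AbstractArray fallbacks.

```julia
# Hypothetical wrapper, standing in for an opaque array around some parent.
struct ForwardingWrapper{T,N,P<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::P
end

Base.parent(A::ForwardingWrapper) = A.parent

# Minimal AbstractArray interface.
Base.size(A::ForwardingWrapper) = size(parent(A))
Base.getindex(A::ForwardingWrapper{T,N}, I::Vararg{Int,N}) where {T,N} = parent(A)[I...]

# Forwarded methods: reuse the parent's own iteration and reduction code
# instead of the generic fallbacks that index the wrapper element by element.
Base.iterate(A::ForwardingWrapper) = iterate(parent(A))
Base.iterate(A::ForwardingWrapper, state) = iterate(parent(A), state)
Base.sum(A::ForwardingWrapper; kw...) = sum(parent(A); kw...)

# A 1x1x6 parent, like the singleton-dimension child arrays discussed above.
w = ForwardingWrapper(reshape(collect(1.0:6.0), 1, 1, 6))
sum(w)       # 21.0, computed on the parent Array
foldl(+, w)  # 21.0, walks the wrapper through the forwarded iterate
```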
I think groupby already has some very nice functionality, but I tend to get stuck on the combining part of it. Typically when I use this I have some 3D spatiotemporal data, and I want to summarize over the time dimension, but leave the X and Y dimensions as they are. So I do something like
At this point I want to get rid of the nested structure and get a 3D DimArray with X-Y-Ti dimensions again. In this case, this works:
But if I have too many groups to splat, then I need some reduce, and it gets more complicated than I feel it has to be. Another attempt is to include the X and Y dimensions in the groupby, like so
But this for some reason is a lot slower than the first example (~0.5s, or 10 times slower).
I think this is a common use case, but none of the examples in the documentation show what to do here.
Can we agree on a "proper" way to do this, and add an example to the documentation?