rafaqz / DimensionalData.jl

Named dimensions and indexing for julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License
262 stars 38 forks source link

friction in dimension-specific `mean` of a `groupby` #745

Open haakon-e opened 1 week ago

haakon-e commented 1 week ago

Suppose I define

julia> times = Ti(Date(2000,1):Month(1):Date(2003,12))

julia> xs = X(1:3)

julia> A = rand(xs, times)
╭──────────────────────────╮
│ 3×48 DimArray{Float64,2} │
├──────────────────────────┴──────────────────────────────────────────────────────────────── dims ┐
  ↓ X  Sampled{Int64} 1:3 ForwardOrdered Regular Points,
  → Ti Sampled{Date} Date("2000-01-01"):Month(1):Date("2003-12-01") ForwardOrdered Regular Points
└─────────────────────────────────────────────────────────────────────────────────────────────────┘
 ↓ →   2000-01-01   2000-02-01  …   2003-11-01   2003-12-01
 1    0.666398             0.484909                0.168768             0.280385
 2    0.00175533           0.17132                 0.323891             0.717875
 3    0.116884             0.503126                0.927827             0.33975

Then I want to groupby and average over years. If I just do "vanilla" mean, I get a DimArray back:

julia> mean.(groupby(A, Ti => year))
╭───────────────────────────────╮
│ 4-element DimArray{Float64,1} │
├───────────────────────────────┴──────────────────────────────────────── dims ┐
  ↓ Ti Sampled{Int64} [2000, 2001, 2002, 2003] ForwardOrdered Irregular Points
├──────────────────────────────────────────────────────────────────── metadata ┤
  Dict{Symbol, Any} with 1 entry:
  :groupby => :Ti=>year
└──────────────────────────────────────────────────────────────────────────────┘
 2000  0.500896
 2001  0.605857
 2002  0.450127
 2003  0.52983

However, this also averages over X. So instead, I specify mean(A, dims=Ti)

julia> B = mean.(groupby(A, Ti => year), dims = Ti)
╭──────────────────────────────────────────────────╮
│ 4-element DimGroupByArray{DimArray{Float64,2},1} │
├──────────────────────────────────────────────────┴───────────────────── dims ┐
  ↓ Ti Sampled{Int64} [2000, 2001, 2002, 2003] ForwardOrdered Irregular Points
├──────────────────────────────────────────────────────────────────── metadata ┤
  Dict{Symbol, Any} with 1 entry:
  :groupby => :Ti=>year
├────────────────────────────────────────────────────────────────── group dims ┤
  ↓ X, → Ti
└──────────────────────────────────────────────────────────────────────────────┘
 2000  3×1 DimArray
 2001  3×1 DimArray
 2002  3×1 DimArray
 2003  3×1 DimArray

This keeps the output as a DimGroupByArray. I feel I should get a DimArray back here?

The extra step I need to do to get a DimArray is cat:

julia> cat(B..., dims=Ti)
╭─────────────────────────╮
│ 3×4 DimArray{Float64,2} │
├─────────────────────────┴──────────────────────────────────────────────── dims ┐
  ↓ X  Sampled{Int64} 1:3 ForwardOrdered Regular Points,
  → Ti Sampled{Date} [2000-07-01, …, 2003-07-01] ForwardOrdered Irregular Points
└────────────────────────────────────────────────────────────────────────────────┘
 ↓ →   2000-07-01   2001-07-01   2002-07-01   2003-07-01
 1    0.543548             0.644249             0.412775             0.51378
 2    0.48329              0.550993             0.443599             0.579192
 3    0.475849             0.62233              0.494006             0.496519

I wonder if this last step could be baked into the reduction step? If not, perhaps we could think of some way of documenting how to get back a DimArray from a DimGroupByArray (e.g. using cat).

rafaqz commented 1 week ago

Not much we can do with a broadcast. Maybe we need a dedicated combine function for this.

combine(mean, groupby(A, Ti => year))

We can store the time dim in DimGroupByArray so we know to pass it to dims in mean?

Or maybe combine accepts keywords that get passed to the function

haakon-e commented 1 week ago

Not much we can do with a broadcast. Maybe we need a dedicated combine function for this.

I don't quite understand this. Why then does mean.(groupby(A, Ti => year)) return a DimArray instead of DimGroupByArray?

Alternatively, if we define mean(B::DimGroupByArray) directly, I should be able to do what I expect.

In fact, currently doing mean(B) somewhat confusingly (as I see it) averages across groups. Perhaps that's worth a separate issue:

details ```julia julia> xs = X(1:3) X 1:3 julia> tim = Ti(Dates.Date("2005-11-01"):Dates.Month(1):Dates.Date("2006-02-01")) Ti Date("2005-11-01"):Month(1):Date("2006-02-01") julia> AA = DimArray(reshape(1.:length(xs)*length(tim), length(xs), :), (xs,tim)) ╭─────────────────────────╮ │ 3×4 DimArray{Float64,2} │ ├─────────────────────────┴───────────────────────────────────────────────────────────────── dims ┐ ↓ X Sampled{Int64} 1:3 ForwardOrdered Regular Points, → Ti Sampled{Date} Date("2005-11-01"):Month(1):Date("2006-02-01") ForwardOrdered Regular Points └─────────────────────────────────────────────────────────────────────────────────────────────────┘ ↓ → 2005-11-01 2005-12-01 2006-01-01 2006-02-01 1 1.0 4.0 7.0 10.0 2 2.0 5.0 8.0 11.0 3 3.0 6.0 9.0 12.0 julia> B = groupby(AA, Ti => year) ╭──────────────────────────────────────────────────╮ │ 2-element DimGroupByArray{DimArray{Float64,1},1} │ ├──────────────────────────────────────────────────┴───────── dims ┐ ↓ Ti Sampled{Int64} [2005, 2006] ForwardOrdered Irregular Points ├──────────────────────────────────────────────────────── metadata ┤ Dict{Symbol, Any} with 1 entry: :groupby => :Ti=>year ├────────────────────────────────────────────────────── group dims ┤ ↓ X, → Ti └──────────────────────────────────────────────────────────────────┘ 2005 3×2 DimArray 2006 3×2 DimArray julia> mean(B) ╭─────────────────────────╮ │ 3×2 DimArray{Float64,2} │ ├─────────────────────────┴───────────────────────────────────────────────────────────── dims ┐ ↓ X Sampled{Int64} 1:3 ForwardOrdered Regular Points, → Ti Sampled{Date} [Date("2005-11-01"), Date("2005-12-01")] ForwardOrdered Irregular Points └─────────────────────────────────────────────────────────────────────────────────────────────┘ ↓ → 2005-11-01 2005-12-01 1 4.0 7.0 2 5.0 8.0 3 6.0 9.0 ``` ### HOWEVER This only works by chance, since the dimensions of each group happened to be the same. If I instead extend the time dimension by 1 month, so that the groups have different sizes: ```julia julia> tim = Ti(Dates.Date("2005-11-01"):Dates.Month(1):Dates.Date("2006-03-01")) Ti Date("2005-11-01"):Month(1):Date("2006-03-01") julia> AA = DimArray(reshape(1.:length(xs)*length(tim), length(xs), :), (xs,tim)) ╭─────────────────────────╮ │ 3×5 DimArray{Float64,2} │ ├─────────────────────────┴───────────────────────────────────────────────────────────────── dims ┐ ↓ X Sampled{Int64} 1:3 ForwardOrdered Regular Points, → Ti Sampled{Date} Date("2005-11-01"):Month(1):Date("2006-03-01") ForwardOrdered Regular Points └─────────────────────────────────────────────────────────────────────────────────────────────────┘ ↓ → 2005-11-01 2005-12-01 2006-01-01 2006-02-01 2006-03-01 1 1.0 4.0 7.0 10.0 13.0 2 2.0 5.0 8.0 11.0 14.0 3 3.0 6.0 9.0 12.0 15.0 julia> B = groupby(AA, Ti => year) ╭──────────────────────────────────────────────────╮ │ 2-element DimGroupByArray{DimArray{Float64,1},1} │ ├──────────────────────────────────────────────────┴───────── dims ┐ ↓ Ti Sampled{Int64} [2005, 2006] ForwardOrdered Irregular Points ├──────────────────────────────────────────────────────── metadata ┤ Dict{Symbol, Any} with 1 entry: :groupby => :Ti=>year ├────────────────────────────────────────────────────── group dims ┤ ↓ X, → Ti └──────────────────────────────────────────────────────────────────┘ 2005 3×2 DimArray 2006 3×3 DimArray julia> mean(B) ERROR: DimensionMismatch: dimensions must match: a has dims (DimensionalData.Dimensions.DimUnitRange(Base.OneTo(3), X{DimensionalData.Dimensions.Lookups.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.Lookups.ForwardOrdered, DimensionalData.Dimensions.Lookups.Regular{Int64}, DimensionalData.Dimensions.Lookups.Points, DimensionalData.Dimensions.Lookups.NoMetadata}}([1, 2, 3])), DimensionalData.Dimensions.DimUnitRange(Base.OneTo(2), Ti{DimensionalData.Dimensions.Lookups.Sampled{Date, SubArray{Date, 1, StepRange{Date, Month}, Tuple{Vector{Int64}}, false}, DimensionalData.Dimensions.Lookups.ForwardOrdered, DimensionalData.Dimensions.Lookups.Irregular{Tuple{Date, Date}}, DimensionalData.Dimensions.Lookups.Points, DimensionalData.Dimensions.Lookups.NoMetadata}}([Date("2005-11-01"), Date("2005-12-01")]))), b has dims (DimensionalData.Dimensions.DimUnitRange(Base.OneTo(3), X{DimensionalData.Dimensions.Lookups.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.Lookups.ForwardOrdered, DimensionalData.Dimensions.Lookups.Regular{Int64}, DimensionalData.Dimensions.Lookups.Points, DimensionalData.Dimensions.Lookups.NoMetadata}}([1, 2, 3])), DimensionalData.Dimensions.DimUnitRange(Base.OneTo(3), Ti{DimensionalData.Dimensions.Lookups.Sampled{Date, SubArray{Date, 1, StepRange{Date, Month}, Tuple{Vector{Int64}}, false}, DimensionalData.Dimensions.Lookups.ForwardOrdered, DimensionalData.Dimensions.Lookups.Irregular{Tuple{Date, Date}}, DimensionalData.Dimensions.Lookups.Points, DimensionalData.Dimensions.Lookups.NoMetadata}}([Date("2006-01-01"), Date("2006-02-01"), Date("2006-03-01")]))), mismatch at 2 Stacktrace: [...] ```

Defining something as naive as: (which is essentially what it appears you propose, except I'm skipping combine)

function Statistics.mean(B::DimensionalData.DimGroupByArray)
    groupby_dim = name(dims(B))
    reduced_B = mean.(B, dims=groupby_dim)
    cat(reduced_B..., dims=groupby_dim)
end

appears to work (and is essentially what I'm doing already), but a look at it with @code_warntype appears it does not yield type-stable code, because I just do mean.(B, dims=groupby_dim) under the hood.


so... perhaps something like combine and some clever use of the stored type information, as you suggest, is indeed a reasonable way forward?

I was trying to check how e.g. xarray does this, but they seem to essentially directly support mean(grouped_array).

rafaqz commented 1 week ago

The problem is it's pretty ambiguous what mean means here. XArray hacks mean to do an inner mean and cat like you want. But that's not very Julian.

If DimGroupbyArray is actually an AbstractArray we can't break the interface - it's an array of arrays and should act like any other array of arrays

I guess you are suggesting it should not be an AbstractArray ? (To not get that confusing result from mean)

And yes combine is specifically my idea to do what XArray does but without the magic inner mean behaviour from the regular mean method. This will also mean it works for any function, which is not true in xarray

haakon-e commented 1 week ago

Alright, I see it's not julian to apply mean(B, dims=Ti) directly with the output I am expecting, so combine seems to be a good idea. Do you know if the use of combine is found in other similar packages/languages? It's probably a good idea to use terminology/approaches people find intuitive already.

I'm not sufficiently steeped in this code base to have a well-informed opinion whether DimGroupByArray should be an AbstractArray or not. That said, I don't imagine the use of the current result from mean(B::DimGroupByArray), so perhaps a dispatch could be added which errors? I'm not sure if changing the underlying abstract type could have other unintended consequences..