rafaqz / DimensionalData.jl

Named dimensions and indexing for julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License
281 stars 42 forks source link

keyword indexing fails for 0-dimensional DimArray #423

Closed sethaxen closed 8 months ago

sethaxen commented 1 year ago
julia> using DimensionalData

julia> x = DimArray(fill(3), ());

julia> y = DimArray(randn(2, 3), (Dim{:a}(1:2), Dim{:b}(1:3)));

julia> y[c=1]  # for a >0-dim array, indexing a missing dim just raises a warning
┌ Warning: (Dim{:c},) dims were not found in object
└ @ DimensionalData.Dimensions ~/.julia/packages/DimensionalData/K9D4P/src/Dimensions/primitives.jl:645
2×3 DimArray{Float64,2} with dimensions: 
  Dim{:a} Sampled{Int64} 1:2 ForwardOrdered Regular Points,
  Dim{:b} Sampled{Int64} 1:3 ForwardOrdered Regular Points
     1          2         3
 1   0.295756   1.10136   0.855181
 2  -0.963257  -0.4447   -0.95428

julia> x[c=1] # but for 0-dim we StackOverflow
ERROR: StackOverflowError:
Stacktrace:
 [1] dims2indices(x::Tuple{}, I::Tuple{Dim{:c, Int64}}) (repeats 79984 times)
   @ DimensionalData.Dimensions ~/.julia/packages/DimensionalData/K9D4P/src/Dimensions/indexing.jl:34
sethaxen commented 1 year ago

I thought what was missing was the following method

julia> @inline Dimensions.dims2indices(::Tuple{}, I) = ()

But this has an unintended result. Indexing by a missing keyword then behaves like x[]:

julia> x[c=1]
3

julia> ds = DimStack((; x, y));

julia> ds[c=1]
┌ Warning: (Dim{:c},) dims were not found in object
└ @ DimensionalData.Dimensions ~/.julia/packages/DimensionalData/K9D4P/src/Dimensions/primitives.jl:645
(x = 3, y = [-1.5090420829521751 -0.629637266507813 -1.1542533262114874; 1.78624724297272 -1.341032743849202 0.2629446089245767])

I would expect x[c=1] to return x as a 0-dimensional array with a warning, and similarly I would expect indexing with ds to return the dataset unchanged.

rafaqz commented 1 year ago

It is pretty weird I agree. But I would also have defined your missing method, expecting the result you got.

We fill in missing dimensions with : in dimensional indexing so we can be concise. But here it means calling getindex(x) after splatting in the empty tuple.

So I guess you are asking to special case this to not call getindex at all for zero dim arrays? Maybe only with DimStack?

That also feels pretty wierd to me, getindex call on a zero dim array always returns a scalar (unless you add extra colons). But honestly I don't know what we should do here and if it should be different to Base, thats probably why the bug is still here.

view actually does what you want here by the way, as Base doesn't unwrap with view(x). That could be a way to deal with this.

Edit: another way to explain why I expect your result is that getindex always creates a new array for all stack layers, and it still gets called on each layer even when there are no matching dims - just with colons for all dimensions.

This keeps to to our mental model for arrays - that altering an object after getindex does not change the original.

Maybe what we want for DimStack is the exact opposite of Base.dotview: an indexing method that always assigns a new array, but never unwraps zero dims - it copies them instead.

sethaxen commented 1 year ago

Perhaps I have the wrong mental model of DimStack, but I would expect that getindex with just keyword arguments would always return a DimStack. It's strange to me that if I put a 0-dimensional array in the DimStack, then I suddenly don't get a DimStack output anymore when I use keyword arguments to getindex. The result isn't even a valid Table.

view actually does what you want here by the way, as Base doesn't unwrap with view(x). That could be a way to deal with this.

That's good to know and could let me work around this. I should probably be viewing more than getindexing anyways.

rafaqz commented 1 year ago

It was hard to work out the behaviour when you index arrays with different dimensions all at the same time - because getindex returns a scalar if you index with all Int, but an array with Colon or a range. So if we index with Int for all the dimensions of the smallest array we end up with a mix of scalars and arrays.

But from this discussion I think using regular getindex with DimStack was actually a mistake - we should instead define a noscalargetindex method I described in my last comment - that always returns an array.

Then we always have a DimStack of DimArray that can become a DimTable. Even when all of the arrays have no dimensions after indexing.

Would that be better?

sethaxen commented 1 year ago

But from this discussion I think using regular getindex with DimStack was actually a mistake - we should instead define a noscalargetindex method I described in my last comment - that always returns an array.

If such a method existed that would indeed be useful. But IIUC, this could be implemented as

noscalargetindex(s::AbstractDimStack; kwargs...) = copy(view(s; kwargs...))

so maybe it's better to just document that users who want this behavior should use this approach instead of having a special method?

sethaxen commented 1 year ago

Ah wait, it seems view is not guaranteed to return an AbstractDimStack?

julia> x = DimArray(randn(2), :a)
2-element DimArray{Float64,1} with dimensions: Dim{:a}
 1  -0.753079
 2  -0.502682

julia> ds = DimStack((; x))
DimStack with dimensions: Dim{:a}
and 1 layer:
  :x Float64 dims: Dim{:a} (2)

julia> view(ds; a=1)
(x = fill(-0.753079230064672),)
rafaqz commented 1 year ago

That's a problem here rather than in Base so easy fix. Maybe view has the same check as getindex to not rebuild, that we just need to speialise.

sethaxen commented 1 year ago

That's a problem here rather than in Base so easy fix. Maybe view has the same check as getindex to not rebuild, that we just need to speialise.

Ah, it seems to be due to this inconsistency:

julia> x = DimArray(randn(2), :a);

julia> y = DimArray(randn(2, 3), (:b, :c));

julia> view(x; a=1) # a `DimArray` is not returned
0-dimensional view(::Vector{Float64}, 1) with eltype Float64:
-1.1244365106893395

julia> view(x, 1)  # what the above dispatches to
0-dimensional view(::Vector{Float64}, 1) with eltype Float64:
-1.1244365106893395

julia> z = DimArray(fill(randn()), ()); # also a problem for 0-dim arrays

julia> view(z) # fine with no indices
0-dimensional DimArray{Float64,0}: 

0.477376

julia> view(z, 1, 1)  # or 2
0-dimensional DimArray{Float64,0}: 

0.477376

julia> view(z, 1) # but not a DimArray with 1 index
0-dimensional view(::Vector{Float64}, 1) with eltype Float64:
0.4773758383436243

julia> view(y; b=1, c=1)  # if we index all dimensions, a `DimArray` is still returned
0-dimensional DimArray{Float64,0}: and reference dimensions: Dim{:b}, Dim{:c}

0.364411

julia> view(y, 1, 1)  # what the above dispatches to
0-dimensional DimArray{Float64,0}: and reference dimensions: Dim{:b}, Dim{:c}

0.364411

The issues seems to be this overload, https://github.com/rafaqz/DimensionalData.jl/blob/70f2aefc6c4248a0ab957f51cf8b415f9799a39e/src/array/indexing.jl#L17

For more than one integer index, there's a view-specific overload: https://github.com/rafaqz/DimensionalData.jl/blob/70f2aefc6c4248a0ab957f51cf8b415f9799a39e/src/array/indexing.jl#L43-L49

I can open a PR fixing this.