rafaqz / DimensionalData.jl

Named dimensions and indexing for julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License
277 stars 41 forks source link

Is `At()` Necessary for `Symbol`+`String` Inputs? #338

Open ParadaCarleton opened 2 years ago

ParadaCarleton commented 2 years ago

Would it make sense to infer, when a string or symbol is passed to getindex, that the user probably wants At(string)? Adding this would probably make it easier to transition a bunch of AxisArrays code to use DimensionalData.jl.

rafaqz commented 2 years ago

Yeah, this has been a contentious thing for a while. It initially required At - I chose it because its the shortest thing that could mean that to minimise how annoying this is... but then I switched it so what you want actually worked up until a few months ago... and then went back to requiring At! because automatic At is confusing. But you're right, it's mostly numbers that are weird without At when using DimArray.

Symbol can also be weird with DimStack - a single Symbol getindex is essentially NamedTuple indexing to get DimArray layers by name. But At(:symbol) is indexing the values of all the layers at the same time - which is a very different thing. And its nice if DimArray and DimStack have the same syntax. Of course with a Dimension wrapper its obvious what the Symbol means even for DimStack.

So to answer your question, no it's not totally necesary, but the semantics starts to feel a bit crowded without At wrappers.

I have thought about adding a trait somewhere for automatic At so people can choose that if they need it.

ParadaCarleton commented 2 years ago

Symbol can also be weird with DimStack - a single Symbol getindex is essentially NamedTuple indexing to get DimArray layers by name. But At(:symbol) is indexing the values of all the layers at the same time - which is a very different thing. And its nice if DimArray and DimStack have the same syntax. Of course with a Dimension wrapper its obvious what the Symbol means even for DimStack.

I think that makes some sense, but it definitely causes problems for anyone trying to transition their codebase from AxisArrays. That being said, isn't not having to use At() a way of keeping the interface more consistent between DimStack and DimArray? That way it's possible to index with a symbol regardless of whether DimArray or DimStack is being used. (Even though it's not using a dimension struct, I would naturally tend to think of the names for DimStack as being something like an extra dimension which can be indexed along.)

Alternatively, given that DimStack is probably going to see a lot less use than DimArray, the syntax for DimStack could be changed, and then symbols could be treated the same as symbols for DimArray; perhaps AtStack()? Or perhaps the dimension name could be used; indexing a DimStack as x[:dim_array] (not providing a dimension name) treats x as a named tuple, while x[person=:name] indexes all the layers at the same time regardless of whether you're using a DimStack or DimArray.

ParadaCarleton commented 2 years ago

Yeah, this has been a contentious thing for a while. It initially required At - I chose it because its the shortest thing that could mean that to minimise how annoying this is... but then I switched it so what you want actually worked up until a few months ago... and then went back to requiring At! because automatic At is confusing. But you're right, it's mostly numbers that are weird without At when using DimArray.

Symbol can also be weird with DimStack - a single Symbol getindex is essentially NamedTuple indexing to get DimArray layers by name. But At(:symbol) is indexing the values of all the layers at the same time - which is a very different thing. And its nice if DimArray and DimStack have the same syntax. Of course with a Dimension wrapper its obvious what the Symbol means even for DimStack.

So to answer your question, no it's not totally necesary, but the semantics starts to feel a bit crowded without At wrappers.

I have thought about adding a trait somewhere for automatic At so people can choose that if they need it.

It's worth noting that if symbol keys are automatically wrapped in At, switching from AxisKeys becomes a lot more viable for other packages -- for instance, we're considering making this swap in Turing, but having to change all the indexing-related syntax would be a definite hassle.

rafaqz commented 2 years ago

I'm open to working on a change if this will be blocking to use cases elsewhere. But we have to be careful with the syntax as we're trying to do more things than AxisArrays.jl does.

DimStack is actually pretty central to the reason for writing this package, mostly through RasterStack <: AbstractDimStack in Rasters.jl. It can represent a netcdf and similar structures with multiple layers with mixed number of dims in a form that is easy to use and fast to index. And it's a table...

The specific syntax clash is this edge case, which is unlikely to happen very often, but needs handling:

julia> using DimensionalData

julia> st = DimStack((a=rand(10), b=falses(10)), (X(Symbol.('a':'j')),))
DimStack with dimensions:
  X Categorical Symbol[a, b, …, i, j] ForwardOrdered
and 2 layers:
  :a Float64 dims: X (10)
  :b Bool dims: X (10)

julia> st[:a]
10-element DimArray{Float64,1} a with dimensions:
  X Categorical Symbol[a, b, …, i, j] ForwardOrdered
 0.764328
 0.810431
 0.37886
 0.689809
 0.368838
 0.284454
 0.665395
 0.326088
 0.762186
 0.596156

julia> st[At(:a)]
(a = 0.7643275682479944, b = false)
rafaqz commented 2 years ago

I may have a solution to this. We can use getproperty to access DimStack layers, instead of getindex, like:

julia> st.a
10-element DimArray{Float64,1} a with dimensions:
  X Categorical Symbol[a, b, …, i, j] ForwardOrdered
 0.764328
 0.810431
 0.37886
 0.689809
 0.368838
 0.284454
 0.665395
 0.326088
 0.762186
 0.596156

julia> st[:a]
(a = 0.7643275682479944, b = false)

That might even be nice to separate out the syntax a little more anyway - so that getindex is always array-like, rather than dict-like.

ParadaCarleton commented 2 years ago

I may have a solution to this. We can use getproperty to access DimStack layers, instead of getindex, like:

julia> st.a
10-element DimArray{Float64,1} a with dimensions:
  X Categorical Symbol[a, b, …, i, j] ForwardOrdered
 0.764328
 0.810431
 0.37886
 0.689809
 0.368838
 0.284454
 0.665395
 0.326088
 0.762186
 0.596156

julia> st[:a]
(a = 0.7643275682479944, b = false)

That might even be nice to separate out the syntax a little more anyway - so that getindex is always array-like, rather than dict-like.

Definitely like that solution! Sounds like a good idea to me.

rafaqz commented 2 years ago

This is going to break a LOT of scripts built on Rasters.jl.

I will have to check in with people and see what they think, so it might take a while.

ParadaCarleton commented 2 years ago

This is going to break a LOT of scripts built on Rasters.jl.

I will have to check in with people and see what they think, so it might take a while.

Let me know what people think when you've talked to them about this.

I think this is a good idea, though; in general, transitioning from AxisArrays to DimensionalData should be as smooth as possible, maybe even automatic (as in, we could write a script that takes all code written with AxisArrays and converts it into DimensionalData code).