rafaqz / DimensionalData.jl

Named dimensions and indexing for julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License
271 stars 38 forks source link

Allow Between on `Unordered` #550

Open kdheepak opened 11 months ago

kdheepak commented 11 months ago

Hi,

I'm created a DimArray with multiple dimensions where the dimensions have categorical data. Sometimes I get Unordered or ForwardOrdered dimensions. The categorical data is always an eltype(dim) == String.

I'm assuming it does an alphabetical sort and checks if that matches the original data and if it does it creates a ForwardOrdered and otherwise it creates Unordered.

Is there a way to force it to always create ForwardOrdered dimensions? We have strings that are not ordered alphabetically but are always going to be in the same order and we would like to use Between("String1", "String2") in some cases.

Thanks!

rafaqz commented 11 months ago

You can fully construct a Categorical or Sampled lookup yourself if you know the values. Rasters.jl does this.

Autodection is for end users - strings and symbols are always categorical, and sorting is checked in the constructor.

But the order actually has meaning in dispatch - it tells us if we can do a binary search with searchsortedfirst and which direction to do it in. Unordered needs to use the slower findfirst.

If you use the wrong type you will get wrong answers in At and Between.

rafaqz commented 11 months ago

I guess we could make Between work on Unordered ?

It just needs a between function for that using findfirst. Its in the selectors.jl file in the LookupArrays module.

(This wasnt implemented because Between is mostly synonymous and now replaced by .. and selecting intervals over unsorted lists is kinda weird)

kdheepak commented 11 months ago

You can fully construct a Categorical or Sampled lookup yourself if you know the values.

Do you have an example of this? I would like to construct @dim Enduse or @dim Tech as well to build types of these dimensions but I'm only seeing examples of Categorical data used with X or Y

I was expecting something like this to work but I'm getting errors:

julia> @dim Tech
julia> DimArray(zeros(2,2), (Enduse = ["hello", "world"], Tech = Tech(DD.Categorical(["hello", "world"], order=DimensionalData.ForwardOrdered()))))

ERROR: MethodError: no method matching format(::Tech{DimensionalData.Dimensions.LookupArrays.Categorical{String, Vector{String}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, ::Type{Tech}, ::Base.OneTo{Int64})

Some of our data has the following form (or at least in debugging it'll be useful to be able to generate this following form):

julia> Enduse_short_name = ["SF", "MF"];

julia> Enduse_long_name = ["Single Family", "Small Family"];

julia> @assert length(Enduse_short_name) == length(Enduse_long_name)

julia> @assert sort(Enduse_short_name) != Enduse_short_name

julia> @assert sort(Enduse_long_name) == Enduse_long_name

variable1 = DimArray(..., [Enduse => Enduse_short_name, ...])
variable2 = DimArray(..., [Enduse => Enduse_long_name, ...])

variable1[Enduse = Between("SF", "MF")] == variable2[Enduse = Between("Small Family", "Single Family")]

The type that gets generated for variable1 is not the same as the type that gets generated for variable2. So right now, Between works only for the sorted dimension.

I would basically like to force ForwardOrdered for all the categorical data that I have.

pmodv commented 11 months ago

I constructed them as so:

dimY = Y(Categorical([:al,:bet,:gim,:dal]; order = ForwardOrdered()))
dimX = X(Categorical([:a,:b,:g,:d]; order = ForwardOrdered()))

Then:

DimArray(zeros(4,4), (dimX,dimY))

... which results in (for me):

4×4 DimArray{Float64,2} with dimensions: 
  X Categorical{Symbol} Symbol[:a, :b, :g, :d] ForwardOrdered,
  Y Categorical{Symbol} Symbol[:al, :bet, :gim, :dal] ForwardOrdered
       :al   :bet   :gim   :dal
  :a  0.0   0.0    0.0    0.0
  :b  0.0   0.0    0.0    0.0
  :g  0.0   0.0    0.0    0.0
  :d  0.0   0.0    0.0    0.0
rafaqz commented 11 months ago

@kdheepak there is just a mistake in your code - you have Tech twice:

DimArray(zeros(2,2), (Enduse = ["hello", "world"], Tech = Tech(DD.Categorical(["hello", "world"], order=DimensionalData.ForwardOrdered()))))

You need to either use the NamedTuple syntax where the names are the dimension type and the value is the Array or LookupArray:

DimArray(zeros(2,2), (Enduse = ["hello", "world"], Tech = DD.Categorical(["hello", "world"], order=DimensionalData.ForwardOrdered())))

or use a Tuple syntax where the tuple contains Dimensions

DimArray(zeros(2,2), (Enduse(["hello", "world"]), Tech(DD.Categorical(["hello", "world"], order=DimensionalData.ForwardOrdered()))))

We probably need a better error message for when you mix them, I've never had time to make the error messages better.

rafaqz commented 11 months ago

@pmodv do notice that your Y dimension is broken - it isn't ForwardOrdered. You will get the wrong results with At because it will try to use searchsortedfirst. If the lookup is not sorted it needs to be Unordered.

rafaqz commented 11 months ago

Instead of forcing ForwardOrdered the real question is from above:

I guess we could make Between work on Unordered ?

kdheepak commented 11 months ago

Thanks for the quick replies! Removing the Tech constructor works. That resolves this issue I think and it can be closed.

I'm curious. If I execute @dim Tech after constructing a DimArray, the old array has Dim{:Tech} but any new arrays constructed has Tech:

julia> using DimensionalData: @dim

julia> using DimensionalData

julia> import DimensionalData as DD

julia> arr = DimArray(zeros(2,2), (Enduse = ["hello", "world"], Tech = DD.Categorical(["zzz", "world"], order=DimensionalData.ForwardOrdered())))
2×2 DimArray{Float64,2} with dimensions: 
  Dim{:Enduse} Categorical{String} String["hello", "world"] ForwardOrdered,
  Dim{:Tech} Categorical{String} String["zzz", "world"] ForwardOrdered
            "zzz"   "world"
  "hello"  0.0     0.0
  "world"  0.0     0.0

julia> @dim Tech

julia> arr
2×2 DimArray{Float64,2} with dimensions: 
  Dim{:Enduse} Categorical{String} String["hello", "world"] ForwardOrdered,
  Dim{:Tech} Categorical{String} String["zzz", "world"] ForwardOrdered
            "zzz"   "world"
  "hello"  0.0     0.0
  "world"  0.0     0.0

julia> arr = DimArray(zeros(2,2), (Enduse = ["hello", "world"], Tech = DD.Categorical(["zzz", "world"], order=DimensionalData.ForwardOrdered())))
2×2 DimArray{Float64,2} with dimensions: 
  Dim{:Enduse} Categorical{String} String["hello", "world"] ForwardOrdered,
  Tech Categorical{String} String["zzz", "world"] ForwardOrdered
            "zzz"   "world"
  "hello"  0.0     0.0
  "world"  0.0     0.0

It makes sense that the type of the old array doesn't change but how does the new array know about the Tech dimension type? I guess I don't quite understand how DimensionalData is able to track which types are created as Dimensions using @dim? Is it looking for any subtypes of a dimension with the same name?

kdheepak commented 11 months ago

I guess we could make Between work on Unordered ?

That would work but are you suggesting implementing functions on our end or making updates to DimensionalData.jl?

rafaqz commented 11 months ago

Yeah - the Dim{:Tech} thing is kinda confusing. Thats the default object when there is no available type, and you cant make it inherit from anything. (You can do @dim XDim Tech to make it plot on x axes).

It knows afterwards because the macro adds methods for the internal key2dim that dispatch on Val{:Tech}.

Yes, kinda too magical really but I was into that stuff three years ago...

kdheepak commented 11 months ago

Cool! That makes sense! Feel free to close the issue as resolved.

kdheepak commented 11 months ago

It looks like I what I really want is to force everything to be Unordered so that At works always, even if it is slower. I think to use this we'd want to Between working on Unordered.

I'm happy to submit a PR if you can give some direction on how to go about doing this.

kdheepak commented 11 months ago

I guess we could make Between work on Unordered ?

It just needs a between function for that using findfirst. Its in the selectors.jl file in the LookupArrays module.

(This wasnt implemented because Between is mostly synonymous and now replaced by .. and selecting intervals over unsorted lists is kinda weird)

I'm assuming something has to be implemented in DimensionalData.jl for this to work, right?


julia> arr = DimArray(zeros(2,2), (Enduse = DD.Categorical(["hello", "world"], order=DD.Unordered()), Tech = DD.Categorical(["zzz", "world"], order=DD.Unordered()))) 

julia> arr[Tech=At("zzz"):At("world")]
ERROR: MethodError: no method matching isless(::At{String, Nothing, Nothing}, ::At{String, Nothing, Nothing})

Closest candidates are:
  isless(::Any, ::Missing)
   @ Base missing.jl:88
  isless(::Missing, ::Any)
   @ Base missing.jl:87

julia> arr[Tech=At("zzz")..At("world")]
ERROR: ArgumentError: Cannot use an interval or `Between` with Unordered
Stacktrace:
rafaqz commented 11 months ago

Why force Unordered? At works fine ether way as long as the type matches the state of the lookup. (Or youre saying you dont know if its ordered?)

Between or .. working on Unordered is doable... Its still ordered in the sense of the lookup.

But confusion is possoble if people expect to still get indices alphabetically between their values.

rafaqz commented 11 months ago

Yourv At example is just .. but not implemented. It wont need the At parts.

You can pit a vector inside At if you want that, but you have to specify everything

kdheepak commented 11 months ago

Why force Unordered?

Mainly because it is easier to know the types ahead of time. I'm playing around with using DimStack which I believe would avoid the need for this, but at the moment we have a lot of code to port and test to use DimensionalData.jl and I want to experiment more before committing to using the package. Our current code uses findfirst(predicate, categorical_data) and findall(predicate, categorical_data) and it would be useful to force Unordered to see if there's any performance difference. I'm expecting there would be little to none if type inference doesn't fail.

Or youre saying you dont know if its ordered?

Our users / modelers may add categorical data to existing data and it may change from ForwardOrdered to Unordered, so I want to be able to allow them to write code that works on either of them exposing them to ForwardOrdered / Unordered / other implementation details.

But confusion is possoble if people expect to still get indices alphabetically between their values.

I'm not exactly following what confusion would occur but I'm sure there's something you are referring to that I'm missing. Can you expand on what you think is the issue with Between working on categorical data, where the implementation basically does findfirst(a):findlast(b) for Between(a, b)?

rafaqz commented 11 months ago

Unordered is much slower for large arrays, because binary search will beat linear search.

But for small arrays Unordered may be faster

rafaqz commented 11 months ago

The confusion is that with ForwardOrdered Between or .. have two meanings that both hold.

  1. All indices with lookup values between specific values, and

  2. all indices between the two outer indices with specified lookup values.

You are arguing for the secong meaning to be canonical, but someone could equally argue for the first - then Between would work like Where over the interval.

pmodv commented 11 months ago

Can't we solve this by indexing with @enum?

E.g:

@enum Idx idxfirst = 1 idxsecond = 2

A = [1,2,3,4]

Base.to_index(i::Idx) = Int(i)

... and thus...

julia> A[idxfirst]
1
kdheepak commented 11 months ago

You are arguing for the second meaning to be canonical, but someone could equally argue for the first - then Between would work like Where over the interval.

For categorical data, is there a difference? In categorical data, findfirst(A, idx) == findlast(A, idx), right?

rafaqz commented 11 months ago

Ordered categories are a thing, so its really no different to Sampled.

We can make this change, Im just sure I will get a bug report one day where someone expected the opposite. So we should be sure that its needed.