rafaqz / DimensionalData.jl

Named dimensions and indexing for julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License

How to generate type ahead of time #549

Closed kdheepak closed 11 months ago

kdheepak commented 11 months ago

I have a DimArray with the following content:

julia> variable
6×10×43×25×67 DimArray{Float64,5} with dimensions: 
  Dim{:Enduse} Categorical{String} String[Process Heat, Motors, …, Off-Road, Excess Steam] Unordered,
  Dim{:Tech} Categorical{String} String[Electric, Gas, …, Steam, Fuel Cell] Unordered,
  Dim{:EC} Categorical{String} String[Food & Tobacco, Textiles, Apparel & Leather, …, Crop Production, Animal Production] Unordered,
  Dim{:Area} Categorical{String} String[Ontario, Quebec, …, Mexico, Rest of World] Unordered,
  Dim{:Year} Categorical{String} String[1985, 1986, …, 2050, 2051] ForwardOrdered
[:, :, 1, 1, 1]
                           "Electric"   "Gas"   "Coal"   "Oil"   "Biomass"   "Solar"   "LPG"   "Off-Road"   "Steam"   "Fuel Cell"
  "Process Heat"          0.0          0.0     0.0      0.0     0.0         0.0       0.0     0.0          0.0       0.0
  "Motors"                0.0          0.0     0.0      0.0     0.0         0.0       0.0     0.0          0.0       0.0
  "Other Substitutables"  0.0          0.0     0.0      0.0     0.0         0.0       0.0     0.0          0.0       0.0
  "Miscellaneous"         0.0          0.0     0.0      0.0     0.0         0.0       0.0     0.0          0.0       0.0
  "Off-Road"              0.0          0.0     0.0      0.0     0.0         0.0       0.0     0.0          0.0       0.0
  "Excess Steam"          0.0          0.0     0.0      0.0     0.0         0.0       0.0     0.0          0.0       0.0
[and 72024 more slices...]

This is what I get with the typeof(variable):

julia> typeof(variable)
DimArray{Float64, 5, Tuple{Dim{:Enduse, DimensionalData.Dimensions.LookupArrays.Categorical{String, Vector{String}, DimensionalData.Dimensions.LookupArrays.Unordered, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:Tech, DimensionalData.Dimensions.LookupArrays.Categorical{String, Vector{String}, DimensionalData.Dimensions.LookupArrays.Unordered, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:EC, DimensionalData.Dimensions.LookupArrays.Categorical{String, Vector{String}, DimensionalData.Dimensions.LookupArrays.Unordered, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:Area, DimensionalData.Dimensions.LookupArrays.Categorical{String, Vector{String}, DimensionalData.Dimensions.LookupArrays.Unordered, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:Year, DimensionalData.Dimensions.LookupArrays.Categorical{String, Vector{String}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, Tuple{}, Array{Float64, 5}, DimensionalData.NoName, DimensionalData.Dimensions.LookupArrays.NoMetadata}

In the data that I'm working with, the "dimensions" are always categorical data (except for Year, which is the last dimension).

Before using DimensionalData.jl, I was storing this in a struct like this:

struct Data
  variable::Array{Float64, 5}
end

Data(; variable = read_data_from_hdf5_file("/group/variable")) = Data(variable)

Now, with DimensionalData.jl, I want to store it in the struct:

struct Data
  variable::create_type_on_the_fly(:Enduse, :Tech, :EC, :Area, :Year)
end

Data(; variable = read_data_from_hdf5_file_and_create_instance_of_dimarray("/group/variable")) = Data(variable)

Is there an easy way to do this? If I do the following (i.e. not defining an explicit type), it ends up being a lot slower.

struct Data
  variable
end

Data(; variable = read_data_from_hdf5_file_and_create_instance_of_dimarray("/group/variable")) = Data(variable)

This is a simplified example. In the actual code I'm working with, there are more than 200 such variables in a single struct.

rafaqz commented 11 months ago

struct Data{T}
    variable::T
end

kdheepak commented 11 months ago

Thanks for the answer! If all the types in my struct were unique, this would be very impractical for me to write:

struct Data{T1, T2, T3, T4, T5, ..., T198, T199, T200}
  variable1::T1
  variable2::T2
  variable3::T3
  ...
  variable200::T200
end

They are not all unique, so there won't be that many type parameters, but there will still be a couple dozen or so.

struct Data{T1, T2, ..., T20}
  variable1::T1
  variable2::T9
  variable3::T20
  ...
  variable200::T5
end

This seems like it is going to be very error-prone for the users that I'm working with.

Is there another way you think I can go about this?

rafaqz commented 11 months ago

If you have that many arrays, why not use a macro to define your struct?

But essentially, if you want type-stable structs, you have to use type parameters like that. That's just Julia, not DimensionalData.jl.
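
A minimal sketch of the difference, using only Base Julia. The struct names `UntypedData` and `TypedData` are made up for illustration and a plain `Array` stands in for a `DimArray`. A field with no annotation is stored as `Any`, whereas a type parameter makes the concrete array type part of the struct's type:

```julia
# Field without a type annotation: stored as `Any`, so access is dynamically typed.
struct UntypedData
    variable
end

# Field typed by a parameter: the concrete array type becomes part of the struct type.
struct TypedData{T}
    variable::T
end

arr = zeros(Float64, 2, 3)

untyped = UntypedData(arr)
typed = TypedData(arr)   # T is inferred from the argument

fieldtype(typeof(untyped), :variable)   # Any
fieldtype(typeof(typed), :variable)     # Matrix{Float64}
```

The caller never writes the long type out by hand; it is picked up from whatever array is passed to the constructor.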

You would not manually specify the type of most other AbstractArray types from packages; the exact type is often not part of the interface.

Also, what you have looks like a DimStack, which is just as fast (as a fully type-stable version) but more organised than your struct.
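
Returning to the macro suggestion, here is one possible shape for it, offered as a sketch rather than a recommended design. The macro name `@parametric_struct` is hypothetical. It only performs a syntax transformation: it takes field names and generates one type parameter per field, so the concrete types are still inferred at construction time and never have to be known by the macro.

```julia
# Hypothetical helper macro: generates one type parameter per field,
# so `@parametric_struct Data a b` expands to
#     struct Data{T1, T2}
#         a::T1
#         b::T2
#     end
macro parametric_struct(name, fields...)
    params = [Symbol(:T, i) for i in 1:length(fields)]
    typed_fields = [:($f::$p) for (f, p) in zip(fields, params)]
    return esc(quote
        struct $name{$(params...)}
            $(typed_fields...)
        end
    end)
end

@parametric_struct Data variable1 variable2 variable3

# Plain arrays stand in for DimArrays here.
d = Data(zeros(2, 3), zeros(4), zeros(2, 2, 2))
isconcretetype(typeof(d))   # true
```

This sketch gives every field its own parameter even when several fields share a type, which keeps the macro simple at the cost of a longer parameter list.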

kdheepak commented 11 months ago

I guess I was thinking I only need to use type parameters if I want the struct to be generic. What I would really like to do is this:

const T1 = create_type_on_the_fly(:Enduse, :Tech, :EC, :Area, :Year)
const T2 = create_type_on_the_fly(:Enduse, :Tech, :EC, :Year)
...
const T20 = create_type_on_the_fly(:A, :B, :C, :D, :Year)

struct Data
  variable1::T1
  variable2::T9
  variable3::T20
  ...
  variable200::T5
end

My current workaround was to define something like this:

function create_type_on_the_fly(dims...)
  values = (get_categorical_data_for_dimension(d) for d in dims)
  arr = zeros(Float64, length.(values)...)
  nt = NamedTuple{dims}(values)
  typeof(DimArray(arr, nt))
end

I was hoping for a better way.

In our case, we probably don't even need a single struct. Everything is in an HDF5 file, and we can probably read and write the data directly from the HDF5 file instead of reading it into a struct and writing it back from the struct. I think we'll run into type stability issues there too, though.

What do you imagine a macro might look like? I know how to write macros, but I'm not sure how a macro would help here. It is not just a syntax transformation, right? I don't know the type that needs to be used.

I also am not sure if there'd be performance issues for defining a generic struct with so many type parameters (~20-30 at the moment, may increase) and wanted to explicitly type everything for that reason.

I'm new to DimensionalData.jl. I did see DimStack in the documentation but haven't had the chance to play around with it yet! I'll check it out.

rafaqz commented 11 months ago

Don't do that on_the_fly thing... just use type parameters {T}; that's literally what they are, but cleaner.

And really DimStack is what you want. It's a hybrid of a DimArray and a NamedTuple. The dimensions of all array layers must match, but they don't have to use all dimensions. That seems like what you are doing.

kdheepak commented 11 months ago

The dimensions of all array layers are not the same unfortunately. There's all sorts of combinations, e.g.:

(:Enduse, :Tech, :EC, :Area, :Year)
(:Fuel, :Tech, ...)
(:Fuel, :Pollution, ...)

Thanks for the suggestion on type parameters! I wanted something that I could quickly prototype with to see how DimensionalData.jl fares with what our application throws at it; but maybe it is best to just use type parameters from the get go.

The other thing is I am already using @kwdef so I didn't want to write another macro unless I could get it to compose well.

@kwdef struct Data
  variable1::T1 = ReadFromHDF5("/group/variable1")
  variable2::T2 = ReadFromHDF5("/group/variable2")
  variable3::T3 = ReadFromHDF5("/group/variable3")
end

I can probably figure out how to do this, though.
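
For what it's worth, `Base.@kwdef` appears to compose with type parameters directly, so a second macro may not be needed. A minimal sketch under stated assumptions: `read_variable` is a hypothetical stand-in for the HDF5 reader, plain arrays stand in for DimArrays, and the struct is named `KwData` only to keep the example separate from the ones above.

```julia
# Hypothetical stand-in for the HDF5 reader in the thread: returns arrays whose
# dimensionality depends on the path, the way different variables would.
function read_variable(path::AbstractString)
    return endswith(path, "1") ? zeros(Float64, 2, 3) : zeros(Float64, 4)
end

Base.@kwdef struct KwData{T1, T2}
    variable1::T1 = read_variable("/group/variable1")
    variable2::T2 = read_variable("/group/variable2")
end

data = KwData()                             # all defaults; T1 and T2 are inferred
custom = KwData(variable2 = zeros(Int, 5))  # override one field by keyword
```

The reader itself is not type stable in this sketch, but the constructed struct has a concrete type, so code that receives it as a function argument can still specialise on it.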

kdheepak commented 11 months ago

I'll close this issue. I can reopen or create a new issue if I have more questions. Thanks for your swift responses on here!

rafaqz commented 11 months ago

"The dimensions of all array layers are not the same unfortunately."

DimStack is made to handle that. If layers share a dimension, it has to match, but they don't need to share all or even any dimensions.
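
A small sketch of what that could look like, assuming a recent DimensionalData.jl. The category labels and the layer names `variable1` to `variable3` are made up, and the arrays are far smaller than the real data:

```julia
using DimensionalData

# Made-up category labels, far fewer than in the real data.
enduse = Dim{:Enduse}(["Process Heat", "Motors"])
tech   = Dim{:Tech}(["Electric", "Gas", "Coal"])
fuel   = Dim{:Fuel}(["Diesel", "Propane"])

a = DimArray(zeros(2, 3), (enduse, tech))  # (:Enduse, :Tech)
b = DimArray(ones(3, 2), (tech, fuel))     # (:Tech, :Fuel), shares only :Tech with `a`
c = DimArray(zeros(2), (fuel,))            # (:Fuel,), shares nothing with `a`

# Layers are collected from a NamedTuple; shared dimensions must have matching lookups.
stack = DimStack((variable1 = a, variable2 = b, variable3 = c))

# Pull out one layer and index it by category label.
value = stack[:variable2][Tech = At("Gas"), Fuel = At("Propane")]
```

Each layer keeps its own subset of the dimensions, which is the "all sorts of combinations" case described earlier in the thread.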

kdheepak commented 11 months ago

Ah that makes sense! Thanks for the clarification.