rafaqz / DimensionalData.jl

Named dimensions and indexing for julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License
271 stars 38 forks source link

Overhaul Tables.jl Interface #535

Closed JoshuaBillson closed 6 months ago

JoshuaBillson commented 1 year ago

The current implementation of DimTable has a number of drawbacks. In particular, we need to support the following:

  1. Can handle data with many columns.
  2. Can unfold specified dimensions of an AbstractDimArray into columns.
  3. Can construct an AbstractDimStack from a table.
  4. Compatibility with GeoStats.jl.
rafaqz commented 1 year ago

1-3 are all very good to have.

GeoStats.jl compatability wont happen here specifically, but if there is some generic shared way of getting that we can make it happen. What exactly doesn't work currently ? Is it just getting dimension values as Tuple points?

Do you mean GeoInterface.jl compatibility?

JoshuaBillson commented 1 year ago

Currently, the table interface for AbstractDimStack will output a table with one column for each dimension instead of a single :geometry column consisting of Meshes.jl geometries. This can be fixed by passing the table into georef with a tuple specifying each dimension, but the performance seems to be pretty bad. For example:

julia> @time georef(some_table, (:X, :Y));
 26.887305 seconds (247.80 M allocations: 6.462 GiB, 59.36% gc time)

Compare that to constructing the geometries directly:

julia> @time Meshes.Point.(some_table.X, some_table.Y);
  0.518585 seconds (4 allocations: 945.284 MiB, 42.80% gc time)

We also want some way of taking a GeoStats compatible table and turning it back into an AbstractDimStack. One solution would be to create an extension on GeoTables.jl with specialized methods for dimensional arrays and stacks. For example, we could specialize georef on AbstractDimStack and AbstractDimArray. This code would be loaded whenever the user imports GeoTables, so there would be no penalty to users who don't need that functionality.

rafaqz commented 1 year ago

Yes, :geometry as a column name is the standard we made for GeoInterface.jl, GeoStats.jl is just using it along with everything else geo. Following GeoInterface means you could equally write the output with Shapefile.jl. Which is what we want, rather than extensions being needed.

But, even GeoInterface.jl is too field specific to be supported here directly. ArviZ.jl people do not want that column name :geometry in their tables. So what we can do is add a keyword to DimTable that forces dimension columns to be merged into a Tuple:

DimTable(A; dimcolumn=:geometry)

Where :geometry could be whatever column people want in other fields.

Then in Rasters.jl, where we are all about geo and already depend on GeoInterface.jl, we can add a DimTable(x::Union{<:Raster,<:RasterStack}; dimcolumn=:geometry) method that adds this keyword, allowing user overrides the other direction. Or add a RasterTable in Rasters.jl that essentially just wraps the functionality here with this behavior.

JoshuaBillson commented 1 year ago

That sounds like a good idea. However, rather than specializing DimTable, maybe we could define a more extensible constructor that takes keyword arguments, then specialize Tables.columns on the type. So for a RasterStack we could do something like:

Tables.columns(x::AbstractRasterRack) = DimTable(x; dimcolumn=:geometry)

While AbstractDimStack and AbstractDimArray would look like this:

Tables.columns(x::Union{AbstractDimStack,AbstractDimArray}) = DimTable(x)
rafaqz commented 1 year ago

Totally, that's even better.

Thinking about it we may actually need to use something like dimcolumn=(:X, :Y)=>:geometry in Rasters.jl to avoid getting time and other dimensions in the geometry column.

Proabaly most of the code for this kind of thing is already written for MergedLookup, and here: https://github.com/rafaqz/DimensionalData.jl/pull/477

rafaqz commented 6 months ago

This is done except 3, which is a dup of #327