rafaqz / DimensionalData.jl

Named dimensions and indexing for julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License
262 stars 38 forks source link

Materialize `DimArray` or `DimStack` From a Table #739

Open JoshuaBillson opened 2 weeks ago

JoshuaBillson commented 2 weeks ago

Description:

resolves #335

This PR aims to let users construct either a DimStack or DimArray from a table with one or more coordinate columns.

Unlike the existing contructor, rows may be out of order or even missing altogether.

Performance:

The algorithm is O(n), requiring two forward passes for each dimension to determine the correct order of rows.

1000x1000: 0.005 Seconds

2000x2000: 0.025 Seconds

4000x4000: 0.108 Seconds

8000x8000: 0.376 Seconds

Example:

julia> r = DimArray(rand(UInt8, 1000, 1000), (X, Y));

julia> t = r |> DataFrame |> Random.shuffle!;

julia> restored = DimArray(t, dims(r));

julia> all(r .== restored)
true

Next Steps:

  1. Finalize method signatures.
  2. Decide what to do with missing rows. Currently, users may choose a missing value to fill in (missing by default).
  3. Add support for a :geometry column.
  4. Write test cases.
  5. Update docs.
JoshuaBillson commented 2 weeks ago

There's still a few questions that need resolving:

  1. What do we do with missing rows? This can be handled explicitly in Rasters.jl, but DimArrays have no concept of missing values. We could let users choose a missing value to write, or we could encode it explicitly with missing. We could also disallow the existence of missing rows, but I don't think this is an ideal solution.
  2. How should we handle a :geometry column? I know that several packages in the Julia ecosystem are using this convention, but I don't have much experience with them. Can we assume :geometry will contain tuples of coordinates, or could they also be some sort of geometry like Meshes.Point?
asinghvi17 commented 2 weeks ago

I think it's best to test that the geometry column's elements have GI.PointTrait - then you can always extract e.g. GI.x(point) and the same for y, z, and m. You can also interrogate the dimension via GI.is3d, GI.ismeasured.

Going forward to get the geometry column it's probably best to have first(GI.geometrycolumns(table)) - the fallback implementation will give you (:geometry,), but to handle tables with other geometry columns which may be indicated by e.g. metadata, it's better to use this.

rafaqz commented 2 weeks ago

DimensionalData.jl does not depend on GeoInterface

JoshuaBillson commented 1 week ago

DimensionalData.jl does not depend on GeoInterface

Being able to interact with geometries in a generic fashion would help us interop with other packages. GeoInterface's only dependency is Extents, which we already depend on. After importing DimensionalData, GeoInterface loads in 0.008 seconds.

I also see that Rasters already depends on GeoInterface. We could simply ignore the :geometry column in DimensionalData, then implement the additional functionality in Rasters. However, I think the problem is general enough to be implemented here.

rafaqz commented 1 week ago

It's not about deps or timing, it's about clean feature scope.

The promise here is that Rasters (and YAX) have all the geo deps and features, so non-geo people don't have to worry about them. Some of the biggest contributors here use DD for unrelated fields.

I would ignore point columns here entirely and instead write the code so it's easy for Rasters to handle them. Taking the underscores off a few functions and documenting them as an real interface will help that.