rafaqz / DimensionalData.jl

Named dimensions and indexing for julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License
271 stars 38 forks source link

Overhaul DimTable #536

Closed JoshuaBillson closed 11 months ago

JoshuaBillson commented 1 year ago

The current implementation of DimTable has a number of drawbacks. In particular, we need to support the following:

  1. Can handle data with many columns.
  2. Can unfold specified dimensions of an AbstractDimArray into columns.
  3. Can construct an AbstractDimStack from a table.
  4. Compatibility with GeoStats.jl.

To solve this, I've implemented the WideDimTable type. I've also implemented a new column, MergedDimColumn, which wraps two or more DimColumns to produce a tuple of points under the :geometry header. To combine dimensions, we can pass mergedims=true. We can also treat a single dimension of an AbstractDimArray as though it was layers of an AbstractDimStack by passing layersfrom=Dimension. Here's an example of the interface thus far:

julia> dimarray = DimArray(rand(Float32, 512, 512, 5), (X, Y, Dim{:band}));

julia> WideDimTable(dimarray) |> DataFrame
1310720×4 DataFrame
     Row │ X      Y      band   value      
         │ Int64  Int64  Int64  Float32    
─────────┼─────────────────────────────────
       1 │     1      1      1  0.366132
       2 │     2      1      1  0.00185609
       3 │     3      1      1  0.718482
       4 │     4      1      1  0.13569
    ⋮    │   ⋮      ⋮      ⋮        ⋮
 1310717 │   509    512      5  0.78933
 1310718 │   510    512      5  0.929937
 1310719 │   511    512      5  0.455286
 1310720 │   512    512      5  0.252793
                       1310712 rows omitted

julia> WideDimTable(dimarray, layersfrom=Dim{:band}) |> DataFrame
262144×7 DataFrame
    Row │ X      Y      band_1      band_2     band_3      band_4     band_5    
        │ Int64  Int64  Float32     Float32    Float32     Float32    Float32   
────────┼───────────────────────────────────────────────────────────────────────
      1 │     1      1  0.366132    0.466749   0.340366    0.459397   0.17998
      2 │     2      1  0.00185609  0.219978   0.00102282  0.729957   0.295339
      3 │     3      1  0.718482    0.0594266  0.80399     0.604377   0.076027
      4 │     4      1  0.13569     0.326482   0.464005    0.406371   0.227705
   ⋮    │   ⋮      ⋮        ⋮           ⋮          ⋮           ⋮          ⋮
 262141 │   509    512  0.922686    0.0885217  0.616942    0.292217   0.78933
 262142 │   510    512  0.960718    0.931053   0.269816    0.220552   0.929937
 262143 │   511    512  0.843416    0.339335   0.0562063   0.93825    0.455286
 262144 │   512    512  0.165067    0.60916    0.878569    0.731508   0.252793
                                                             262136 rows omitted

julia> WideDimTable(dimarray, layersfrom=Dim{:band}, mergedims=true) |> DataFrame
262144×6 DataFrame
    Row │ geometry    band_1      band_2     band_3      band_4     band_5    
        │ Tuple…      Float32     Float32    Float32     Float32    Float32   
────────┼─────────────────────────────────────────────────────────────────────
      1 │ (1, 1)      0.366132    0.466749   0.340366    0.459397   0.17998
      2 │ (2, 1)      0.00185609  0.219978   0.00102282  0.729957   0.295339
      3 │ (3, 1)      0.718482    0.0594266  0.80399     0.604377   0.076027
      4 │ (4, 1)      0.13569     0.326482   0.464005    0.406371   0.227705
   ⋮    │     ⋮           ⋮           ⋮          ⋮           ⋮          ⋮
 262141 │ (509, 512)  0.922686    0.0885217  0.616942    0.292217   0.78933
 262142 │ (510, 512)  0.960718    0.931053   0.269816    0.220552   0.929937
 262143 │ (511, 512)  0.843416    0.339335   0.0562063   0.93825    0.455286
 262144 │ (512, 512)  0.165067    0.60916    0.878569    0.731508   0.252793
                                                           262136 rows omitted
rafaqz commented 1 year ago

This looks great. I'm wondering about the relationship of DimTable and WideDimeTable. Its good that we keep both for backwards compat and type stability reasons.

But maybe we should as much as possible keep the interface the same, and add layersfrom=nothing, mergedims=false keywords to DimTable as well?

JoshuaBillson commented 1 year ago

This looks great. I'm wondering about the relationship of DimTable and WideDimeTable. Its good that we keep both for backwards compat and type stability reasons.

But maybe we should as much as possible keep the interface the same, and add layersfrom=nothing, mergedims=false keywords to DimTable as well?

I mostly kept DimTable to benchmark it against WideDimTable. If we kept both, under what conditions would we use one over the other? Are we planning on exposing an interface for users to explicitly choose?

rafaqz commented 1 year ago

Im happy to merge them, but we need to keep something called DimTable.

Im also not used to this PR style of adding new code then switching it out layer. The downside is not seeing the specific changes being made line by line.

If you want to remove tge current DimTable this code needs to diff it pretty cleanly.

JoshuaBillson commented 1 year ago

I've implemented an unmergedims method for undoing the effects of mergedims on both AbstractDimArray and AbstractDimStack. This should get us a good way towards restoring a stack or array from a table.

julia> dimarray = DimArray(rand(Float32, 512, 512, 5), (X, Y, Dim{:band}));

julia> mergedarray = mergedims(dimarray, (X,Y)=>:geometry)
5×262144 DimArray{Float32,2} with dimensions: 
  Dim{:band},
  Dim{:geometry} MergedLookup{Tuple{Int64, Int64}} Tuple{Int64, Int64}[(1, 1), (2, 1), …, (511, 512), (512, 512)] X, Y
  (1, 1)    (2, 1)    (3, 1)      (4, 1)    (5, 1)    (6, 1)    (7, 1)     (8, 1)    …   (508, 512)   (509, 512)   (510, 512)   (511, 512)   (512, 512)
 0.408622  0.886254  0.0976083   0.923665  0.425342  0.776444  0.171379   0.0781628     0.892641     0.0971096    0.276245     0.858681     0.405161
 0.899253  0.418132  0.0659314   0.339008  0.980397  0.496013  0.0799317  0.883041      0.873571     0.70769      0.808338     0.316129     0.994636
 0.469823  0.410051  0.00577027  0.512094  0.368832  0.216379  0.978313   0.763036      0.715357     0.340662     0.561662     0.166867     0.482182
 0.909567  0.287847  0.373506    0.121887  0.103681  0.52259   0.726426   0.262001      0.537583     0.392537     0.907184     0.57249      0.0316634
 0.141684  0.509818  0.94262     0.606817  0.172232  0.548815  0.32792    0.898369   …  0.531061     0.248262     0.758778     0.493107     0.10552

julia> unmerged = unmergedims(mergedarray, dims(dimarray))
512×512×5 DimArray{Float32,3} with dimensions: X, Y, Dim{:band}
[:, :, 1]
 0.408622   0.1928     0.363784   0.20637    0.0623429  0.540009  0.870785   0.6065      …  0.380542   0.708444   0.621855   0.654291   0.307949    0.583784
 0.886254   0.222856   0.810224   0.247393   0.164376   0.529767  0.0276215  0.449118       0.309448   0.0370631  0.0928274  0.727497   0.95558     0.579522
 0.0976083  0.47824    0.0147222  0.642226   0.9596     0.127347  0.137955   0.247403       0.678952   0.43045    0.582117   0.532908   0.219411    0.786123
 0.923665   0.393615   0.434965   0.946112   0.894384   0.950089  0.50744    0.408371       0.0949435  0.265479   0.585809   0.624134   0.755077    0.373703
 0.425342   0.220184   0.862294   0.944012   0.75059    0.755926  0.486645   0.838939       0.802087   0.712398   0.581404   0.476088   0.514338    0.513238
 0.776444   0.27527    0.940428   0.425444   0.5388     0.226998  0.197969   0.376402    …  0.948172   0.157085   0.702399   0.0637432  0.579163    0.123023
 0.171379   0.643685   0.814836   0.162074   0.205144   0.631337  0.651405   0.860217       0.583075   0.807928   0.618025   0.40905    0.0272511   0.418449
 0.0781628  0.318983   0.85593    0.493423   0.456952   0.163325  0.214463   0.892753       0.0738822  0.376432   0.018835   0.664184   0.16614     0.0338068
 0.895283   0.428974   0.66919    0.743919   0.402517   0.699578  0.166144   0.504705       0.471096   0.526914   0.256205   0.669583   0.979642    0.713561
 0.607489   0.676163   0.337688   0.595755   0.261997   0.370041  0.994099   0.87605        0.61286    0.617319   0.0632272  0.753327   0.00494283  0.0277535
 ⋮                                                      ⋮                                ⋱                                              ⋮           
 0.505413   0.932647   0.466523   0.176984   0.519687   0.879219  0.0315067  0.453979    …  0.357632   0.643493   0.867573   0.369308   0.908243    0.41651
 0.433541   0.176856   0.982449   0.620943   0.874699   0.85108   0.134078   0.00542814     0.215931   0.588794   0.308641   0.174514   0.492008    0.27178
 0.366505   0.224207   0.907218   0.903873   0.491976   0.980417  0.463662   0.931333       0.138668   0.0722253  0.0868067  0.826775   0.812139    0.441321
 0.235435   0.286471   0.402834   0.830526   0.594737   0.383988  0.312103   0.348005       0.10901    0.940322   0.692892   0.81595    0.967958    0.677501
 0.850423   0.586668   0.253597   0.274876   0.997255   0.576228  0.0167348  0.0360385      0.587654   0.626096   0.952066   0.645478   0.482546    0.379
 0.477778   0.698586   0.237578   0.646602   0.717585   0.644979  0.729319   0.622078    …  0.684544   0.530124   0.92819    0.947521   0.971794    0.892641
 0.52555    0.299894   0.0815916  0.0887221  0.982812   0.896342  0.659178   0.416482       0.701961   0.518297   0.864392   0.403778   0.528058    0.0971096
 0.135391   0.0488768  0.390891   0.467249   0.559043   0.609367  0.404012   0.843719       0.602538   0.0817617  0.0658015  0.943544   0.0492677   0.276245
 0.502091   0.0233992  0.843368   0.827739   0.210922   0.142626  0.693656   0.524975       0.177573   0.25404    0.256602   0.493337   0.823688    0.858681
 0.33225    0.0327858  0.743641   0.475116   0.412457   0.970831  0.504166   0.0251963      0.462065   0.877883   0.299307   0.44449    0.986187    0.405161
[and 4 more slices...]

julia> all(dimarray .== unmerged)
true
rafaqz commented 1 year ago

We might want to rebase this against #477 so the mergedims code is attributed to @sethaxen and the other changes are included

codecov-commenter commented 1 year ago

Codecov Report

Merging #536 (59f10a7) into main (4ce2294) will decrease coverage by 6.52%. Report is 81 commits behind head on main. The diff coverage is 86.91%.

@@            Coverage Diff             @@
##             main     #536      +/-   ##
==========================================
- Coverage   89.64%   83.13%   -6.52%     
==========================================
  Files          39       40       +1     
  Lines        2743     3297     +554     
==========================================
+ Hits         2459     2741     +282     
- Misses        284      556     +272     
Files Changed Coverage Δ
src/DimensionalData.jl 100.00% <ø> (ø)
src/Dimensions/Dimensions.jl 100.00% <ø> (ø)
src/LookupArrays/LookupArrays.jl 100.00% <ø> (ø)
src/interface.jl 100.00% <ø> (ø)
src/plotrecipes.jl 82.57% <0.00%> (-1.93%) :arrow_down:
src/set.jl 94.44% <ø> (ø)
src/stack/methods.jl 89.47% <0.00%> (-6.69%) :arrow_down:
src/array/show.jl 90.74% <25.00%> (-4.36%) :arrow_down:
src/LookupArrays/indexing.jl 75.00% <50.00%> (+15.00%) :arrow_up:
src/stack/show.jl 97.50% <66.66%> (-2.50%) :arrow_down:
... and 14 more

... and 6 files with indirect coverage changes

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

rafaqz commented 12 months ago

Is this one ready to merge? Would be good to get it in soon for the breaking release (I'm assuming it is slightly breaking for Dimtable? If not take your time.)

JoshuaBillson commented 12 months ago

Is this one ready to merge? Would be good to get it in soon for the breaking release (I'm assuming it is slightly breaking for Dimtable? If not take your time.)

I've finished adding new test cases and updated the docs. We're passing all the original test cases, so I don't think there should be any breaking changes. Nonetheless, it would be best to do a breaking release just to be safe.

I still need to implement a method for restoring an AbstractDimArray or AbstractDimStack from a table, but we can probably make that a separate PR.

rafaqz commented 11 months ago

Ok to merge?

JoshuaBillson commented 11 months ago

Ok to merge?

Yes, I think we can merge.

rafaqz commented 11 months ago

Thanks. Great PR too, very much appreciated.