rafaqz / DimensionalData.jl

Named dimensions and indexing for julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License
282 stars 42 forks source link

Utilities for managing deeply nested data? #862

Open kapple19 opened 3 days ago

kapple19 commented 3 days ago

TL; DR

I'm looking to simplify code for managing nested DimArrays and DimStacks. E.g. feeding nested data to AlgebraOfGraphics. Maybe a add_dims function? I've prototyped one below that I'm happy to PR.

Motivating Example

My work typically involves exploring nested data structures that are best not laid out as a long-format data table. A (non-obvious) example is visualised in the following plot:

image

where the titles are bearings [deg], the horizontal axis is sound speed [m/s], the vertical axis is depth [m] and the colour is range [m].

And this is still a simplified version of the type of data I manage probably half my days at work.

The data plotted above is generated by the following code:

##
using DimensionalData
using CairoMakie, AlgebraOfGraphics

using DimensionalData: dims

##
function munk_profile(z::Real; ϵ = 7.37e-3)
    z̃ = 2(z/1300 - 1)
    return 1500(
        1 + ϵ * (
            z̃ - 1 + exp(-z̃)
        )
    )
end

rand_munk_profile(z::Real) = munk_profile(z) + 5rand()

##
dimprofile(z, c) = DimArray(c, Dim{:z}(z); name = :c)
dimslice(r, zc) = DimArray(zc, Dim{:r}(r); name = :slice)
dimenvironment(θ, rzc) = DimArray(rzc, Dim{:θ}(θ); name = :environment)

rand_depths() = [0 : 10 : 50; 5e3rand(rand(20:30))] |> sort! |> unique!
rand_ranges() = [0; 10e3rand(rand(2:5))] |> sort! |> unique!

rand_dimprofile() = rand_depths() |> zs -> dimprofile(zs, rand_munk_profile.(zs))
rand_dimslice() = rand_ranges() |> rs -> dimslice(rs, [rand_dimprofile() for _ in rs])
rand_environment() = (45 |> step -> 0 : step : 360-step) |> θs -> dimenvironment(θs, [rand_dimslice() for _ in θs])

env = rand_environment()

I can show the nested structures of env by:

julia> dims(env)
(↓ θ Sampled{Int64} 0:45:315 ForwardOrdered Regular Points)

julia> dims(env[1])
(↓ r Sampled{Float64} [0.0, 3597.4471629977843, 5534.263563590665, 7028.512437622645] ForwardOrdered Irregular Points)

julia> dims(env[1][1])
(↓ z Sampled{Float64} [0.0, 10.0, …, 4977.663346971497, 4990.064674258991] ForwardOrdered Irregular Points)

Each bearing Dim{:θ} has a collection of ranges Dim{:r} whose values aren't shared between each θ. Likewise each Dim{:r} has a collection of depth-sound speed Dim{:z}-Dim{:c} pairs whose z values aren't shared between each r.

(In other words, vector data I guess 😅 but with structures resembling raster data, DimensionalData is the best I've found so far for managing such data.)

I've played around with ways to efficiently and flexibly visualise the above in different ways. My current preference involves the auxiliary function add_dims I've defined, which behaves like reshape but for DimArrays.

function add_dims(dimarray, newdims...)
    alldims = (dims(dimarray)..., newdims...)
    return DimArray(
        reshape(
            dimarray |> parent,
            alldims .|> length
        ),
        alldims;
        name = dimarray.name,
        metadata = dimarray.metadata,
        refdims = dimarray.refdims
    )
end

The idea for add_dims is that when you access a DimArray inside another DimArray, that inner DimArray doesn't hold information on the Dim(s) you used to call it.

I'll define the compass bearing relationships:

lat_cols = ["N"; ""; "S"]
lon_rows = ["W" "" "E"]
compass_dirs = lat_cols .* lon_rows
compass_bearings = 0 : 45 : 360-45
compass_grid = Dict(
    θ => 2 .+ (
        -round(Int, cosd(θ), RoundFromZero),
        round(Int, sind(θ), RoundFromZero),
    )
    for θ in compass_bearings
)
compass_directions = Dict(
    θ => compass_dirs[compass_grid[θ]...]
    for θ in compass_bearings
)

and then plot like so:

draw(
    sum(
        add_dims(
            env[θ = At(θ)][r = At(r)],
            Dim{:θ}([θ]),
            Dim{:r}([r])
        ) |> data
        for θ in 45 |> step -> 0 : step : 360-step
        for r in dims(env[θ = At(θ)], :r) |> reverse
    )
    # * mapping(:c, :z, color = :r, layout = :θ => (θ -> compass_directions[θ]) => presorted)
    * mapping(:c, :z, color = :r, layout = :θ => nonnumeric)
    * visual(Lines),
    scales(;
        Color = (; colormap = Reverse(:blues)),
        Layout = (; palette = [compass_grid[θ] for θ in compass_bearings])
    );
    axis = (; yreversed = true)
)

Okay so, what's your question?

My question is, does anyone know any better ways to manage/handle this data, including interaction with AlgebraOfGraphics? The above is still only a simplified example of the type of data I handle about half by work days. Re AoG I'll be looking into pagination next as well, excited to see what it can do.

If you guys like add_dims I'm happy to PR it with docs and example of course.

I'm partially open to alternatives. I'd like to avoid pregrouped, it makes my soul hurt and in my opinion goes against the whole declarative syntax idea AoG was made for (but I still understand why pregrouped exists).

Are there other packages of storage structures that work better on this type of vector data but allows dimensional names and indexing, including grammar-of-graphics interoperability like what AoG provides?

rafaqz commented 2 days ago

This makes me think refdims should also be table columns (just the same value repeated), so you could be setting refdims instead of reshaping.

It's pretty much what they are meant for - tracking where in a larger array a slice comes from.

We could even add a DimNestedArray object that holds other DimArray/DimStack that updates refdims automatically on indexing?

kapple19 commented 1 day ago

Oh I see. I'll check out refdims then.

This makes me think refdims should also be table columns

Are you saying it doesn't implement the Tables.jl interface so it won't work out-of-the-box with AoG? I'm about to check it out now and see what it can do.

We could even add a DimNestedArray object that holds other DimArray/DimStack that updates refdims automatically on indexing?

I see. From my superficial understanding, DimArray already has the features for nesting multiple layers, but you have an idea for a new object that would work better?

The automatic refdims update upon indexing would indeed simplify the syntax for passing to AoG.data.

I also found eachslice.