rafaqz / DimensionalData.jl

Named dimensions and indexing for julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License
269 stars 37 forks source link

Broadcasting over a `DimArray` drops the name field #580

Open JoshuaBillson opened 8 months ago

JoshuaBillson commented 8 months ago

Problem:

I've found that broadcasting over a DimArray causes the name field to be set to Symbol("")

I find this behaviour to be quite unexpected. Moreover, it's difficult to feed a broadcasted array into a table, since the column names depend on the array name.

Minimal Example:

julia> array = DimArray(zeros(2, 2), (:X, :Y), name=:Band_1)
2×2 DimArray{Float64,2} Band_1 with dimensions: Dim{:X}, Dim{:Y}
 0.0  0.0
 0.0  0.0

julia> array.name
:Band_1

julia> (array .* 1.0).name
Symbol("")
rafaqz commented 8 months ago

Yeah, there are for and against for this. I think the question isnt about broadcast but if similar keeps the name or not.

Often the name says something about the data that the broadcast totally changes. And raw similar contains no data and is completely uninitialised. So having a name is a bit odd.

But IDK, I also see your point. Im open to changing it. Maybe try a PR changing similar and see what breaks.

lazarusA commented 6 months ago

You should rename it to the previous name, plus the operation applied.

JoshuaBillson commented 6 months ago

You should rename it to the previous name, plus the operation applied.

The use-case that prompted this issue was converting digital numbers into reflectance for satellite imagery. For example, calculating the surface-reflectance for Sentinel 2 imagery is done as follows:

sentinel .* 0.0001f0

In this case, if a layer has the name :B04, I would like it to retain this after converting it to reflectance. Otherwise, downstream code that depends on the band name will break. The current solution is to rely on rebuild to restore the original name, but this can be quite tedious in practice and is easy to forget about. Layer names could also get long after applying multiple operations in sequence, such as (landsat .* 0.0000275f0) .+ -0.2f0.

rafaqz commented 6 months ago

This is the problem, what to do in this case:

mycustomfunction.(sentinel .* 0.0001f0 .* some_other_dim_array.^2) ./ yet_another_dim_array

Its a broadcast just like yours, and at the level of similar we cant tell which is which.

Doing what you want probably requires putting name rules into our broadcast mechanisms. Its possible, but not easy.

(Or we can let precedence win, maybe that will be predictable enough that users can rearrange their functions so the right name "wins" but even so should this still be called :B04 ?)

JoshuaBillson commented 6 months ago

I suppose that one could use modify in such cases, which will retain the name of the desired array. You could also extract the data of all arrays whose name you don't want to retain, so that only one operand is a DimArray.

Perhaps we should consider that broadcasting directly over a DimArray is a bad idea? As you pointed out, it can be difficult to determine the identity of the output array based solely on the input. It might be better to provide a method that requires users to specify the name of the output.

rafaqz commented 6 months ago

Broadcasting is a huge use case of DimArray and arrays generally in Julia. Its how everything works internally in Rasters.jl - DiskArrays.jl has lazy broadcasting over big data so we leverage this.

We could (should) allow you to specify the name in broadcast_dims which we own - but it doesn't have the nice syntax of base broadcasting.

This is just a hard problem.

We want the best generic native julia syntax, and we want names to propagate intelligently.