rafaqz / DimensionalData.jl

Named dimensions and indexing for julia arrays and other data
https://rafaqz.github.io/DimensionalData.jl/stable/
MIT License
271 stars 38 forks source link

Check that lookup properties match in broadcasts #517

Open rafaqz opened 1 year ago

rafaqz commented 1 year ago

@sethaxen and @felixcremer you may have opinions on this, as like cat this may break some downstream code.

I'm hitting annoying issues broadcasting rasters, some with the Y dimension reversed, some without (reversed Y is very common in spatial data). I shouldn't be able to broadcast them together when the order is reversed, but I can, because only the dimension type is checked with comparedims during broadcasts. When there is a lot of data handling code it can be hard to track down where this happens.

Checking order before broadcasting would reduce the potential for these bugs. But as broadcast can't have options passed to it easily, we need to make sure any changes to it are not debilitating for anyone.

So I am proposing:

  1. Prevent any broadcast where one dimension is ForwardOrdered and the other is ReverseOrdered. This doesn't seem controversial to me, and is a compile time comparison. It would be achieved with an order=true keyword to comparedims.

Other possibilities that also make sense:

  1. Prevent broadcasts between any Ordered and Unordered - this could be assumed with the order keyword, so we need to decide clearly either way.
  2. Prevent broadcasts between Categorical and Sampled. Probably this never makes sense to do.
  3. Prevent broadcasts between Regular and Irregular. Probably this also never makes sense to do, and catches different valued lookups without needing a runtime check.

A stronger check that I'm hesitant to do:

  1. Prevent broadcasts where the lookup values do not match and have the same Locus. This one has a runtime cost, and potentially more serious problem where floating point error in ranges converted to vectors causes false == results in comparisons. That would be pretty annoying, so I'm hesitant to do this. Possibly isapprox would be more appropriate, but again there is no way for users to set the allowed tolerance. With Regular lookups we could also just check that the start/stop values match to reduce the run-time cost, with Irregular we would need to check all values to get the same guarantee.

Things that should work but need better return type rules

Currently broadcast return types are determined by whichever is found first. Instead we need clear rules.

  1. Broadcasts between Points and Intervals - should always return Points ? We can assume the interval value is at least a good approximation for the point, but we do not know if the point value is a good approximation for the whole interval.
  2. Broadcasts between NoLookup and Sampled or Categorical should always return Sampled or Categorical ?

To be clear everything should always be able to broadcast with NoLookup if the dimension itself matches.

Also if users want to break any of these rules, they can broadcast over parent(dimarray) for the second/third etc array and get the current behavior.

rafaqz commented 1 year ago

Thinking about this we should probably start by adding a warning for these things.

sethaxen commented 1 year ago

Thinking about this we should probably start by adding a warning for these things.

Agreed with this! RE options 1-4, I'm still thinking through whether these would cause any problems for my use cases. I agree option 5 seems too extreme.

felixcremer commented 4 months ago

I am currently working through the xarray 45 minute tutorial and realized, that in DimensionalData we would happily broadcast two arrays together which happen to have the same size in the dimensions with totally different dimension values. Could we make it so, that we could specify the behaviour globally in some dictionary so that people could change the default behaviour without having to specify it on every broadcast.

I think we could have a global Dict from which we would get the keyword arguments for the comparedims function so that we could change that from the user side. Similar to the setup of the YAXDefaults in YAXArrays https://github.com/JuliaDataCubes/YAXArrays.jl/blob/b4bf44866fbd9c5fe1a78f42d3ff0b454a4102a2/src/YAXArrays.jl#L12C1-L20C2

rafaqz commented 4 months ago

Yeah this has been a thought for a while. Currently we check the name and order but not the values. And sometimes you don't want to check the values (e.g. with NoLookup or with slight floating point differences).

So the global setting makes sense, it would be hard to incorporate into broadcast syntax.

It may be faster to have a Ref{NamedTuple{names,Tuple{Bool,Bool,etc}}} instead of a Dict so that access is faster, as we will always have the same set of options.

(We could also allow "error levels" like allow, warn, and error for when problems are found)