locationtech / geotrellis

GeoTrellis is a geographic data processing engine for high performance applications.
http://geotrellis.io
Other
1.33k stars 360 forks source link

`CellType.union` is not commutative for "raw" cell types #3154

Open metasim opened 4 years ago

metasim commented 4 years ago

Context: https://gitter.im/geotrellis/geotrellis?at=5dcd9656c26e8923c424ff3a

If one assumes that union should be commutative (as is the Set union operator) this test fails:

UShortConstantNoDataCellType.union(UShortCellType)should be(UShortCellType.union(UShortConstantNoDataCellType))
pomadchin commented 4 years ago

@metasim do you think that the combine should always return a ‘raw’ type (by that I mean combining a type without no data defined with a type that has no data would be the type with out no data - in your example it would be just ShortCellType)? Or what is the conclusion?

metasim commented 4 years ago

@pomadchin : I'm probably not the right person to ask. @vpipkt is the "real" data scientist that found this, working with data products where this is relevant/important, and can better speak to the proper, scientifically sound semantics. I'd lean toward the non-raw type for usability, but can see how that would introduce interpretation into the data that wasn't intended, at least for integral types. I'll let @vpipkt chime in lest I muddy the waters.

vpipkt commented 4 years ago

My thought is:

1) we worked around the behavior we were seeing. 2) the confusing bit for me was in doing an op that combines two tiles with such similar types, the returned tile has the right hand side's cell type because of this

I also see in the other combine calls, the LHS/RHS may be inverted.

I am not really sure what the "union" should be. But based on the code paths there, it seems to perhaps an insufficient consideration just to go with the union. Hopefully that makes sense.

metasim commented 4 years ago

Given that, I think the question at hand is what experience do we want the user to have when operations use union to determine output cell type. As shorthand, I'll use cell type names int8 and int8raw as a proxy for the more general "has a NoData value" and "does not have a NoData value".

Options seem to be (assuming commutativity):

  1. int8.union(int8raw) => int8
    • Pro: For products like L8 where the NoData value is 0, but not encoded in the header (and therefore "raw") things like masking, statistics, local opts, etc. will behave correctly without the user having to understand the oddities of the data product
    • Con: For non-L8 data, we're assuming 0 doesn't have meaning. I could image a categorical target label raster using 0 as a category, and not setting a NoData value because it's not a common consideration. Such an error may be hard to detect/debug.
  2. int8.union(int8raw) => int8raw
    • Pro: No chance of inadvertently marking valid data as missing data.
    • Con: Novice users may be bewildered by the resultant output (we've experienced this a lot with L8), and would have a really hard time tracking down what they need to do to get things like maksing to work.
  3. int8.union(int8raw) and int8raw.union(int8) throw some Exception
    • Pro: Stops analysis in the tracks, refusing to do an ambiguous operation
    • Con: Still requires novice users to understand something about cell types to fix issue (good error message may help). Only detected at runtime, and may terminate an important or long running job.
pomadchin commented 4 years ago

@metasim 3. is definitely a bad behavior. (I think we need to avoid introducing new 'unpure' functions into our codebase; it is bad even in the modern Java)

pomadchin commented 4 years ago

@metasim @vpipkt have you come up with some 'union' function that works for your use case?

vpipkt commented 4 years ago

I have not. As stated above, I have worked around it, basically ensuring that the function I'm writing will always return my desired cell type by doing a binary type operation on the RHS operand.

metasim commented 4 years ago

I vote for option 1.

esmeetu commented 4 years ago

i have a issue about ArrayTile.combine, maybe a little bit related this issue. there is sample code

val c = a.convert(IntCellType).combine(b) {...}

as a is ShortConstantNoDataArrayTile with 16bits cellType ShortConstantNoDataCellType, b is CroppedTile(ShortUserDefinedNoDataArrayTile) with 16bits cellType ShortUserDefinedNoDataCellType. After convert, it will be IntRawArrayTile.combine(b) {...} which IntRawArrayTile contains 32bits IntCellType. i want get a c with IntCellType but get a ShortUserDefinedNoDataArrayTile. Problem is there is not union process in CroppedTile.combine. i have no idea.

pomadchin commented 4 years ago

hey @esmeetu can you clarify a bit what is going on?

val c = a.convert(IntCellType).combine(b) {...} what is the cellType of c in this case? How do you get ShortUserDeinedNoData?

According to this code it uses self.cellType (in your case of a) to derive the cellType of c.

pomadchin commented 4 years ago

@esmeetu you're also welcome to join our gitter channel to talk about it more or file a separate issue as it seems to be unrelated to this particular one.

esmeetu commented 4 years ago

@pomadchin thanks, open a new issue in #3170