Closed SharangC96 closed 2 years ago
@pomadchin created an issue as discussed on gitter
Hey @SharangC96 thx for reporting! I'll put a bug label for now, since it def looks suspisious.
Hey @pomadchin
I think I have found the issue. This is related to https://github.com/locationtech/geotrellis/issues/3213
For a spatial key Rastersouce.tiledLayerRDD
is returning data in different bands.
val multiBandTileLayerRDD = RasterSourceRDD.tiledLayerRDD(
sourceRDD,
layoutLevel.layout,
KeyExtractor.spatialKeyExtractor,
NearestNeighbor,
Some(rasterSummary),
None
)
multiBandTileLayerRDD.withContext(rdd => rdd.mapValues({ mbt =>
MultibandTile(mbt.combineDouble(_.foldLeft(Double.NaN)((l,r) => if(l.isNaN) r else l)))
}))
the above code returns the expected result. I am just picking the non NaN value across the bands.
let me know your thoughts
Hey @SharangC96 oh nice, than the issues with tiling you see is related to the improperly set NoData in the imagery you have.
@SharangC96 Yes, indeed, the NoData is set to -999, is that an expected value?
I didn’t set -999, it is how it is in the geotiff basically
should I try with converting the GeoTiffRasterSource
to FloatUserDefinedNoDataCellType(-999)
cell type ?
I'll try with it once
@SharangC96 hm, yes please, but I'm looking into the code now: https://github.com/locationtech/geotrellis/blob/master/raster/src/main/scala/geotrellis/raster/stitch/Stitcher.scala#L100-L104 and it looks like doesn't take into account NoData values:
def update(colOffset: Int, rowOffset: Int, update: Tile): Unit = {
if (this.cellType.isFloatingPoint) {
cfor(0)(_ < update.rows, _ + 1) { r =>
cfor(0)(_ < update.cols, _ + 1) { c =>
setDouble(c + colOffset, r + rowOffset, update.getDouble(c, r))
}
}
} else {
cfor(0)(_ < update.rows, _ + 1) { r =>
cfor(0)(_ < update.cols, _ + 1) { c =>
set(c + colOffset, r + rowOffset, update.get(c, r))
}
}
}
}
Though that's a really good question what really happens on stitch and why in your case it is important to take into account NoDatas, I'd need to take a closer look.
@SharangC96 hm, how do you pick not nans across bands? the dataset contains only single banded rasters
yes, the geotiffs contains single band rasters. That's why it was okay for me to pick values across bands.
btw converting to FloatUserDefinedNoDataCellType(-999)
didn’t help
@SharangC96 what means to pick values across bands, if all TIFFs are singlebanded?
yes, all tiffs are single banded
val multiBandTileLayerRDD = RasterSourceRDD.tiledLayerRDD(
sourceRDD,
layoutLevel.layout,
KeyExtractor.spatialKeyExtractor,
NearestNeighbor,
Some(rasterSummary),
None
)
multiBandTileLayerRDD.withContext(rdd => rdd.mapValues({ mbt =>
MultibandTile(mbt.combineDouble(_.foldLeft(Double.NaN)((l,r) => if(l.isNaN) r else l)))
}))
so multiBandTileLayerRDD
is returning a multiband tiffs for some spatial keys because of the code here https://github.com/locationtech/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/RasterSourceRDD.scala#L243
I meant the below by "picking values across the bands"
multiBandTileLayerRDD.withContext(rdd => rdd.mapValues({ mbt =>
MultibandTile(mbt.combineDouble(_.foldLeft(Double.NaN)((l,r) => if(l.isNaN) r else l)))
}))
or a better way to do the above would be
val mtlRDD = multiBandTileLayerRDD
.withContext(rdd => rdd.mapValues({ mbt =>
MultibandTile(
bands = mbt.combineDouble(
_.find(v => !v.isNaN).getOrElse(Double.NaN)
)
)
}))
Just pick the first non nan value
@SharangC96 Oh, I see what happens:
FloatingLayoutScheme
(that's what tiledLayerRDD
does)
TLDR;
.reduce(_ merge _)
:// merge "merges" chunks correctly, taking into account the NoData set and the underlying cellType
val mtlRDD = multiBandTileLayerRDD.withContext(_.mapValues(_.bands.reduce(_ merge _)))
I think at this point it makes sense to convert this issue into the enchancement
since this API behavior is 100% not obvious, and definitely not a bug, but some API oversight. Thanks for reporting!
@pomadchin
some of the end up being 2 banded
actually number of bands depend on the grouping on the spatial key like you said, I have even seen 4 bands, and this may increase further if a bigger tile size is chosen
using tile.merge
definitely looks much better.
Now Looking a bit further in this
I was thinking what will happen if original geotiiffs that are being read are multibanded ?
let's say if we had 2 geotiff files only with 3 bands each. Say tile size was big enough that only one tile is formed (so that these two geotiffs get grouped on this spatial key)
Wouldn’t RasterSourceRDD.tiledLayerRDD
produce 6 bands ?
and now if we reduce these 6 bands by
val mtlRDD = multiBandTileLayerRDD.withContext(_.mapValues(_.bands.reduce(_ merge _)))
it feels incorrect as we have merged all these bands
generally I would expect that RasterSourceRDD.tiledLayerRDD
to return with same number (or actually the max number of bands across all the geotiffs) of bands and the merging to happen in the respective band
In terms of code, this is what I mean rewrote this section
val tiledRDD: RDD[(K, MultibandTile)] =
rasterRegionRDD
.groupByKey(partitioner.getOrElse(SpatialPartitioner[K](partitionCount)))
.mapValues(iter => {
val multiBandTileIterable = iter.flatMap(_.raster.toSeq.map(_.tile))
multiBandTileIterable.foldLeft(null: MultibandTile)((acc, mbt) => {
if (acc == null)
mbt
else {
val tileList: ListBuffer[Tile] = scala.collection.mutable.ListBuffer()
for (i <- 0 until Math.max(acc.bandCount, mbt.bandCount)) {
val mergedO =
for {
x <- acc.bandSafe(i)
y <- mbt.bandSafe(i)
} yield x.merge(y)
tileList.append(mergedO.get) //safe to do .get here
}
MultibandTile(tileList)
}
})
})
@SharangC96 yea, you're right, that can be more in case there are more overlaps.
In case the source rasters are multibands this code also converts imagery into hard to manage multiband multichunked rasters; that's why I talked about your particular case in the comment above.
Thanks for mocking the idea, mb I'll create a quick PR for you to test it out; but indeed, we need merging rasters correctly taking into account bands in the multiband case.
That is still possible to merge approprietly even with the current method, though that is def a workaround:
// all raster sources should always have the same bandCount
val bandCount = rasterSources.head.bandCount
val mtlRDD = multiBandTileLayerRDD.withContext { _.mapValues { mbt =>
mbt.bands.grouped(bandCount).map(MultibandTile(_)).reduce(_ merge _)
} }
I think I remembered, we indeed did that after some discussion, since merging is also not what everyone would like to do. Let's keep it as is, and it is more about the API echancement, from my perspective this should be configurable i.e. via a separate argument passed into the tileToLayout
function, which will change the way we produce outputs here.
Yes, I agree. This behaviour should be configurable.
In the current form it is quite problematic though.
@SharangC96 🚀 If you have time will to look into the API I'll try to help you with the PR; I'll try to allocate some time this sprint on this issue as well to get it into the upcoming release; this should be resolved 🤦
@pomadchin I'll be happy to do that, this would be my first open source contribution :)
Describe the bug
Unexpected no data introduced when multiple geotiffs are read and tiled. no data cells are reduced if tile size is reduced.
To Reproduce
Inputs: multiple geotiff files with same crs, they all are not same in dimensions one can be 1000x1000 , 1003x1002 , 5x1000 etc
zipped folder containing input geotiffs https://drive.google.com/file/d/19w1FfM82E0PBtvBc1NmVHNXk_0GC5Zoh/view?usp=sharing
tile size 32
tile size 64
tile size 128
tile size 256
tile size 512
Please note how more and more no data cells are increasing as tile size increases
Stitched tiles should look like this. Please note that this image is from QGIS where I opened all 50 geotiffs together
Environment