locationtech / geotrellis

GeoTrellis is a geographic data processing engine for high performance applications.
http://geotrellis.io
Other
1.33k stars 362 forks source link

Unexpected no data introduced when multiple geotiffs are read and tiled #3449

Closed SharangC96 closed 2 years ago

SharangC96 commented 2 years ago

Describe the bug

Unexpected no data introduced when multiple geotiffs are read and tiled. no data cells are reduced if tile size is reduced.

To Reproduce

  def stitchLocalGeoTiffs(): Unit = {
    private val sparkConf = new SparkConf()
      .setMaster("local[1]")
      .setAppName("processor")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "2040m")

    val spark: SparkSession = SparkSession
      .builder()
      .config(sparkConf)
      .getOrCreate()

    implicit val sc: SparkContext = spark.sparkContext
    import spark.implicits._

    val layoutScheme = FloatingLayoutScheme(tileSize = 1024)

    val rasterSources: Seq[RasterSource] = getRasterSources("/some/folder/")
    val sourceRDD = sc.parallelize(rasterSources)

    val rasterSummary = RasterSummary.fromRDD(sourceRDD)

    val layoutLevel = rasterSummary.levelFor(layoutScheme)

    println(rasterSummary.toString)

    val multiBandTileLayerRDD = RasterSourceRDD.tiledLayerRDD(
      sourceRDD,
      layoutLevel.layout,
      KeyExtractor.spatialKeyExtractor,
      NearestNeighbor,
      Some(rasterSummary),
      None
    )

    val mtlRDD = multiBandTileLayerRDD.withContext(rdd => rdd.mapValues({ t =>
      t.band(0)
    }))

    GeoTiff(mtlRDD.stitch(), mtlRDD.metadata.crs)
      .write(
        "some/location/stitched_individual/stitched.tif"
      )
  }

  def getRasterSources(path: String): Seq[RasterSource] = {
    val d = new File(path)
    val fileSeq = if (d.exists && d.isDirectory) {
      d.listFiles
        .filter(_.isFile)
        .filter(f => f.getName.endsWith(".tif"))
        .toSeq
    } else {
      Seq[File]()
    }

    fileSeq
      .map(f => GeoTiffRasterSource(GeoTiffPath(f.getAbsolutePath))
      )
  }

Inputs: multiple geotiff files with same crs, they all are not same in dimensions one can be 1000x1000 , 1003x1002 , 5x1000 etc

zipped folder containing input geotiffs https://drive.google.com/file/d/19w1FfM82E0PBtvBc1NmVHNXk_0GC5Zoh/view?usp=sharing

tile size 32

image

tile size 64

image

tile size 128

image

tile size 256

image

tile size 512

image

Please note how more and more no data cells are increasing as tile size increases

Stitched tiles should look like this. Please note that this image is from QGIS where I opened all 50 geotiffs together

image

Environment

SharangC96 commented 2 years ago

@pomadchin created an issue as discussed on gitter

pomadchin commented 2 years ago

Hey @SharangC96 thx for reporting! I'll put a bug label for now, since it def looks suspisious.

SharangC96 commented 2 years ago

Hey @pomadchin
I think I have found the issue. This is related to https://github.com/locationtech/geotrellis/issues/3213

For a spatial key Rastersouce.tiledLayerRDD is returning data in different bands.

val multiBandTileLayerRDD = RasterSourceRDD.tiledLayerRDD(
      sourceRDD,
      layoutLevel.layout,
      KeyExtractor.spatialKeyExtractor,
      NearestNeighbor,
      Some(rasterSummary),
      None
    )

multiBandTileLayerRDD.withContext(rdd => rdd.mapValues({ mbt =>
      MultibandTile(mbt.combineDouble(_.foldLeft(Double.NaN)((l,r) => if(l.isNaN) r else l)))
    }))

the above code returns the expected result. I am just picking the non NaN value across the bands.

let me know your thoughts

pomadchin commented 2 years ago

Hey @SharangC96 oh nice, than the issues with tiling you see is related to the improperly set NoData in the imagery you have.

pomadchin commented 2 years ago

@SharangC96 Yes, indeed, the NoData is set to -999, is that an expected value?

SharangC96 commented 2 years ago

I didn’t set -999, it is how it is in the geotiff basically

should I try with converting the GeoTiffRasterSource to FloatUserDefinedNoDataCellType(-999) cell type ?

I'll try with it once

pomadchin commented 2 years ago

@SharangC96 hm, yes please, but I'm looking into the code now: https://github.com/locationtech/geotrellis/blob/master/raster/src/main/scala/geotrellis/raster/stitch/Stitcher.scala#L100-L104 and it looks like doesn't take into account NoData values:

def update(colOffset: Int, rowOffset: Int, update: Tile): Unit = {
    if (this.cellType.isFloatingPoint) {
      cfor(0)(_ < update.rows, _ + 1) { r =>
        cfor(0)(_ < update.cols, _ + 1) { c =>
          setDouble(c + colOffset, r + rowOffset, update.getDouble(c, r))
        }
      }
    } else {
      cfor(0)(_ < update.rows, _ + 1) { r =>
        cfor(0)(_ < update.cols, _ + 1) { c =>
          set(c + colOffset, r + rowOffset, update.get(c, r))
        }
      }
    }
  }

Though that's a really good question what really happens on stitch and why in your case it is important to take into account NoDatas, I'd need to take a closer look.

pomadchin commented 2 years ago

@SharangC96 hm, how do you pick not nans across bands? the dataset contains only single banded rasters

SharangC96 commented 2 years ago

yes, the geotiffs contains single band rasters. That's why it was okay for me to pick values across bands.

btw converting to FloatUserDefinedNoDataCellType(-999) didn’t help

pomadchin commented 2 years ago

@SharangC96 what means to pick values across bands, if all TIFFs are singlebanded?

SharangC96 commented 2 years ago

yes, all tiffs are single banded

val multiBandTileLayerRDD = RasterSourceRDD.tiledLayerRDD(
      sourceRDD,
      layoutLevel.layout,
      KeyExtractor.spatialKeyExtractor,
      NearestNeighbor,
      Some(rasterSummary),
      None
    )

multiBandTileLayerRDD.withContext(rdd => rdd.mapValues({ mbt =>
      MultibandTile(mbt.combineDouble(_.foldLeft(Double.NaN)((l,r) => if(l.isNaN) r else l)))
    }))

so multiBandTileLayerRDD is returning a multiband tiffs for some spatial keys because of the code here https://github.com/locationtech/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/RasterSourceRDD.scala#L243

I meant the below by "picking values across the bands"

multiBandTileLayerRDD.withContext(rdd => rdd.mapValues({ mbt =>
      MultibandTile(mbt.combineDouble(_.foldLeft(Double.NaN)((l,r) => if(l.isNaN) r else l)))
    }))

or a better way to do the above would be

     val mtlRDD = multiBandTileLayerRDD
      .withContext(rdd => rdd.mapValues({ mbt =>
        MultibandTile(
          bands = mbt.combineDouble(
            _.find(v => !v.isNaN).getOrElse(Double.NaN)
          )
        )
      }))

Just pick the first non nan value

pomadchin commented 2 years ago

@SharangC96 Oh, I see what happens:

TLDR;

  1. This behavior configurable and clear from the API perspective (means API changes / improvements)
  2. The solution you propose is correct in your particular case, the slightly better (api-wise) is to use .reduce(_ merge _):
// merge "merges" chunks correctly, taking into account the NoData set and the underlying cellType
val mtlRDD = multiBandTileLayerRDD.withContext(_.mapValues(_.bands.reduce(_ merge _)))
pomadchin commented 2 years ago

I think at this point it makes sense to convert this issue into the enchancement since this API behavior is 100% not obvious, and definitely not a bug, but some API oversight. Thanks for reporting!

SharangC96 commented 2 years ago

@pomadchin

some of the end up being 2 banded

actually number of bands depend on the grouping on the spatial key like you said, I have even seen 4 bands, and this may increase further if a bigger tile size is chosen

using tile.merge definitely looks much better.

Now Looking a bit further in this

I was thinking what will happen if original geotiiffs that are being read are multibanded ?

let's say if we had 2 geotiff files only with 3 bands each. Say tile size was big enough that only one tile is formed (so that these two geotiffs get grouped on this spatial key)

Wouldn’t RasterSourceRDD.tiledLayerRDD produce 6 bands ?

and now if we reduce these 6 bands by

val mtlRDD = multiBandTileLayerRDD.withContext(_.mapValues(_.bands.reduce(_ merge _)))

it feels incorrect as we have merged all these bands

generally I would expect that RasterSourceRDD.tiledLayerRDD to return with same number (or actually the max number of bands across all the geotiffs) of bands and the merging to happen in the respective band

In terms of code, this is what I mean rewrote this section

    val tiledRDD: RDD[(K, MultibandTile)] =
      rasterRegionRDD
        .groupByKey(partitioner.getOrElse(SpatialPartitioner[K](partitionCount)))
        .mapValues(iter => {
          val multiBandTileIterable = iter.flatMap(_.raster.toSeq.map(_.tile))

          multiBandTileIterable.foldLeft(null: MultibandTile)((acc, mbt) => {
            if (acc == null)
              mbt
            else {
              val tileList: ListBuffer[Tile] = scala.collection.mutable.ListBuffer()
              for (i <- 0 until Math.max(acc.bandCount, mbt.bandCount)) {
                val mergedO =
                  for {
                    x <- acc.bandSafe(i)
                    y <- mbt.bandSafe(i)
                  } yield x.merge(y)
                tileList.append(mergedO.get) //safe to do .get here
              }
              MultibandTile(tileList)
            }
          })
        })
pomadchin commented 2 years ago

@SharangC96 yea, you're right, that can be more in case there are more overlaps.

In case the source rasters are multibands this code also converts imagery into hard to manage multiband multichunked rasters; that's why I talked about your particular case in the comment above.

Thanks for mocking the idea, mb I'll create a quick PR for you to test it out; but indeed, we need merging rasters correctly taking into account bands in the multiband case.

That is still possible to merge approprietly even with the current method, though that is def a workaround:

// all raster sources should always have the same bandCount
val bandCount = rasterSources.head.bandCount
val mtlRDD = multiBandTileLayerRDD.withContext { _.mapValues { mbt =>
  mbt.bands.grouped(bandCount).map(MultibandTile(_)).reduce(_ merge _)
} }
pomadchin commented 2 years ago

I think I remembered, we indeed did that after some discussion, since merging is also not what everyone would like to do. Let's keep it as is, and it is more about the API echancement, from my perspective this should be configurable i.e. via a separate argument passed into the tileToLayout function, which will change the way we produce outputs here.

SharangC96 commented 2 years ago

Yes, I agree. This behaviour should be configurable.

In the current form it is quite problematic though.

pomadchin commented 2 years ago

@SharangC96 🚀 If you have time will to look into the API I'll try to help you with the PR; I'll try to allocate some time this sprint on this issue as well to get it into the upcoming release; this should be resolved 🤦

SharangC96 commented 2 years ago

@pomadchin I'll be happy to do that, this would be my first open source contribution :)