locationtech / rasterframes

Geospatial Raster support for Spark DataFrames
http://rasterframes.io
Apache License 2.0

SetNoDataValue performance regression #568

Closed pomadchin closed 3 years ago

pomadchin commented 3 years ago

While preparing #567 (the notebook) I noticed the following:

```python
# any dataframe with rasters
rs = spark.read.raster(assets.limit(1), tile_dimensions=(512, 512), buffer_size=2, catalog_col_names=["band"])

# set the Landsat 8 NoData value to zero
rsnd = rs.select(rf_with_no_data(rs.band, 0).alias("band"))

# save a hillshade raster to disk as a TIFF (NoData set via rf_with_no_data)
rsnd \
  .limit(1) \
  .select(rf_hillshade(rsnd.band, azimuth=315, altitude=45, z_factor=1, target="data")) \
  .write.geotiff("lc8-hillshade.tiff", "EPSG:32718")

# save a hillshade raster to disk as a TIFF (without NoData set)
rs \
  .limit(1) \
  .select(rf_hillshade(rs.band, azimuth=315, altitude=45, z_factor=1)) \
  .write.geotiff("lc8-hillshade-all.tiff", "EPSG:32718")
```

Using rf_with_no_data makes the computation roughly 600x slower than running it without.
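A toy cost model of what turned out to be happening (illustrative plain Python, not RasterFrames internals): setting a NoData value forces every tile through a deserialize, rewrite, serialize round trip, so the cost scales with the total tile count rather than with the one tile the limit eventually keeps.

```python
import pickle

roundtrips = 0

def with_no_data(tile_bytes, nodata):
    """Deserialize the tile, stamp the NoData value, serialize it again."""
    global roundtrips
    roundtrips += 1
    cells = pickle.loads(tile_bytes)
    rewritten = [[nodata if v is None else v for v in row] for row in cells]
    return pickle.dumps(rewritten)

# ~300 serialized "tiles" (2x2 grids with a missing cell)
tiles = [pickle.dumps([[None, 1], [2, 3]]) for _ in range(300)]

# NoData is applied to every tile first; only then does the limit keep one:
kept = [with_no_data(t, 0) for t in tiles][:1]
print(roundtrips)  # 300 round trips for a single output tile
```

The round-trip counter makes the asymmetry visible: one output tile, three hundred serialization cycles.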

rf_with_no_data:

[screenshot: Spark job timings with rf_with_no_data]

vanilla:

[screenshot: Spark job timings without rf_with_no_data]

pomadchin commented 3 years ago

Ah, that is because in this case the NoData value was applied to ~300 tiles, which forced deserializing every tile, setting the NoData value, and then serializing all of them again. 🤦
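In the same toy model (again illustrative names, not the RasterFrames API), taking the limit before the NoData rewrite means only the surviving tile pays the round trip:

```python
import pickle

roundtrips = 0

def with_no_data(tile_bytes, nodata):
    """Deserialize the tile, stamp the NoData value, serialize it again."""
    global roundtrips
    roundtrips += 1
    cells = pickle.loads(tile_bytes)
    return pickle.dumps([[nodata if v is None else v for v in row] for row in cells])

tiles = [pickle.dumps([[None, 1], [2, 3]]) for _ in range(300)]

# limit first, then set NoData only on what survives
limited = tiles[:1]
out = [with_no_data(t, 0) for t in limited]
print(roundtrips)  # 1 round trip instead of ~300
```

Reordering the pipeline this way is the cheap fix; the expensive per-tile rewrite is unavoidable once NoData is set across the whole column.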

Results with the limits set correctly (the two top rows (13, 12) use rf_with_no_data, the two bottom rows (11, 10) do not):

[screenshot: Spark job timings for runs 10-13]

Closing it for now.