Important fix proposal: Implement setting "nan" nodata value for floating point rasters with no nodata attribute

opendatacube / datacube-core

Open Data Cube analyses continental scale Earth Observation data through time

http://www.opendatacube.org

Apache License 2.0

493 stars 175 forks source link

Important fix proposal: Implement setting "nan" nodata value for floating point rasters with no nodata attribute #1595

Open robbibt opened 4 weeks ago

robbibt commented 4 weeks ago

Hi all, as discussed internally within DEA, we have recently encountered an issue where floating point rasters are not displayed correctly in ESRI software due to the lack of an explicitly set nodata attribute of "nan". This differs from open software such as QGIS and GDAL, which correctly interpret nodata by implicitly treating a missing nodata attribute as equivalent to "nan" (edit: this turns out to not be the case: GDAL overview generation is also affected by this issue).

While this issue has been resolved by patching previously generated data on prod S3, a more sustainable long-term fix would be to set nodata attributes in our GeoTIFF writing code, to ensure that any floating point raster without a custom nodata attribute is assigned a nodata attribute of "nan".

The datacube.utils.cog.write_cog and datacube.utils.cog.to_cog functions are important places to fix this, as they are used in the generation of almost all ODC products. A possible fix would likely involve a minor change similar to psuedocode below somewhere here:

if (raster is floating point dtype) AND (no nodata attribute set):
    raster.nodata = "nan"

The alternative to fixing it in GeoTIFF writing code would be to make sure all downstream product worklows manually set nodata values. However, in my opinion this would be difficult to document and would be highly likely to be missed in the future, leading to more expensive remediation. Fixing it at the source seems like a better option, especially as extensive testing has revealed no downsteam impacts of the change while successfully solve the issue for ESRI users (a large proportion of our user base).

SpacemanPaul commented 4 weeks ago

NB. In 1.9, these functions should be reimplemented to use odc.geo.xr.write_cog and odc.geo.xr.to_cog respectively.

(Do these odc-geo functions already have this functionality?)

robbibt commented 2 weeks ago

Update: turns out that a missing nodata attribute on floating point rasters also impacts gdal too - e.g. overview generation is messed up if a floating point raster contains nodata pixels but does not have a nodata attribute set (specifically, it looks like any overview pixel containing at least one higher res nodata pixel is set to nan if the data is missing a nodata attribute).

Nodata attribute set to nodata=nan: cog_resampling_nan

Missing nodata attribute: cog_resampling_none

import rioxarray
import odc.geo.xr

# Load data and remove fill value
ds = rioxarray.open_rasterio(
    "https://data.dea.ga.gov.au/derivative/ga_s2ls_intertidal_cyear_3/1-0-0/x101/y148/2022--P1Y/ga_s2ls_intertidal_cyear_3_x101y148_2022--P1Y_final_elevation.tif"
)
del ds.attrs["_FillValue"]

# Export raster with average resampling, no nodata attribute
ds.odc.write_cog(
    "cog_resampling_none.tif",
    overwrite=True,
    overview_levels=[2, 4, 8, 16, 32],
    overview_resampling="AVERAGE",
)

# Export raster with average resampling, nodata=nan
ds.attrs["nodata"] = np.nan
ds.odc.write_cog(
    "cog_resampling_nan.tif",
    overwrite=True,
    overview_levels=[2, 4, 8, 16, 32],
    overview_resampling="AVERAGE",
)

robbibt commented 2 weeks ago

NB. In 1.9, these functions should be reimplemented to use odc.geo.xr.write_cog and odc.geo.xr.to_cog respectively.

(Do these odc-geo functions already have this functionality?)

Good call - it would be good to fix this in odc-geo too given that will be built into datacube in the future. Have just confirmed that odc-geo's .odc.write_cog() function doesn't set nodata=nan for floating point rasters with no nodata flag either.

@Kirill888 See above regarding incompatibility issues with both ESRI and gdal caused by the way we currently write out floating point COGs that are missing a nodata attribute. Would you be comfortable with this proposed fix (setting nodata=nan for floating point rasters that are missing a nodata flag) being added to the COG writing funcs in odc-geo itself? This is an issue that is having widespread impacts across our data in DEA, so we're keen to get a solution as high upstream as we can if possible.

Kirill888 commented 2 weeks ago

@Kirill888 See above regarding incompatibility issues with both ESRI and gdal caused by the way we currently write out floating point COGs that are missing a nodata attribute. Would you be comfortable with this proposed fix (setting nodata=nan for floating point rasters that are missing a nodata flag) being added to the COG writing funcs in odc-geo itself? This is an issue that is having widespread impacts across our data in DEA, so we're keen to get a solution as high upstream as we can if possible.

@robbibt Not clear to me which parts are or aren't working. If nodata=nan is set on the xarray it should also be set in the output of COG, if nodata attributes is not set, then it should not appear in the output.

If your proposal is to inject nodata=nan for floating point pixel output even when nodata attribute is missing on the xarray, then I'm against that, as this will make it impossible to create COGs without nodata tag. If there is software that breaks because that field is missing there will be software that will fail due to nodata=nan being set. @robbibt of the two outputs which one do you consider as a more broken one? Also not sure what they show.

Difference in overviews that are generated based on whether nodata=nan is set or not is more interesting, common error here would be to perform straight comparison with nodata value, but of course that breaks down when nodata=nan, cause NOTHING will be equal to nodata value if a simple == is used. My guess would be that nodata=nan generates more brokenness inside GDAL than just relying on nan handling. That aspect needs investigating, it will also depend a lot on resampling method used, as these are most likely completely separate code-paths inside gdal.

robbibt commented 2 weeks ago

Hey @Kirill888, the original issue we encountered is that floating point COGs without a NODATA value flag are incompatible with ESRI software (ArcMap, ArcGIS Pro), which is unable to recognise those pixels as nodata. While this is just one software suite, for better or worse ESRI is the GIS field leader with an extremely large user base - this means that rasters generated natively using ODC tooling are inaccessible to a very large proportion of possible users.

We had previous believed that setting nodata=nan had no impact at all on GDAL-based software (e.g. testing has shown data with nodata=nan loads correctly in datacube, rasterio, datacube-ows, QGIS etc). However, it appears that without nodata=nan, GDAL does not correctly recognise NODATA pixels during overview generation either.

I'll generate a nicer example next week, but essentially the examples above have a narrow strip of valid data along the coastline, and NODATA values elsewhere. When nodata=nan is configured, overview values are correctly calculated based on non-NODATA pixels according to GDAL's expected functionality (e.g. "average computes the average of all non-NODATA contributing pixels"). However, without nodata=nan, overview pixels become NODATA if any contributing pixels are NODATA - in contravention of GDAL's documented functionality.

SpacemanPaul commented 2 weeks ago

Float bands are an already a weird special case in that nan is effectively a nodata value even if it is not explicitly specified as such in metadata:

because of the limitations of floating point comparisons any non-nan nodata value would have to be chosen VERY carefully (with a detailed understanding IEEE-754 in mind); and
It is not clear what nan could mean OTHER than "nodata".

Given our reliance on GDAL, I think a workaround in either datacube-core, odc-geo or both is reasonable until/unless we can get it addressed in GDAL. This could be either an injection of nodata=nan or raising a warning.

Kirill888 commented 2 weeks ago

the original issue we encountered is that floating point COGs without a NODATA value flag are incompatible with ESRI software (ArcMap, ArcGIS Pro), which is unable to recognise those pixels as nodata. While this is just one software suite, for better or worse ESRI is the GIS field leader with an extremely large user base - this means that rasters generated natively using ODC tooling are inaccessible to a very large proportion of possible users.

Ok, fair enough. The question is then should we set nodata=nan by default when creating float32 arrays with nan based masking also? Will that break output to netcdf or zarr with xarray. From the "semantic" perspective nodata=nan does not make sense to me. To me nodata is a mechanism for modelling nan when using integer data types. What's more, nan is not even a single value, there are a whole family of bit patterns all of which mean nan in IEEE-754.

I guess we could have some kind of "manual" mode for COG creation where we do not inject any extra attributes/tags, but by default we add all the extra tags needed to make interfacing with external tools easier. We also still need to harmonize dask vs non-dask cog generation interface.

SpacemanPaul commented 2 weeks ago

The question is then should we set nodata=nan by default when creating float32 arrays with nan based masking also?

I think not. This is a workaround for (separate but similar) bugs in

a) the way ESRI reads from COGs and b) the way GDAL generates COG overviews.

robbibt commented 2 weeks ago

Here's a clearer example of the GDAL incompatibility:

Example float32 array with mostly NODATA on the left and values of 1 on the right, except with a narrow strip of 1s and a narrow strip of NODATA:

from odc.geo.geobox import GeoBox
from odc.geo.xr import xr_zeros
gbox = GeoBox.from_bbox(
   (-2_000_000, -5_000_000,
   2_250_000, -1_000_000),
   "epsg:3577", shape=(101, 101))

da = xr_zeros(gbox, dtype='float32') + 1
da.data[:, 0:50] = np.nan
da.data[:, 25] = 1
da.data[:, 75] = np.nan

COG exported with "average" resampling and nodata=nan: this currectly follows GDAL's documented resampling functionality (i.e. ignoring NODATA pixels in the calculation).

# Export raster with average resampling, nodata=nan
da.attrs["nodata"] = np.nan
da.odc.write_cog(
    "cog_resampling_nan.tif",
    overwrite=True,
    overview_levels=[2, 4, 8, 16, 32],
    overview_resampling="AVERAGE",
)

COG exported with "average" resampling and no nodata attribute: nodata pixels have been incorrectly "inflated", with any overview pixel containing even one NODATA pixel now set to NODATA (conflicting with GDAL's expected functionality).

# Export raster with average resampling, no nodata attribute
da.odc.write_cog(
    "cog_resampling_none.tif",
    overwrite=True,
    overview_levels=[2, 4, 8, 16, 32],
    overview_resampling="AVERAGE",
)

So for better or worse, both ESRI and GDAL need NODATA to be explicitly defined as nodata=nan in the COG in order to correctly recognise pixels as NODATA.

I guess we could have some kind of "manual" mode for COG creation where we do not inject any extra attributes/tags, but by default we add all the extra tags needed to make interfacing with external tools easier.

I think that would be perfect - the fact that the lack of nodata=nan causes issues in both GDAL and ESRI justifies applying it as a pragmatic default, but a manual mode that would let users turn this off would be important to include.

robbibt commented 2 weeks ago

The question is then should we set nodata=nan by default when creating float32 arrays with nan based masking also?

I think not. This is a workaround for (separate but similar) bugs in

a) the way ESRI reads from COGs and b) the way GDAL generates COG overviews.

I'm not so sure - if our code is generating an empty float array populated by nan, then I think setting the nodata attribute to nodata=nan probably make logical sense too (due to our datacube product definitions, many of the COGs that are currently missing nodata tags on disk actually had xarray's nodata attribute set to nan at load time anyway, and we've never noticed any xarray/NetCDF compatibility issues caused by that).

But I agree that implementing a fix/sensible default in the COG writing code is the top priority.

Kirill888 commented 2 weeks ago

to make things even more annoying, nan is not a valid JSON value, so it needs to be stored as a string in the database metadata blob, not sure what zarr does (zarr is basically a JSON-based standard for describing arrays composed of chunks)

Kirill888 commented 2 weeks ago

this is what zarr produces:

{
    "_ARRAY_DIMENSIONS": [
        "y",
        "x"
    ],
    "coordinates": "lon lat",
    "long_name": "Surface Elevation",
    "nodata": NaN,
    "units": "m"
}

you can see that github highlighting is not happy with NaN in JSON. Reading back that zarr gives no issues however, at least in Python.

Kirill888 commented 2 weeks ago

Related issue in Zarr from a while back (2019)

https://github.com/zarr-developers/zarr-python/issues/412

With a bunch of non-python implementations, or non-python storage backends having issues with zarr files as produced above.

In light of that, probably best to keep nodata=nan out of xarray attributes, or make sure it's stored as "nan" string instead (.odc.nodata should be able to convert str->float at access time, maybe already does). We should keep treating lack of nodata on float data to be equivalent to nodata=nan. Also it would be good to support "confirmed to be absent" flag of sorts for nodata, to distinguish from "nodata is not set, there are probably nans there". Maybe having xx.attrs.get("nodata", ()) is None mean different thing from "nodata" not in xx.attrs (currently the two are equivalent).

SpacemanPaul commented 1 week ago

Reopening because we still need to update the 1.9 branch to use odc-geo for COG generation.