Open robbibt opened 4 weeks ago
NB. In 1.9, these functions should be reimplemented to use odc.geo.xr.write_cog
and odc.geo.xr.to_cog
respectively.
(Do these odc-geo functions already have this functionality?)
Update: turns out that a missing nodata attribute on floating point rasters also impacts gdal
too - e.g. overview generation is messed up if a floating point raster contains nodata pixels but does not have a nodata attribute set (specifically, it looks like any overview pixel containing at least one higher res nodata pixel is set to nan
if the data is missing a nodata attribute).
Nodata attribute set to nodata=nan
:
Missing nodata attribute:
import rioxarray
import odc.geo.xr
# Load data and remove fill value
ds = rioxarray.open_rasterio(
"https://data.dea.ga.gov.au/derivative/ga_s2ls_intertidal_cyear_3/1-0-0/x101/y148/2022--P1Y/ga_s2ls_intertidal_cyear_3_x101y148_2022--P1Y_final_elevation.tif"
)
del ds.attrs["_FillValue"]
# Export raster with average resampling, no nodata attribute
ds.odc.write_cog(
"cog_resampling_none.tif",
overwrite=True,
overview_levels=[2, 4, 8, 16, 32],
overview_resampling="AVERAGE",
)
# Export raster with average resampling, nodata=nan
ds.attrs["nodata"] = np.nan
ds.odc.write_cog(
"cog_resampling_nan.tif",
overwrite=True,
overview_levels=[2, 4, 8, 16, 32],
overview_resampling="AVERAGE",
)
NB. In 1.9, these functions should be reimplemented to use
odc.geo.xr.write_cog
andodc.geo.xr.to_cog
respectively.(Do these odc-geo functions already have this functionality?)
Good call - it would be good to fix this in odc-geo
too given that will be built into datacube
in the future. Have just confirmed that odc-geo
's .odc.write_cog()
function doesn't set nodata=nan
for floating point rasters with no nodata flag either.
@Kirill888 See above regarding incompatibility issues with both ESRI and gdal
caused by the way we currently write out floating point COGs that are missing a nodata attribute. Would you be comfortable with this proposed fix (setting nodata=nan
for floating point rasters that are missing a nodata flag) being added to the COG writing funcs in odc-geo
itself? This is an issue that is having widespread impacts across our data in DEA, so we're keen to get a solution as high upstream as we can if possible.
@Kirill888 See above regarding incompatibility issues with both ESRI and
gdal
caused by the way we currently write out floating point COGs that are missing a nodata attribute. Would you be comfortable with this proposed fix (settingnodata=nan
for floating point rasters that are missing a nodata flag) being added to the COG writing funcs inodc-geo
itself? This is an issue that is having widespread impacts across our data in DEA, so we're keen to get a solution as high upstream as we can if possible.
@robbibt Not clear to me which parts are or aren't working. If nodata=nan
is set on the xarray
it should also be set in the output of COG, if nodata
attributes is not set, then it should not appear in the output.
If your proposal is to inject nodata=nan
for floating point pixel output even when nodata
attribute is missing on the xarray
, then I'm against that, as this will make it impossible to create COGs without nodata
tag. If there is software that breaks because that field is missing there will be software that will fail due to nodata=nan
being set. @robbibt of the two outputs which one do you consider as a more broken one? Also not sure what they show.
Difference in overviews that are generated based on whether nodata=nan
is set or not is more interesting, common error here would be to perform straight comparison with nodata
value, but of course that breaks down when nodata=nan
, cause NOTHING will be equal to nodata value if a simple ==
is used. My guess would be that nodata=nan
generates more brokenness inside GDAL than just relying on nan
handling. That aspect needs investigating, it will also depend a lot on resampling method used, as these are most likely completely separate code-paths inside gdal.
Hey @Kirill888, the original issue we encountered is that floating point COGs without a NODATA value flag are incompatible with ESRI software (ArcMap, ArcGIS Pro), which is unable to recognise those pixels as nodata. While this is just one software suite, for better or worse ESRI is the GIS field leader with an extremely large user base - this means that rasters generated natively using ODC tooling are inaccessible to a very large proportion of possible users.
We had previous believed that setting nodata=nan
had no impact at all on GDAL-based software (e.g. testing has shown data with nodata=nan
loads correctly in datacube
, rasterio
, datacube-ows
, QGIS etc). However, it appears that without nodata=nan
, GDAL does not correctly recognise NODATA pixels during overview generation either.
I'll generate a nicer example next week, but essentially the examples above have a narrow strip of valid data along the coastline, and NODATA values elsewhere. When nodata=nan
is configured, overview values are correctly calculated based on non-NODATA pixels according to GDAL's expected functionality (e.g. "average computes the average of all non-NODATA contributing pixels"). However, without nodata=nan
, overview pixels become NODATA if any contributing pixels are NODATA - in contravention of GDAL's documented functionality.
Float bands are an already a weird special case in that nan
is effectively a nodata value even if it is not explicitly specified as such in metadata:
nan
nodata value would have to be chosen VERY carefully (with a detailed understanding IEEE-754 in mind); andnan
could mean OTHER than "nodata".Given our reliance on GDAL, I think a workaround in either datacube-core, odc-geo or both is reasonable until/unless we can get it addressed in GDAL. This could be either an injection of nodata=nan
or raising a warning.
the original issue we encountered is that floating point COGs without a NODATA value flag are incompatible with ESRI software (ArcMap, ArcGIS Pro), which is unable to recognise those pixels as nodata. While this is just one software suite, for better or worse ESRI is the GIS field leader with an extremely large user base - this means that rasters generated natively using ODC tooling are inaccessible to a very large proportion of possible users.
Ok, fair enough. The question is then should we set nodata=nan
by default when creating float32
arrays with nan
based masking also? Will that break output to netcdf or zarr with xarray. From the "semantic" perspective nodata=nan
does not make sense to me. To me nodata
is a mechanism for modelling nan
when using integer data types. What's more, nan
is not even a single value, there are a whole family of bit patterns all of which mean nan
in IEEE-754.
I guess we could have some kind of "manual" mode for COG creation where we do not inject any extra attributes/tags, but by default we add all the extra tags needed to make interfacing with external tools easier. We also still need to harmonize dask vs non-dask cog generation interface.
The question is then should we set nodata=nan by default when creating float32 arrays with nan based masking also?
I think not. This is a workaround for (separate but similar) bugs in
a) the way ESRI reads from COGs and b) the way GDAL generates COG overviews.
Here's a clearer example of the GDAL incompatibility:
Example float32
array with mostly NODATA on the left and values of 1 on the right, except with a narrow strip of 1s and a narrow strip of NODATA:
from odc.geo.geobox import GeoBox
from odc.geo.xr import xr_zeros
gbox = GeoBox.from_bbox(
(-2_000_000, -5_000_000,
2_250_000, -1_000_000),
"epsg:3577", shape=(101, 101))
da = xr_zeros(gbox, dtype='float32') + 1
da.data[:, 0:50] = np.nan
da.data[:, 25] = 1
da.data[:, 75] = np.nan
COG exported with "average" resampling and nodata=nan
: this currectly follows GDAL's documented resampling functionality (i.e. ignoring NODATA pixels in the calculation).
# Export raster with average resampling, nodata=nan
da.attrs["nodata"] = np.nan
da.odc.write_cog(
"cog_resampling_nan.tif",
overwrite=True,
overview_levels=[2, 4, 8, 16, 32],
overview_resampling="AVERAGE",
)
COG exported with "average" resampling and no nodata attribute: nodata pixels have been incorrectly "inflated", with any overview pixel containing even one NODATA pixel now set to NODATA (conflicting with GDAL's expected functionality).
# Export raster with average resampling, no nodata attribute
da.odc.write_cog(
"cog_resampling_none.tif",
overwrite=True,
overview_levels=[2, 4, 8, 16, 32],
overview_resampling="AVERAGE",
)
So for better or worse, both ESRI and GDAL need NODATA to be explicitly defined as nodata=nan
in the COG in order to correctly recognise pixels as NODATA.
I guess we could have some kind of "manual" mode for COG creation where we do not inject any extra attributes/tags, but by default we add all the extra tags needed to make interfacing with external tools easier.
I think that would be perfect - the fact that the lack of nodata=nan
causes issues in both GDAL and ESRI justifies applying it as a pragmatic default, but a manual mode that would let users turn this off would be important to include.
The question is then should we set nodata=nan by default when creating float32 arrays with nan based masking also?
I think not. This is a workaround for (separate but similar) bugs in
a) the way ESRI reads from COGs and b) the way GDAL generates COG overviews.
I'm not so sure - if our code is generating an empty float
array populated by nan
, then I think setting the nodata
attribute to nodata=nan
probably make logical sense too (due to our datacube product definitions, many of the COGs that are currently missing nodata
tags on disk actually had xarray's nodata
attribute set to nan
at load time anyway, and we've never noticed any xarray/NetCDF compatibility issues caused by that).
But I agree that implementing a fix/sensible default in the COG writing code is the top priority.
to make things even more annoying, nan
is not a valid JSON value, so it needs to be stored as a string in the database metadata blob, not sure what zarr does (zarr is basically a JSON-based standard for describing arrays composed of chunks)
this is what zarr produces:
{
"_ARRAY_DIMENSIONS": [
"y",
"x"
],
"coordinates": "lon lat",
"long_name": "Surface Elevation",
"nodata": NaN,
"units": "m"
}
you can see that github highlighting is not happy with NaN
in JSON. Reading back that zarr gives no issues however, at least in Python.
Related issue in Zarr from a while back (2019)
https://github.com/zarr-developers/zarr-python/issues/412
With a bunch of non-python implementations, or non-python storage backends having issues with zarr files as produced above.
In light of that, probably best to keep nodata=nan
out of xarray attributes, or make sure it's stored as "nan"
string instead (.odc.nodata
should be able to convert str->float
at access time, maybe already does). We should keep treating lack of nodata
on float
data to be equivalent to nodata=nan
. Also it would be good to support "confirmed to be absent" flag of sorts for nodata
, to distinguish from "nodata is not set, there are probably nans
there". Maybe having xx.attrs.get("nodata", ()) is None
mean different thing from "nodata" not in xx.attrs
(currently the two are equivalent).
Reopening because we still need to update the 1.9 branch to use odc-geo for COG generation.
Hi all, as discussed internally within DEA, we have recently encountered an issue where floating point rasters are not displayed correctly in ESRI software due to the lack of an explicitly set nodata attribute of "nan". This differs from open software such as QGIS and GDAL, which correctly interpret nodata by implicitly treating a missing nodata attribute as equivalent to "nan" (edit: this turns out to not be the case: GDAL overview generation is also affected by this issue).
While this issue has been resolved by patching previously generated data on prod S3, a more sustainable long-term fix would be to set nodata attributes in our GeoTIFF writing code, to ensure that any floating point raster without a custom nodata attribute is assigned a nodata attribute of "nan".
The
datacube.utils.cog.write_cog
anddatacube.utils.cog.to_cog
functions are important places to fix this, as they are used in the generation of almost all ODC products. A possible fix would likely involve a minor change similar to psuedocode below somewhere here:The alternative to fixing it in GeoTIFF writing code would be to make sure all downstream product worklows manually set nodata values. However, in my opinion this would be difficult to document and would be highly likely to be missed in the future, leading to more expensive remediation. Fixing it at the source seems like a better option, especially as extensive testing has revealed no downsteam impacts of the change while successfully solve the issue for ESRI users (a large proportion of our user base).