Workaround for float images that are all the same value

gillins commented 1 year ago

With recent versions of GDAL, if you create a float image that is all the same value you get a:

ERROR 5: /tmp/bob.img, band 1: dfMax should be strictly greater than dfMin

This used to be a warning, but is now an error. Turns out this is coming from the histogram calculation.

I know this is an unlikely situation in remote sensing, but pops up occasionally in spatial modelling.

The proposed workaround is to make the max 0.5 larger than the min. Not sure if this is the correct approach...

Also amended the docstring for controls.setStatsIgnore to tell you how to unset the ignore value.

neilflood commented 1 year ago

Oh dear.

Well spotted. You are right that we do need to cope with this. That error message will become an exception in the next major GDAL release (when exceptions will default to on), so we need to be on top of it. Thank you, as always.

I feel like this is not quite right, because then the histmax value recorded into the histogram is also wrong(-ish)? I don't remember how that value is defined, so perhaps it doesn't matter. What happens if we record the true histmax, but pass the +0.5 version to GetHistogram?

Why do we always find things like this just after we make a release......

gillins commented 1 year ago

Yep always after a new release, but I don't think there is any harm doing more frequent releases anyway....

There are actually 3 different concepts in the code:

minval/maxval - as calculated from the image
histmin/histmax - what is recorded in STATISTICS_HISTOMIN/STATISTICS_HISTOMAX
histCalcMin/histCalcMax what is actually used to calculate the histogram

I'm not sure why we allow the last 2 to be different, but I think you are suggesting we fudge histCalcMax but not histmax? I'm not sure if this would be confusing for someone looking at the output of gdalinfo? No strong thoughts though.

Will this calcstats code ever be correct, I wonder.

neilflood commented 1 year ago

Yes, that's what I am suggesting, although I do share your uncertainty. I am not sure exactly which/what is visible to the output of gdalinfo, and I was meaning that I don't want whatever is visible to show the +0.5, because that could be confusing. But I dunno, really.

Yes, I am also rather deflated by how many things we keep having to fix in calcstats. It also makes me wonder again what everyone else does - maybe most people are not calculating stats? I dunno.......

neilflood commented 1 year ago

OK, further to our discussion offline, I have done some playing around.

The option of reducing the histogram to a single bin seems the best, so far. Not doing something like this makes it very difficult to correctly interpret the histogram.

For completeness, I tested on a 100x100 float32 raster, with the top 10 rows set to a null value of zero, and the rest set to 10. When the format is GTiff, the stats part of the output from gdalinfo is as follows

  Min=10.000 Max=10.000 
  Minimum=10.000, Maximum=10.000, Mean=10.000, StdDev=0.000
  NoData Value=0
  Metadata:
    STATISTICS_EXCLUDEDVALUES=0
    STATISTICS_HISTOBINFUNCTION=linear
    STATISTICS_HISTOBINVALUES=9000|
    STATISTICS_HISTOMAX=10.0
    STATISTICS_HISTOMIN=10.0
    STATISTICS_HISTONUMBINS=1
    STATISTICS_MAXIMUM=10.0
    STATISTICS_MEAN=10.0
    STATISTICS_MEDIAN=10.0
    STATISTICS_MINIMUM=10.0
    STATISTICS_MODE=10.0
    STATISTICS_SKIPFACTORX=1
    STATISTICS_SKIPFACTORY=1
    STATISTICS_STDDEV=0.0

showing the single bin, and the histmax and histomin values being equal.

When the format is KEA, this becomes

  Min=10.000 Max=10.000 
  Minimum=10.000, Maximum=10.000, Mean=10.000, StdDev=0.000
  NoData Value=0
  Metadata:
    STATISTICS_EXCLUDEDVALUES=0
    STATISTICS_HISTOBINFUNCTION=linear
    STATISTICS_HISTOMAX=10
    STATISTICS_HISTOMIN=10
    STATISTICS_HISTONUMBINS=1
    STATISTICS_MAXIMUM=10.0
    STATISTICS_MEAN=10.0
    STATISTICS_MEDIAN=10.0
    STATISTICS_MINIMUM=10.0
    STATISTICS_MODE=10.0
    STATISTICS_SKIPFACTORX=1
    STATISTICS_SKIPFACTORY=1
    STATISTICS_STDDEV=0.0
    STATISTICS_VALID_PERCENT=90
    LAYER_TYPE=athematic
    ATTRIBUTETABLE_CHUNKSIZE=1000
<GDALRasterAttributeTable Row0Min="10" BinSize="-nan" tableType="athematic">
  <FieldDefn index="0">
    <Name>Histogram</Name>
    <Type>1</Type>
    <Usage>1</Usage>
  </FieldDefn>
  <Row index="0">
    <F>9000</F>
  </Row>
</GDALRasterAttributeTable>

Note that it has BinSize as "-nan", which could conceivably be a problem, not sure. Is the calculation in libkea right?

And the HFA version is very similar (does not seem to have an equivalent to BinSize)

  Min=10.000 Max=10.000 
  Minimum=10.000, Maximum=10.000, Mean=10.000, StdDev=0.000
  NoData Value=0
  Metadata:
    LAYER_TYPE=athematic
    STATISTICS_EXCLUDEDVALUES=
    STATISTICS_HISTOBINFUNCTION=linear
    STATISTICS_HISTOBINVALUES=9000|
    STATISTICS_HISTOMAX=10
    STATISTICS_HISTOMIN=10
    STATISTICS_HISTONUMBINS=1
    STATISTICS_MAXIMUM=10
    STATISTICS_MEAN=10
    STATISTICS_MEDIAN=10
    STATISTICS_MINIMUM=10
    STATISTICS_MODE=10
    STATISTICS_SKIPFACTORX=1
    STATISTICS_SKIPFACTORY=1
    STATISTICS_STDDEV=0
  Image Structure Metadata:
    COMPRESSION=RLE
<GDALRasterAttributeTable tableType="athematic">
  <FieldDefn index="0">
    <Name>Histogram</Name>
    <Type>1</Type>
    <Usage>1</Usage>
  </FieldDefn>
  <Row index="0">
    <F>9000</F>
  </Row>
</GDALRasterAttributeTable>

Having the single bin in the histogram, and with the max and min values equal, means that there is no chance (I think?) of calculating incorrect values for bin boundaries. The step size would calculate to zero (because histmax == histmin) and no steps are being taken (because there is only the first bin anyway). So, this seems very safe against mis-interpretation.

I still feel a bit uncertain about this, partly because I don't really know why the change was made in GDAL. I wonder what they were trying to guard against?

What do you think?

gillins commented 1 year ago

I think this is good.

The problem with KEA is probably related to this line: https://github.com/OSGeo/gdal/blob/master/frmts/kea/kearat.cpp#L996. I can't remember why the -1, I though it could have come from HFA but not sure.... I don't think it matters too much.

I'm happy for this to be merged and see what the fallout is. It's a pretty unusual situation. And see if we can fix this the -nan with the KEA driver.

neilflood commented 1 year ago

re: libkea. Ah yes, I see. You are right, that will be the reason. We should sit down together at some stage and check whether that line is entirely correct. Histograms always seem to be difficult to get exactly right.......

ubarsc / rios

Workaround for float images that are all the same value #72