Closed gillins closed 1 year ago
Oh dear.
Well spotted. You are right that we do need to cope with this. That error message will become an exception in the next major GDAL release (when exceptions will default to on), so we need to be on top of it. Thank you, as always.
I feel like this is not quite right, because then the histmax value recorded into the histogram is also wrong(-ish)? I don't remember how that value is defined, so perhaps it doesn't matter. What happens if we record the true histmax, but pass the +0.5 version to GetHistogram?
Why do we always find things like this just after we make a release......
Yep always after a new release, but I don't think there is any harm doing more frequent releases anyway....
There are actually 3 different concepts in the code:
I'm not sure why we allow the last 2 to be different, but I think you are suggesting we fudge histCalcMax but not histmax? I'm not sure if this would be confusing for someone looking at the output of gdalinfo? No strong thoughts though.
Will this calcstats code ever be correct, I wonder.
Yes, that's what I am suggesting, although I do share your uncertainty. I am not sure exactly which/what is visible to the output of gdalinfo, and I was meaning that I don't want whatever is visible to show the +0.5, because that could be confusing. But I dunno, really.
Yes, I am also rather deflated by how many things we keep having to fix in calcstats. It also makes me wonder again what everyone else does - maybe most people are not calculating stats? I dunno.......
OK, further to our discussion offline, I have done some playing around.
The option of reducing the histogram to a single bin seems the best, so far. Not doing something like this makes it very difficult to correctly interpret the histogram.
For completeness, I tested on a 100x100 float32 raster, with the top 10 rows set to a null value of zero, and the rest set to 10. When the format is GTiff, the stats part of the output from gdalinfo is as follows
Min=10.000 Max=10.000
Minimum=10.000, Maximum=10.000, Mean=10.000, StdDev=0.000
NoData Value=0
Metadata:
STATISTICS_EXCLUDEDVALUES=0
STATISTICS_HISTOBINFUNCTION=linear
STATISTICS_HISTOBINVALUES=9000|
STATISTICS_HISTOMAX=10.0
STATISTICS_HISTOMIN=10.0
STATISTICS_HISTONUMBINS=1
STATISTICS_MAXIMUM=10.0
STATISTICS_MEAN=10.0
STATISTICS_MEDIAN=10.0
STATISTICS_MINIMUM=10.0
STATISTICS_MODE=10.0
STATISTICS_SKIPFACTORX=1
STATISTICS_SKIPFACTORY=1
STATISTICS_STDDEV=0.0
showing the single bin, and the histmax and histomin values being equal.
When the format is KEA, this becomes
Min=10.000 Max=10.000
Minimum=10.000, Maximum=10.000, Mean=10.000, StdDev=0.000
NoData Value=0
Metadata:
STATISTICS_EXCLUDEDVALUES=0
STATISTICS_HISTOBINFUNCTION=linear
STATISTICS_HISTOMAX=10
STATISTICS_HISTOMIN=10
STATISTICS_HISTONUMBINS=1
STATISTICS_MAXIMUM=10.0
STATISTICS_MEAN=10.0
STATISTICS_MEDIAN=10.0
STATISTICS_MINIMUM=10.0
STATISTICS_MODE=10.0
STATISTICS_SKIPFACTORX=1
STATISTICS_SKIPFACTORY=1
STATISTICS_STDDEV=0.0
STATISTICS_VALID_PERCENT=90
LAYER_TYPE=athematic
ATTRIBUTETABLE_CHUNKSIZE=1000
<GDALRasterAttributeTable Row0Min="10" BinSize="-nan" tableType="athematic">
<FieldDefn index="0">
<Name>Histogram</Name>
<Type>1</Type>
<Usage>1</Usage>
</FieldDefn>
<Row index="0">
<F>9000</F>
</Row>
</GDALRasterAttributeTable>
Note that it has BinSize as "-nan", which could conceivably be a problem, not sure. Is the calculation in libkea right?
And the HFA version is very similar (does not seem to have an equivalent to BinSize)
Min=10.000 Max=10.000
Minimum=10.000, Maximum=10.000, Mean=10.000, StdDev=0.000
NoData Value=0
Metadata:
LAYER_TYPE=athematic
STATISTICS_EXCLUDEDVALUES=
STATISTICS_HISTOBINFUNCTION=linear
STATISTICS_HISTOBINVALUES=9000|
STATISTICS_HISTOMAX=10
STATISTICS_HISTOMIN=10
STATISTICS_HISTONUMBINS=1
STATISTICS_MAXIMUM=10
STATISTICS_MEAN=10
STATISTICS_MEDIAN=10
STATISTICS_MINIMUM=10
STATISTICS_MODE=10
STATISTICS_SKIPFACTORX=1
STATISTICS_SKIPFACTORY=1
STATISTICS_STDDEV=0
Image Structure Metadata:
COMPRESSION=RLE
<GDALRasterAttributeTable tableType="athematic">
<FieldDefn index="0">
<Name>Histogram</Name>
<Type>1</Type>
<Usage>1</Usage>
</FieldDefn>
<Row index="0">
<F>9000</F>
</Row>
</GDALRasterAttributeTable>
Having the single bin in the histogram, and with the max and min values equal, means that there is no chance (I think?) of calculating incorrect values for bin boundaries. The step size would calculate to zero (because histmax == histmin) and no steps are being taken (because there is only the first bin anyway). So, this seems very safe against mis-interpretation.
I still feel a bit uncertain about this, partly because I don't really know why the change was made in GDAL. I wonder what they were trying to guard against?
What do you think?
I think this is good.
The problem with KEA is probably related to this line: https://github.com/OSGeo/gdal/blob/master/frmts/kea/kearat.cpp#L996. I can't remember why the -1
, I though it could have come from HFA but not sure.... I don't think it matters too much.
I'm happy for this to be merged and see what the fallout is. It's a pretty unusual situation. And see if we can fix this the -nan
with the KEA driver.
re: libkea. Ah yes, I see. You are right, that will be the reason. We should sit down together at some stage and check whether that line is entirely correct. Histograms always seem to be difficult to get exactly right.......
With recent versions of GDAL, if you create a float image that is all the same value you get a:
This used to be a warning, but is now an error. Turns out this is coming from the histogram calculation.
I know this is an unlikely situation in remote sensing, but pops up occasionally in spatial modelling.
The proposed workaround is to make the max 0.5 larger than the min. Not sure if this is the correct approach...
Also amended the docstring for
controls.setStatsIgnore
to tell you how to unset the ignore value.