Open m-mohr opened 2 years ago
One issue that might occur is defining > 0: you'd then need something like [0.000000000000000000000000000000000000000001, null].
So an alternative would be to allow a minimal subset of JSON Schema (minimum, maximum, exclusiveMinimum, exclusiveMaximum) and allow an object instead of an array, e.g. for > 0:
{
"exclusiveMinimum": 0
}
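As a rough illustration of how a client might evaluate such an object, here is a minimal Python sketch. The function name `in_range` is hypothetical and not part of any extension; it only handles the four keywords named above and assumes the numeric (post-draft-4) form of the exclusive bounds.

```python
def in_range(value, schema):
    """Check a number against the four numeric-bound keywords only."""
    if "minimum" in schema and value < schema["minimum"]:
        return False
    if "maximum" in schema and value > schema["maximum"]:
        return False
    if "exclusiveMinimum" in schema and value <= schema["exclusiveMinimum"]:
        return False
    if "exclusiveMaximum" in schema and value >= schema["exclusiveMaximum"]:
        return False
    return True


positive = {"exclusiveMinimum": 0}
print(in_range(0, positive))      # False
print(in_range(1e-42, positive))  # True
```

With this shape, "> 0" needs no tiny epsilon value as the lower bound.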
would it be terrible to be super explicit like
"classification:classes": [
{
"value": -1,
"name": "missing-value",
"description": "Missing value (no-data)",
"nodata": true
},
{
"values": [-3, 1, 7],
"name": "no-coverage",
"description": "No coverage (no-data)",
"nodata": true
},
{
"range": [0, null], # or json schema object...
"name": "data",
"description": "Actual data values in mm"
}
],
Yes, I think it is terrible ;-) How would you decide whether [1,2] is a range from 1 to 2 or the two categorical values 1 and 2?
that's why the keys are explicit: value, values, and range
Ooooh, I didn't catch that difference. Sorry. I don't think that is necessary, it is more complicated to describe and read but doesn't give any obvious benefit to me?
I just don't like that [1, null] is a magic range while [1, 255] is ambiguous as a range or a list of values.
I do really like ranges that are json schema objects.
and of course I still don't like putting ranges into classes ;), but I want to at least get somewhere with the concept.
I did not consider lists of values in my proposal yet because you can emulate them just by having the classes multiple times, while you can't reasonably express continuous ranges. But yeah, the full-fledged solution would be:
- integer: single categorical value
- array of integers: multiple categorical values
- json schema like object: continuous ranges
Not sure whether we should cater for all; the use cases I've heard about so far were only continuous ranges.
Alternatively, we could try something like the following although I'm not sure that would be valid in raster as we give "made-up" statistics / exclude no-data values from statistics. It also feels less intuitive.
"raster:bands": [
{
"unit": "mm",
"data_type": "float64",
"statistics": {
"minimum": 0
},
"classification:incomplete": true,
"classification:classes": [
{
"value": -1.0,
"name": "missing-value",
"description": "Missing value (no-data)",
"nodata": true
},
{
"value": -3.0,
"name": "no-coverage",
"description": "No coverage (no-data)",
"nodata": true
}
]
}
],
Thoughts, @emmanuelmathot ?
Any opinion on preferring
"maximum": 10.5,
"exclusiveMaximum": true
versus
"exclusiveMaximum": 10.5
I'd follow JSON Schema as we already use it in other places. Except for the outdated draft-4, it uses numbers instead of boolean flags: https://json-schema.org/understanding-json-schema/reference/numeric.html#range
ok, didn't realize I was looking at an older draft 👍
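For anyone who still has draft-4 style bounds around, the two forms can be reconciled mechanically. This is only a sketch; `normalize_bounds` is a made-up helper name, not something from JSON Schema tooling or this extension.

```python
def normalize_bounds(schema):
    """Convert draft-4 boolean exclusive flags to the numeric form of later drafts."""
    out = dict(schema)
    pairs = (("minimum", "exclusiveMinimum"), ("maximum", "exclusiveMaximum"))
    for bound, exclusive in pairs:
        flag = out.get(exclusive)
        if isinstance(flag, bool):
            # Draft-4 style: drop the flag, and if it was true
            # move the bound over to the exclusive keyword.
            del out[exclusive]
            if flag and bound in out:
                out[exclusive] = out.pop(bound)
    return out


print(normalize_bounds({"maximum": 10.5, "exclusiveMaximum": True}))
# {'exclusiveMaximum': 10.5}
```

Objects that already use the numeric form pass through unchanged.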
"classification:incomplete": true,
is interesting to me because saying
"range": [0, null], # or json schema object...
"name": "data",
"description": "Actual data values in mm"
seems redundant or out of place when the dataset describes rainfall in mm and a negative depth doesn't make sense.
Yeah, I'm liking it more the more I think about it, but it's less flexible and covers only some use cases, I assume. Also, it doesn't seem so wrong to exclude no-data values from statistics because they are usually just made-up values for a file format that doesn't support encoding them properly. I guess we only need to clarify in raster that no-data values are invalid pixel values and should not be reflected in statistics etc. On the other hand, statistics are usually real min/max values while what we want to describe here are theoretical min and max values. For example, if you have a raster with precipitation values, the pixel values could be 1, 5 and 10, so min/max are 1 and 10, although the potential range is 0 to infinity (mostly). But maybe that's not an issue?!
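To make the difference concrete, here is a small Python sketch with made-up pixel values. The no-data values -1 and -3 are taken from the example classes above; everything else is invented for illustration.

```python
NODATA = {-1.0, -3.0}  # no-data values from the example classes above
pixels = [1.0, -1.0, 5.0, -3.0, 10.0]  # made-up sample band

# Statistics over valid pixels only (no-data excluded).
valid = [p for p in pixels if p not in NODATA]
observed = {"minimum": min(valid), "maximum": max(valid)}

# Statistics over all pixels, no-data included.
all_pixels = {"minimum": min(pixels), "maximum": max(pixels)}

print(observed)    # {'minimum': 1.0, 'maximum': 10.0}
print(all_pixels)  # {'minimum': -3.0, 'maximum': 10.0}
```

Neither result is the theoretical range (0 to infinity), which cannot be derived from the pixels and would have to be declared separately.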
are we really saying that this is a continuous dataset with classed nodata and should have something roughly like:
"nodata": {
"classification:classes": {
... classes
}
}
with something else that clarifies that the range of possible data values does not cover the full range of the data type?
~Hmm, then I still don't have a way to express no-data values and their meanings in STAC. In file it was removed, in raster it got somewhat rejected. I really just want to express -1 is missing value, -3 is no coverage for example. And it seems it would fit in here.~
Sorry, misunderstood you initially. But I'm still not sure; I think I like the proposal above more, because it just adds an additional field here instead of adding a new data type to an existing field. https://github.com/stac-extensions/classification/issues/33#issuecomment-1171498235
yes, the question is more "is classification a good enough home for nodata" versus "nodata can be messy enough to warrant some kind of new extension that can use classification if needed" and I understand not wanting to start another extension...
Well, nodata is already part of raster so would be a change in that extension. But I don't like putting classification:classes into so many different places. Also, if you have no-data values and categorical values in a file, do you really want to have them in two different places?
classification: ¯\_(ツ)_/¯: true
The more I think about it, saying "this dataset uses classes but isn't classified" seems reasonable and simple.
I created PR #34 to discuss a potential solution more closely.
Alternatively, we could try something like the following although I'm not sure that would be valid in raster as we give "made-up" statistics / exclude no-data values from statistics. It also feels less intuitive.
"raster:bands": [
{
"unit": "mm",
"data_type": "float64",
"statistics": {
"minimum": 0
},
"classification:incomplete": true,
"classification:classes": [
{
"value": -1.0,
"name": "missing-value",
"description": "Missing value (no-data)",
"nodata": true
},
{
"value": -3.0,
"name": "no-coverage",
"description": "No coverage (no-data)",
"nodata": true
}
]
}
],
Thoughts, @emmanuelmathot ?
The statistics field represents stats about the distribution of ALL pixels in the band ¯\_(ツ)_/¯ but using it for stats of only VALID PIXELS, and thus to define boundaries, is not strictly forbidden :-). For instance, we use that information to help users select the possible range. In this case, this could be interesting.
I did not consider lists of values in my proposal yet because you can emulate them just by having the classes multiple times, while you can't reasonably express continuous ranges. But yeah, the full-fledged solution would be:
- integer: single categorical value
- array of integers: multiple categorical values
- json schema like object: continuous ranges
Not sure whether we should cater for all, the use cases I've heard about so far were only continuous ranges.
I like the "full-fledged solution". However, even if the array of integers doesn't make it in, I prefer the json schema like object for continuous ranges for its clarity; it also leaves the door open to adding arrays of integers without having to change how continuous ranges are expressed.
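A quick Python sketch of how a client could tell the three shapes apart by type alone, which also sidesteps the "[1, 2] as range or list" ambiguity. The helper names `matches` and `class_name` and the class list are hypothetical; the "data" range is made up so it doesn't overlap the categorical values.

```python
def matches(class_value, pixel):
    """Return True if pixel falls under a class value of any of the three shapes."""
    if isinstance(class_value, dict):  # JSON-Schema-like object: continuous range
        checks = {
            "minimum": lambda b: pixel >= b,
            "maximum": lambda b: pixel <= b,
            "exclusiveMinimum": lambda b: pixel > b,
            "exclusiveMaximum": lambda b: pixel < b,
        }
        return all(checks[k](b) for k, b in class_value.items() if k in checks)
    if isinstance(class_value, list):  # array: multiple categorical values
        return pixel in class_value
    return pixel == class_value  # single categorical value


classes = [
    {"value": -1, "name": "missing-value"},
    {"value": [-3, 1, 7], "name": "no-coverage"},
    {"value": {"exclusiveMinimum": 10}, "name": "data"},
]


def class_name(pixel):
    """Name of the first class that matches the pixel, or None."""
    return next((c["name"] for c in classes if matches(c["value"], pixel)), None)


print(class_name(-1))    # missing-value
print(class_name(7))     # no-coverage
print(class_name(12.5))  # data
print(class_name(3))     # None
```

Because an array always means a list of values and an object always means a range, adding arrays later would not change how ranges are written.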
Mocking up classification:classes for a VIIRS vegetation index band:
{
...
"scale": 0.0001,
"data_type": "int16",
"classification:classes": [
{
"value": -13000,
"name": "fill_land",
"description": "Fill value over land",
"nodata": true
},
{
"value": -15000,
"name": "fill_water",
"description": "Fill value over ocean or fresh water",
"nodata": true
},
{
"value": {
"minimum": -10000,
"maximum": 10000
},
"name": "data",
"description": "Valid range of vegetation index values"
}
],
...
}
Perhaps not necessary, but it is nice to be able to describe the valid range of vegetation index data (a defined subset of the possible int16 values).
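For what it's worth, this is roughly how a client might consume the mock-up above. It is only a sketch: `decode` is a hypothetical helper, the constants are copied from the mock-up, and the "out-of-range" label is my own addition for raw values that are neither fill nor inside the valid range.

```python
SCALE = 0.0001
FILL = {-13000: "fill_land", -15000: "fill_water"}
VALID_MIN, VALID_MAX = -10000, 10000


def decode(raw):
    """Return (scaled_value, label) for a raw int16 pixel."""
    if raw in FILL:
        return None, FILL[raw]
    if VALID_MIN <= raw <= VALID_MAX:
        return raw * SCALE, "data"
    # Hypothetical label, not part of the mock-up.
    return None, "out-of-range"


print(decode(-13000))   # (None, 'fill_land')
print(decode(5000)[1])  # data (scaled value is close to 0.5)
print(decode(20000))    # (None, 'out-of-range')
```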
To me describing the valid range of a continuous dataset has nothing to do with classification. I'm not sure how a client can or should deal with that class when it isn't a class at all.
@drwelby I see your point, I think. I suppose the same argument could be made for any continuous range? Or is it particular to the valid range?
To me the valid range is akin to raster:bits_per_sample and should live there.
Yep, I see the connection to bits_per_sample. In this case, the range doesn't fit cleanly into a set number of bits. But I get your point about it not being a class. I'm not concerned about including this information, so we don't need to take this any further. On the face of it, it seemed like it would make sense to describe the data range since the no-data values are also being described. But if there is no value on the client end, then no point. 🙂
From the STAC call: No one screamed at me when I said "ranges" are not categories. ;-)
I think we can leave this open for further feedback, but I won't push for a change here. If you only want to describe a single class of valid values (e.g. >= 0), then consider using the statistics or histogram in raster:bands.
Here's an example where allowing a range for the Class object value could have been useful:
The cover change values are interpreted as <from class><to class>, e.g., a value of 12 indicates a change from class 1 to class 2. So they could all be mapped to unique categories. But that seems overkill.
@pjhartzell How would you want to expose that exactly? 12-21, 23-32, 34-43, ...? or just 12-87?
For this case, [12-21, 23-32, 34-43] would be ideal. [12-87] would be a fallback if multiple ranges can't be expressed.
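A small Python sketch of the encoding, assuming single-digit class ids so that a change code splits cleanly into two digits. The names `decode_change` and `in_any_range` are hypothetical, and the ranges are just the three quoted above.

```python
def decode_change(value):
    """Split a two-digit change code into (from_class, to_class)."""
    return divmod(value, 10)


# Inclusive ranges quoted in the thread.
RANGES = [(12, 21), (23, 32), (34, 43)]


def in_any_range(value):
    """True if the value falls inside one of the inclusive ranges."""
    return any(lo <= value <= hi for lo, hi in RANGES)


print(decode_change(12))  # (1, 2)
print(in_any_range(22))   # False
print(in_any_range(30))   # True
```

With a single [12-87] fallback, a value like 22 would be reported as a change even though it falls in none of the three ranges.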
It comes up over and over again, the range values. Recently in #31. A common example seems to be something like:
Should we cater for this? I think the simplest solution would be to allow value to be an array with two values, one of which can be null (for an open-ended range), as also defined by the STAC Collection extents. Then you could have something like: