stac-extensions / classification

Describes categorical values and bitfields to give values in a file a certain meaning (classification).
Apache License 2.0

Ranges? #33

Open m-mohr opened 2 years ago

m-mohr commented 2 years ago

It comes up over and over again, the range values. Recently in #31. A common example seems to be something like:

Should we cater for this? I think the simplest solution would be to allow value to be an array with two values, one of which can be null (for an open-ended range), as is also defined for the STAC Collection extents.

Then you could have something like:

{
...
          "unit": "mm",
          "classification:classes": [
            {
              "value": -1,
              "name": "missing-value",
              "description": "Missing value (no-data)",
              "nodata": true
            },
            {
              "value": -3,
              "name": "no-coverage",
              "description": "No coverage (no-data)",
              "nodata": true
            },
            {
              "value": [0, null],
              "name": "data",
              "description": "Actual data values in mm"
            }
          ],
...
}
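
A minimal sketch of how a client might read such a class list. It assumes both bounds of the array are inclusive and that null (None in Python) means open-ended; the helper names `matches` and `class_name` are made up for illustration and are not part of the extension.

```python
from typing import Optional, Sequence, Union

Number = Union[int, float]
ClassValue = Union[Number, Sequence[Optional[Number]]]


def matches(class_value: ClassValue, pixel: Number) -> bool:
    """Return True if `pixel` falls under `class_value`.

    A scalar is compared for equality. A two-element array is read as an
    inclusive [min, max] range (assumption), where None means open-ended.
    """
    if isinstance(class_value, (list, tuple)):
        low, high = class_value
        return (low is None or pixel >= low) and (high is None or pixel <= high)
    return pixel == class_value


# The classes from the example above, reduced to the fields used here.
classes = [
    {"value": -1, "name": "missing-value", "nodata": True},
    {"value": -3, "name": "no-coverage", "nodata": True},
    {"value": [0, None], "name": "data"},
]


def class_name(pixel: Number) -> Optional[str]:
    """Name of the first class matching `pixel`, or None if nothing matches."""
    return next((c["name"] for c in classes if matches(c["value"], pixel)), None)
```

With these classes, -1 resolves to "missing-value", any value from 0 upwards resolves to "data", and -2 matches nothing.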
m-mohr commented 2 years ago

One issue that might occur is defining > 0: with inclusive bounds you'd then need to do something like [0.000000000000000000000000000000000000000001, null]

So an alternative would be to allow a minimal subset of JSON Schema (minimum, maximum, exclusiveMinimum, exclusiveMaximum) and allow an object instead of an array, e.g. for > 0:

{
  "exclusiveMinimum": 0
}
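
A rough sketch of evaluating such an object, assuming the four keywords keep their JSON Schema meaning (with numeric exclusive bounds). The function name `in_range` is hypothetical.

```python
def in_range(spec: dict, x: float) -> bool:
    """Check `x` against a dict using the four JSON Schema range keywords."""
    checks = {
        "minimum": lambda bound: x >= bound,
        "maximum": lambda bound: x <= bound,
        "exclusiveMinimum": lambda bound: x > bound,
        "exclusiveMaximum": lambda bound: x < bound,
    }
    unknown = set(spec) - set(checks)
    if unknown:
        # Only the minimal subset is allowed in this sketch.
        raise ValueError(f"unsupported keywords: {sorted(unknown)}")
    return all(checks[key](bound) for key, bound in spec.items())
```

Here `in_range({"exclusiveMinimum": 0}, 0)` is False while any positive value passes, so no artificial tiny lower bound is needed.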
drwelby commented 2 years ago

would it be terrible to be super explicit like

"classification:classes": [
            {
              "value": -1,
              "name": "missing-value",
              "description": "Missing value (no-data)",
              "nodata": true
            },
            {
              "values": [-3, 1, 7],
              "name": "no-coverage",
              "description": "No coverage (no-data)",
              "nodata": true
            },
            {
              "range": [0, null], # or json schema object...
              "name": "data",
              "description": "Actual data values in mm"
            }
          ],
m-mohr commented 2 years ago

Yes, I think it is terrible ;-) How would you decide whether [1,2] is a range from 1 to 2 or the two categorical values 1 and 2?

drwelby commented 2 years ago

that's why the keys are explicit value, values, and range

m-mohr commented 2 years ago

Ooooh, I didn't catch that difference. Sorry. I don't think that is necessary: it is more complicated to describe and read, and I don't see any obvious benefit.

drwelby commented 2 years ago

I just don't like that [1, null] is a magic range while [1, 255] is ambiguous as a range or a list of values.

I do really like ranges that are json schema objects.

and of course I still don't like putting ranges into classes ;), but I want to at least get somewhere with the concept.

m-mohr commented 2 years ago

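I did not consider lists of values in my proposal yet because you can emulate them just by repeating the class for each value, while you can't reasonably express continuous ranges. But yeah, the full-fledged solution would be:

  • integer: single categorical value
  • array of integers: multiple categorical values
  • json schema like object: continuous ranges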

Not sure whether we should cater for all, the use cases I've heard about so far were only continuous ranges.
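
For illustration only, the three value shapes discussed here (single value, list of values, JSON Schema-like object) could be told apart by type alone. The sketch below assumes that reading; `value_matches` is a made-up name.

```python
def value_matches(value, pixel) -> bool:
    """Dispatch on the shape of a class `value` (hypothetical semantics)."""
    if isinstance(value, dict):
        # JSON Schema-like object: continuous range.
        checks = {
            "minimum": lambda b: pixel >= b,
            "maximum": lambda b: pixel <= b,
            "exclusiveMinimum": lambda b: pixel > b,
            "exclusiveMaximum": lambda b: pixel < b,
        }
        return all(checks[key](bound) for key, bound in value.items())
    if isinstance(value, list):
        # Array: a set of categorical values, never a range.
        return pixel in value
    # Scalar: a single categorical value.
    return pixel == value
```

Under this reading, [1, 255] is unambiguous: it matches only 1 and 255, and a range from 1 to 255 has to be written as an object.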

m-mohr commented 2 years ago

Alternatively, we could try something like the following although I'm not sure that would be valid in raster as we give "made-up" statistics / exclude no-data values from statistics. It also feels less intuitive.


      "raster:bands": [
        {
          "unit": "mm",
          "data_type": "float64",
          "statistics": {
            "minimum": 0
          },
          "classification:incomplete": true,
          "classification:classes": [
            {
              "value": -1.0,
              "name": "missing-value",
              "description": "Missing value (no-data)",
              "nodata": true
            },
            {
              "value": -3.0,
              "name": "no-coverage",
              "description": "No coverage (no-data)",
              "nodata": true
            }
          ]
        }
      ],

Thoughts, @emmanuelmathot ?
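
A sketch of how a client could interpret this, assuming `classification:incomplete` means "values not listed in the classes are ordinary data". The field is only a proposal in this thread, and the function name `describe` is hypothetical.

```python
band = {
    "unit": "mm",
    "data_type": "float64",
    "statistics": {"minimum": 0},
    "classification:incomplete": True,
    "classification:classes": [
        {"value": -1.0, "name": "missing-value", "nodata": True},
        {"value": -3.0, "name": "no-coverage", "nodata": True},
    ],
}


def describe(band: dict, pixel: float):
    """Return the class name for `pixel`, or None for unclassified data.

    Raises ValueError if the class list is meant to be complete and the
    pixel is not covered by any class.
    """
    for cls in band.get("classification:classes", []):
        if cls["value"] == pixel:
            return cls["name"]
    if band.get("classification:incomplete", False):
        return None  # a plain data value, e.g. rainfall in mm
    raise ValueError(f"value {pixel} is not covered by any class")
```

With the band above, -1.0 maps to "missing-value" while 12.3 is simply treated as data.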

drwelby commented 2 years ago

Any opinion on preferring

  "maximum": 10.5,
  "exclusiveMaximum": true

versus

  "exclusiveMaximum": 10.5
m-mohr commented 2 years ago

I'd follow JSON Schema as we already use it in other places. Except for the outdated draft-4, it uses numbers instead of boolean flags: https://json-schema.org/understanding-json-schema/reference/numeric.html#range
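
If older draft-4 style input ever needs to be accepted, it can be converted to the numeric form. This is a minimal sketch; `normalize_bounds` is a hypothetical helper, not an existing API.

```python
def normalize_bounds(spec: dict) -> dict:
    """Convert draft-4 boolean exclusive flags to the numeric form."""
    out = dict(spec)
    pairs = (("minimum", "exclusiveMinimum"), ("maximum", "exclusiveMaximum"))
    for limit, exclusive in pairs:
        flag = out.get(exclusive)
        if flag is True:
            if limit not in out:
                # draft-4 requires the limit keyword next to the flag
                raise ValueError(f"{exclusive}: true requires {limit}")
            out[exclusive] = out.pop(limit)
        elif flag is False:
            del out[exclusive]  # the inclusive limit stays as it is
    return out
```

Both spellings of "less than 10.5" from the question above then end up as `{"exclusiveMaximum": 10.5}`.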

drwelby commented 2 years ago

ok, didn't realize I was looking at an older draft 👍

drwelby commented 2 years ago

"classification:incomplete": true, is interesting to me because saying

"range": [0, null], # or json schema object...
              "name": "data",
              "description": "Actual data values in mm"

seems redundant or out of place when the data set is describing rainfall in mm and a negative depth doesn't make sense.

m-mohr commented 2 years ago

Yeah, I'm liking it more the more I think about it, but it's less flexible and covers only some use cases, I assume. Also, it doesn't seem so wrong to exclude no-data values from statistics, because they are usually just made-up values for file formats that don't support encoding them properly. I guess we only need to clarify in raster that no-data values are invalid pixel values and as such should not be reflected in statistics etc. On the other hand, statistics are usually real min/max values, while what we want to describe here are theoretical min and max values. For example, if you have a raster with the precipitation values 1, 5 and 10, the min/max are 1 and 10, although the potential range is 0 to infinity (mostly). But maybe that's not an issue?!
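
To make the distinction concrete, here is a small sketch with made-up pixel values. It assumes -1 and -3 are the no-data values from the earlier example.

```python
pixels = [1, 5, 10, -1, -3, 5]  # made-up sample, including no-data values
nodata = {-1, -3}

# Observed statistics over valid pixels only.
valid = [p for p in pixels if p not in nodata]
observed = {"minimum": min(valid), "maximum": max(valid)}

# Theoretical range for precipitation: bounded below, open above.
theoretical = {"minimum": 0}

# Statistics over ALL pixels would be dominated by the no-data values.
naive_minimum = min(pixels)
```

The observed statistics are 1 and 10, the theoretical lower bound is 0, and the naive minimum over all pixels is the no-data value -3.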

drwelby commented 2 years ago

are we really saying that this is a continuous dataset with classed nodata and should have something roughly like:

"nodata": {
   "classification:classes": {
        ... classes
   }
}

with something else that clarifies that the range of possible data values does not include the full range of the datatype?

m-mohr commented 2 years ago

~Hmm, then I still don't have a way to express no-data values and their meanings in STAC. In file it was removed, in raster it got somewhat rejected. I really just want to express -1 is missing value, -3 is no coverage for example. And it seems it would fit in here.~

Sorry, misunderstood you initially. But still not sure, I think I like the proposal above more, because it just adds an additional field here instead of adding a new data type to an existing field. https://github.com/stac-extensions/classification/issues/33#issuecomment-1171498235

drwelby commented 2 years ago

yes, the question is more "is classification a good enough home for nodata" versus "nodata can be messy enough to warrant some kind of new extension that can use classification if needed" and I understand not wanting to start another extension...

m-mohr commented 2 years ago

Well, nodata is already part of raster, so it would be a change in that extension. But I don't like putting classification:classes into so many different places. Also, if you have no-data values and categorical values in a file, do you really want to have them in two different places?

drwelby commented 2 years ago

classification: ¯\_(ツ)_/¯: true

drwelby commented 2 years ago

The more I think about it, saying "this dataset uses classes but isn't classified" seems reasonable and simple.

m-mohr commented 2 years ago

I created PR #34 to discuss a potential solution more closely.

emmanuelmathot commented 2 years ago

> Alternatively, we could try something like the following although I'm not sure that would be valid in raster as we give "made-up" statistics / exclude no-data values from statistics. It also feels less intuitive.
>
>       "raster:bands": [
>         {
>           "unit": "mm",
>           "data_type": "float64",
>           "statistics": {
>             "minimum": 0
>           },
>           "classification:incomplete": true,
>           "classification:classes": [
>             {
>               "value": -1.0,
>               "name": "missing-value",
>               "description": "Missing value (no-data)",
>               "nodata": true
>             },
>             {
>               "value": -3.0,
>               "name": "no-coverage",
>               "description": "No coverage (no-data)",
>               "nodata": true
>             }
>           ]
>         }
>       ],
>
> Thoughts, @emmanuelmathot ?

The statistics field represents stats about the distribution of ALL pixels in the band ¯\_(ツ)_/¯, but using it for stats of only VALID PIXELS, and thus to define boundaries, is not strictly forbidden :-). For instance, we use that information to help users select the possible range. In this case, this could be interesting.


pjhartzell commented 2 years ago

> I did not consider lists of values in my proposal yet because you can emulate them just by having the classes multiple times while you can't reasonably express continuous ranges. But yeah, the full-fledged solution would be:
>
>   • integer: single categorical value
>   • array of integers: multiple categorical values
>   • json schema like object: continuous ranges
>
> Not sure whether we should cater for all, the use cases I've heard about so far were only continuous ranges.

I like the "full-fledged solution". However, even if the array of integers doesn't make it in, I prefer the JSON Schema-like object for continuous ranges for its clarity; it also leaves the door open to adding arrays of integers without having to change how continuous ranges are expressed.

Mocking up classification:classes for a VIIRS vegetation index band:

{
    ...
        "scale": 0.0001,
        "data_type": "int16",
        "classification:classes": [
            {
                "value": -13000,
                "name": "fill_land",
                "description": "Fill value over land",
                "nodata": true
            },
            {
                "value": -15000,
                "name": "fill_water",
                "description": "Fill value over ocean or fresh water",
                "nodata": true
            },
            {
                "value": {
                    "minimum": -10000,
                    "maximum": 10000
                },
                "name": "data",
                "description": "Valid range of vegetation index values"
            }
        ],
    ...
}

Perhaps not necessary, but it is nice to be able to describe the valid range of vegetation index data (a defined subset of the possible int16 values).
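
A sketch of how a client might use the mock-up above. It assumes inclusive bounds and that `scale` is applied only to values that are not no-data; the function name `interpret` is hypothetical.

```python
band = {
    "scale": 0.0001,
    "data_type": "int16",
    "classification:classes": [
        {"value": -13000, "name": "fill_land", "nodata": True},
        {"value": -15000, "name": "fill_water", "nodata": True},
        {"value": {"minimum": -10000, "maximum": 10000}, "name": "data"},
    ],
}


def interpret(raw: int):
    """Return (class name, scaled value) for a raw int16 pixel.

    The scaled value is None for no-data classes; both entries are None
    if the raw value matches no class at all.
    """
    for cls in band["classification:classes"]:
        value = cls["value"]
        if isinstance(value, dict):
            hit = value["minimum"] <= raw <= value["maximum"]
        else:
            hit = raw == value
        if hit:
            scaled = None if cls.get("nodata") else raw * band["scale"]
            return cls["name"], scaled
    return None, None
```

A raw value of 5000 falls into "data" and scales to roughly 0.5, while -13000 is reported as "fill_land" without a scaled value.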

drwelby commented 2 years ago

To me describing the valid range of a continuous dataset has nothing to do with classification. I'm not sure how a client can or should deal with that class when it isn't a class at all.

pjhartzell commented 2 years ago

@drwelby I see your point, I think. I suppose the same argument could be made for any continuous range? Or is it particular to the valid range?

drwelby commented 2 years ago

To me the valid range is akin to raster:bits_per_sample and should live there.

pjhartzell commented 2 years ago

Yep, I see the connection to bits_per_sample. In this case, the range doesn't fit cleanly into a set number of bits. But I get your point about it not being a class. I'm not concerned about including this information, so we don't need to take this any further. On the face of it, it seemed like it would make sense to describe the data range since the no-data values are also being described. But if there is no value on the client end, then no point. 🙂

m-mohr commented 2 years ago

From the STAC call: No one screamed at me when I said "ranges" are not categories. ;-)

I think we can leave this open for further feedback, but I won't push for a change here. If you only want to describe a single class of valid values (e.g. >= 0), then consider using the statistics or histogram in raster:bands.

pjhartzell commented 1 year ago

Here's an example where allowing a range for the Class object value could have been useful:

image

The cover change values are interpreted as <from class><to class>, e.g., a value of 12 indicates a change from class 1 to class 2. So they could all be mapped to unique categories. But that seems overkill.

m-mohr commented 1 year ago

@pjhartzell How would you want to expose that exactly? 12-21, 23-32, 34-43, ...? or just 12-87?

pjhartzell commented 1 year ago

For this case, [12-21, 23-32, 34-43] would be ideal. [12-87] would be a fallback if multiple ranges can't be expressed.
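
A sketch of what a client would do with such multi-range values. The single-digit class codes are an assumption inferred from the "12 means class 1 to class 2" description, and the helper names are made up.

```python
# The three ranges named above, as inclusive (low, high) pairs.
ranges = [(12, 21), (23, 32), (34, 43)]


def in_any_range(value: int, spans) -> bool:
    """True if `value` lies in at least one inclusive (low, high) span."""
    return any(low <= value <= high for low, high in spans)


def decode_change(value: int):
    """Split a <from class><to class> code into its two digits.

    Assumes single-digit class codes, so 12 becomes (1, 2).
    """
    return divmod(value, 10)
```

With the fallback [12-87], the gaps between the ranges (22 and 33 here) would also be accepted, which is the precision lost by allowing only a single range.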