Discussion: add a masking API to odc.geo to help people do opinionated masking of well known products

alexgleith commented 8 months ago

As discussed on Slack, the OpenDataCube has metadata and functionality to handle masking certain flags in a simple way.

It's not documented, whoops, but has a bunch of code around it.

The new odc-stac tool doesn't use the ODC metadata, and as such, I've been doing cloud masking ad-hoc all over the place.

I think we should set up some kind of opinionated mask tooling, and after discussion, @robbibt and @Kirill888 both agree that it makes sense to do it here in odc-geo.

This discussion is to capture thoughts on what a high- and low-level API might like and what they'll do.

alexgleith commented 8 months ago

Some thoughts.

Alex thinks something simple like baking in the SCL, fmask and qa_pixel lookup tables, would be great. Then being able to do something like masking.make_mask(ds.scl, [masking.ls.CLOUD, masking.ls.CLOUD_SHADOW], **opts) would be fantastic!

@Kirill888 said:

I was thinking of defining something along these lines for selecting bits on the lower level: bits2bool(xx.mask, 'XXX0_1XXX') true when bit3=1, bit4=0 and rest is whatever bits2bool(xx.mask, 'XXX0_1XXX', invert=True) false when bit3=1, bit4=0 and rest is whatever bits2bool(xx.mask, 'XXX0_1XXX', '11XX_XXXX') true when either of the masks match

and:

And maybe for consistency have something like enum2bool(xx.mask, val1, val2) which basically just calls out to .isin.

@robbibt said:

Yeah, the current make_mask func could definitely be improved, but the nice thing about it is that it really turns abstract bits into something that's intuitive and understandable for beginners. I'm not against an updated syntax either, but it would be really neat to maintain that usability - bit flags can get super super confusing (I still struggle with them after using them for several years...)

robbibt commented 8 months ago

This would make masking so much easier throughout the ODC ecosystem.

For usability, it would be fantastic to be able to easily compute masks using the existing .odc. extensions, rather than having to import a specific function like we currently do in datacube. For example, something like this would be extremely useful:

ds.fmask.odc.make_mask(...)

ds.red.odc.mask_invalid_data()

etc.

Once datacube is updated to import odc-geo (happening in version 1.9? @SpacemanPaul @Ariana-B?), this would essentially allow users to use the same masking logic regardless of where they load their data from (e.g. STAC or datacube itself).

alexgleith commented 8 months ago

Do you think it's a good idea to do mask_invalid_data() as an opinionated method?

I kind of think it's better to keep it a bit lower, so folks know what they're doing.

That said, the subtleties of choosing which cloud flags and options around opening/dilation etc. are pretty damn fiddly and I use other people's values all the time!

robbibt commented 8 months ago

Do you think it's a good idea to do mask_invalid_data() as an opinionated method?

I kind of think it's better to keep it a bit lower, so folks know what they're doing.

That said, the subtleties of choosing which cloud flags and options around opening/dilation etc. are pretty damn fiddly and I use other people's values all the time!

All mask_invalid_data does is set native nodata values to NaN, which is usually one of the very first steps taken across various science applications - having an easy way to do that automatically (without having to manually check nodata values on each array) would definitely be very useful functionality to expose.

Creating a custom mask from a band like pixel_quality/fmask then applying it to a dataset is definitely more complicated - I don't think we'd want to hard code any of those decisions. 🙂

cbur24 commented 8 months ago

Just adding my two cents here which is largely just to support the idea/comments above, and to provide an example of similar functionality that exists elsewhere. Something I've learned that's great about working with GEE in python are the wrapper functions groups have built around QA masking. For example, eemont provides accessor functions for rescaling and masking common surface reflectance datasets e.g.

s2 = (ee.ImageCollection('COPERNICUS/S2_SR')
s2 = s2.maskClouds()
s2 = s2.scaleAndOffset()

Or both steps can be combined into the simpler s2.preprocess(), in which case opinionated defaults are used to provide the user with a ready-to-analyse dataset. I believe the code that underlies the cloud masking is here

Given there are multitude of provenances for xarray datasets something as clean as this probably isn't possible or desirable, but if we could get to the stage of something like ds.odc.mask_clouds(product='s2', qa_band='scl', **args) then I think that would be a fantastic feature. But even if it only goes as far as ds.fmask.odc.make_mask(...) then that's still awesome.

I doubt I'd be much use in any back-end work on this but if you're looking for test dummies later on I'm happy to test the functions and provide feedback.

Kirill888 commented 8 months ago

Alright we are getting into many other desired functionalities here, which is fine but they probably deserve their own issue.

We can keep this one as a discussion related to masking of pixels, and create new issue for specific work tasks.

I figure is that there are several lower level functionalities related to "pixel masking":

Constructing boolean masks from some kind of QA band
- enum type masks (categorical data, e.g. fmask, xr.isin can be used here)
- pure bitmasks (boolean array packed into an int, e.g. LS9.QA_RADSAT)
- bit-packed small ints and bools (e.g. LS9.QA_PIXEL, some flags, some 2-bit categorical variables)
Nontrivial "cleanup" operations on masks (some combinations of dilate/erode for example)
Applying masks to data, xarray.where kinda thing, possibly with type conversion
Converting from int^nodata to float^NaN optionally with with scale and offset applied to pixel values

A lot of these can be implemented efficiently using numexpr, although it would be nice to have this work without numexpr being available.

robbibt commented 8 months ago

Alright we are getting into many other desired functionalities here, which is fine but they probably deserve their own issue.

We can keep this one as a discussion related to masking of pixels, and create new issue for specific work tasks.

I figure is that there are several lower level functionalities related to "pixel masking":

Constructing boolean masks from some kind of QA band

enum type masks (categorical data, e.g. fmask, xr.isin can be used here)

pure bitmasks (boolean array packed into an int, e.g. LS9.QA_RADSAT)

bit-packed small ints and bools (e.g. LS9.QA_PIXEL, some flags, some 2-bit categorical variables)

Nontrivial "cleanup" operations on masks (some combinations of dilate/erode for example)

Applying masks to data, xarray.where kinda thing, possibly with type conversion

Converting from int^nodata to float^NaN optionally with with scale and offset applied to pixel values

Yep, I think those functionalities would address all the main pixel masking-related tasks I can think of.

A bunch of the existing masking tools from the _masking.py module in odc.algo would be great to have exposed in odc.geo instead: https://github.com/opendatacube/odc-algo/blob/main/odc%2Falgo%2F_masking.py

Kirill888 commented 8 months ago

Related STAC extenion

https://github.com/stac-extensions/classification

Example of use:

https://planetarycomputer.microsoft.com/api/stac/v1/collections/landsat-c2-l2/items/LC09_L2SR_080122_20240113_02_T2

curl -s https://planetarycomputer.microsoft.com/api/stac/v1/collections/landsat-c2-l2/items/LC09_L2SR_080122_20240113_02_T2 \
  | jq .assets.qa_pixel

Sample Asset JSON

```json { "href": "https://landsateuwest.blob.core.windows.net/landsat-c2/level-2/standard/oli-tirs/2024/080/122/LC09_L2SR_080122_20240113_20240114_02_T2/LC09_L2SR_080122_20240113_20240114_02_T2_QA_PIXEL.TIF", "classification:bitfields": [ { "name": "fill", "length": 1, "offset": 0, "classes": [ { "name": "not_fill", "value": 0, "description": "Image data" }, { "name": "fill", "value": 1, "description": "Fill data" } ], "description": "Image or fill data" }, { "name": "dilated_cloud", "length": 1, "offset": 1, "classes": [ { "name": "not_dilated", "value": 0, "description": "Cloud is not dilated or no cloud" }, { "name": "dilated", "value": 1, "description": "Cloud dilation" } ], "description": "Dilated cloud" }, { "name": "cirrus", "length": 1, "offset": 2, "classes": [ { "name": "not_cirrus", "value": 0, "description": "Cirrus confidence is not high" }, { "name": "cirrus", "value": 1, "description": "High confidence cirrus" } ], "description": "Cirrus mask" }, { "name": "cloud", "length": 1, "offset": 3, "classes": [ { "name": "not_cloud", "value": 0, "description": "Cloud confidence is not high" }, { "name": "cloud", "value": 1, "description": "High confidence cloud" } ], "description": "Cloud mask" }, { "name": "cloud_shadow", "length": 1, "offset": 4, "classes": [ { "name": "not_shadow", "value": 0, "description": "Cloud shadow confidence is not high" }, { "name": "shadow", "value": 1, "description": "High confidence cloud shadow" } ], "description": "Cloud shadow mask" }, { "name": "snow", "length": 1, "offset": 5, "classes": [ { "name": "not_snow", "value": 0, "description": "Snow/Ice confidence is not high" }, { "name": "snow", "value": 1, "description": "High confidence snow cover" } ], "description": "Snow/Ice mask" }, { "name": "clear", "length": 1, "offset": 6, "classes": [ { "name": "not_clear", "value": 0, "description": "Cloud or dilated cloud bits are set" }, { "name": "clear", "value": 1, "description": "Cloud and dilated cloud bits are not set" } ], "description": "Clear mask" }, { "name": "water", "length": 1, "offset": 7, "classes": [ { "name": "not_water", "value": 0, "description": "Land or cloud" }, { "name": "water", "value": 1, "description": "Water" } ], "description": "Water mask" }, { "name": "cloud_confidence", "length": 2, "offset": 8, "classes": [ { "name": "not_set", "value": 0, "description": "No confidence level set" }, { "name": "low", "value": 1, "description": "Low confidence cloud" }, { "name": "medium", "value": 2, "description": "Medium confidence cloud" }, { "name": "high", "value": 3, "description": "High confidence cloud" } ], "description": "Cloud confidence levels" }, { "name": "cloud_shadow_confidence", "length": 2, "offset": 10, "classes": [ { "name": "not_set", "value": 0, "description": "No confidence level set" }, { "name": "low", "value": 1, "description": "Low confidence cloud shadow" }, { "name": "reserved", "value": 2, "description": "Reserved - value not used" }, { "name": "high", "value": 3, "description": "High confidence cloud shadow" } ], "description": "Cloud shadow confidence levels" }, { "name": "snow_confidence", "length": 2, "offset": 12, "classes": [ { "name": "not_set", "value": 0, "description": "No confidence level set" }, { "name": "low", "value": 1, "description": "Low confidence snow/ice" }, { "name": "reserved", "value": 2, "description": "Reserved - value not used" }, { "name": "high", "value": 3, "description": "High confidence snow/ice" } ], "description": "Snow/Ice confidence levels" }, { "name": "cirrus_confidence", "length": 2, "offset": 14, "classes": [ { "name": "not_set", "value": 0, "description": "No confidence level set" }, { "name": "low", "value": 1, "description": "Low confidence cirrus" }, { "name": "reserved", "value": 2, "description": "Reserved - value not used" }, { "name": "high", "value": 3, "description": "High confidence cirrus" } ], "description": "Cirrus confidence levels" } ], "type": "image/tiff; application=geotiff; profile=cloud-optimized", "roles": [ "cloud", "cloud-shadow", "snow-ice", "water-mask" ], "title": "Pixel Quality Assessment Band", "description": "Collection 2 Level-1 Pixel Quality Assessment Band (QA_PIXEL)", "raster:bands": [ { "unit": "bit index", "nodata": 1, "data_type": "uint16", "spatial_resolution": 30 } ] } ```

JonDHo commented 7 months ago

Alright we are getting into many other desired functionalities here, which is fine but they probably deserve their own issue.

We can keep this one as a discussion related to masking of pixels, and create new issue for specific work tasks.

I figure is that there are several lower level functionalities related to "pixel masking":

Constructing boolean masks from some kind of QA band

enum type masks (categorical data, e.g. fmask, xr.isin can be used here)

pure bitmasks (boolean array packed into an int, e.g. LS9.QA_RADSAT)

bit-packed small ints and bools (e.g. LS9.QA_PIXEL, some flags, some 2-bit categorical variables)

Nontrivial "cleanup" operations on masks (some combinations of dilate/erode for example)

Applying masks to data, xarray.where kinda thing, possibly with type conversion

Converting from int^nodata to float^NaN optionally with with scale and offset applied to pixel values

A lot of these can be implemented efficiently using numexpr, although it would be nice to have this work without numexpr being available.

I agree with these points as well and have a few additional thoughts/comments:

scaling & offset seems to be becoming more complex rather than less, for example the move from Landsat collection 1 to 2 became more confusing for users (compared to just dividing by 10000) and the new Sentinel 2 collection from Element 84 going the same way. It is not clear to me yet whether the new S2 collection will have a consistent required offset across all scenes, or whether we are going to have to discover the relevant offset from metadata per scene and apply it on-the-fly... this is a slightly more general issue but in a perfect world, I think dc.load() should include some options related to scaling and offset which could be automatically applied from product definitions, dataset metadata or file metadata if available.
I know it isn't necessarily that common, but there are some use cases where more complex masking is required. One example that I have is Sentinel-3 data, where there can be multiple qa flags on one pixel within the one qa "measurement". This is most likely just a case of having suitable examples of use, but it is important to make sure that any api facilitates this use as well.
another common requirement I come across is to summarise the number of different flags in an area of interest through time (e.g. number of cloudy pixels in a dataset through time). This is obviously easy enough to do, but perhaps a utility function could be useful.

robbibt commented 7 months ago

scaling & offset seems to be becoming more complex rather than less, for example the move from Landsat collection 1 to 2 became more confusing for users (compared to just dividing by 10000) and the new Sentinel 2 collection from Element 84 going the same way. It is not clear to me yet whether the new S2 collection will have a consistent required offset across all scenes, or whether we are going to have to discover the relevant offset from metadata per scene and apply it on-the-fly... this is a slightly more general issue but in a perfect world, I think dc.load() should include some options related to scaling and offset which could be automatically applied from product definitions, dataset metadata or file metadata if available.

I know it isn't necessarily that common, but there are some use cases where more complex masking is required. One example that I have is Sentinel-3 data, where there can be multiple qa flags on one pixel within the one qa "measurement". This is most likely just a case of having suitable examples of use, but it is important to make sure that any api facilitates this use as well.

another common requirement I come across is to summarise the number of different flags in an area of interest through time (e.g. number of cloudy pixels in a dataset through time). This is obviously easy enough to do, but perhaps a utility function could be useful.

I agree that scaling/offsets are a huge pain point for our users - anything we can do to make that stuff easier would be extremely valuable. The masking tools from odc.algo support re-scaling - being able to access similar functionality easily (and even at load time) would be really helpful.

caitlinadams commented 6 months ago

Thanks for the great discussion so far. I agree that opionated cloud masking is a really useful function, but am curious about whether it belongs in odc-geo, or whether it should have a different home. If you're interested, please have a look at this discussion I've started: https://github.com/opendatacube/datacube-core/discussions/1566

opendatacube / odc-geo

Discussion: add a masking API to odc.geo to help people do opinionated masking of well known products #117