ua-snap / cmip6-utils

Pipelines and utilities for working with CMIP6 data

Implement land/sea masking using model-specific land/sea area percentage #48

Open Joshdpaul opened 7 months ago

Joshdpaul commented 7 months ago

Working from the analysis in this notebook, we have determined that we need a more robust process for regridding the land-only and sea-only variables. There are a few unique "scenarios" that the models fall into with regard to treatment of NaN values in land/sea pixels, the general shapes of land/sea masks, and how the data in these different models behave when being regridded. A first draft of a working regridding function for land/sea-only variables is in this notebook.

As of 3/25/24, we are documenting this as a known issue which will need to be revisited when land/sea-only variables are required for computing CMIP6 indicators, or when we have more development time to devote to these specific variables.

Our main goals are:

We now know that we cannot use the exact same regridding pipeline as we use for variables with global coverage, because it produces outputs that do not satisfy the goals above. To revise our regridding process to work with land-only and sea-only variables:

Joshdpaul commented 7 months ago

An additional feature we would like in the regridded output for land/sea-only variables is a flag indicating whether a given pixel was extrapolated. This would most likely be an additional variable in the regridded dataset that uses True/False or 0/1 values to mark the extrapolated pixels. To compute it, we may have to regrid the data the "wrong way" (i.e., without using the extrapolation option in the xESMF.Regridder() function) and then compute the difference between the extrapolated regridded dataset and the un-extrapolated version.
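
The flag described above can be sketched with plain NumPy, assuming the two regridded fields (with and without extrapolation) are already in hand as arrays on the same target grid. The function name `extrapolation_flag` and the toy arrays are hypothetical, and the regridding step itself is not shown. Rather than subtracting the two fields, this sketch compares where each one is NaN, which should carry the same information without doing arithmetic on NaN values.

```python
import numpy as np


def extrapolation_flag(extrapolated, unextrapolated):
    """Return a 0/1 array marking pixels that only have data after extrapolation.

    Both inputs are assumed to be on the same target grid. A pixel is
    flagged when the un-extrapolated field is NaN there but the
    extrapolated field holds a value.
    """
    extrapolated = np.asarray(extrapolated, dtype=float)
    unextrapolated = np.asarray(unextrapolated, dtype=float)
    if extrapolated.shape != unextrapolated.shape:
        raise ValueError("inputs must share the same grid shape")
    flag = np.isnan(unextrapolated) & ~np.isnan(extrapolated)
    return flag.astype(np.uint8)


# Toy 2x3 grids standing in for regridded output (hypothetical values).
no_extrap = np.array([[1.0, np.nan, np.nan], [2.0, 3.0, np.nan]])
with_extrap = np.array([[1.0, 1.5, np.nan], [2.0, 3.0, 3.2]])
flag = extrapolation_flag(with_extrap, no_extrap)
```

Pixels that are NaN in both fields stay unflagged, so the flag would need to be read alongside the data variable's own missing-value mask.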

charparr commented 7 months ago

Josh and I discussed how to handle the variability in the land/ocean mask across models. The Regridder object in play accommodates masks for both the source and target grids and extrapolates accordingly - but the knob we can turn is the land percentage value used to threshold land vs. ocean grid cells. Currently the threshold is 99, which is a conservative and defensible value: we only include grid cells that are almost certain to be terrestrial. However, this approach may indiscriminately discard terrestrial variable data, especially at coarse grid resolutions. I suggested two potential approaches to better understand and/or improve how we handle masking of the source and target datasets:

1.) Naively test many thresholds (e.g., in increments of 5 or 10 between 0 and 100). Examine how the resulting masks respond to understand how sensitive they are to the threshold and to identify the threshold that captures the most information. This is the method developed by Parr et al. to discriminate different snow classes based on snow depth raster data.


2.) Apply a known threshold algorithm (e.g., Otsu, Li, etc.) and use it as a way to have a single method, but a bespoke threshold value, for each of the different models. Many of these are baked into scikit-image, see here for reference and illustration: https://scikit-image.org/docs/stable/auto_examples/applications/plot_thresholding_guide.html
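
Both approaches can be prototyped on a single model's land-percentage field. The sketch below assumes a 2-D array of land percentages in the range 0 to 100 and treats a cell as land when its value is `>=` the threshold; that comparison convention, the function names `sweep_thresholds` and `otsu_threshold`, and the toy grid are all hypothetical. Otsu's method is hand-rolled in NumPy here only to keep the example self-contained; scikit-image's `skimage.filters.threshold_otsu` would be the library route.

```python
import numpy as np


def sweep_thresholds(land_pct, thresholds):
    """Map each threshold to the fraction of grid cells classed as land."""
    land_pct = np.asarray(land_pct, dtype=float)
    return {int(t): float(np.mean(land_pct >= t)) for t in thresholds}


def otsu_threshold(values, nbins=100, value_range=(0.0, 100.0)):
    """Hand-rolled Otsu: bin center that maximizes between-class variance."""
    values = np.asarray(values, dtype=float).ravel()
    values = values[~np.isnan(values)]
    hist, edges = np.histogram(values, bins=nbins, range=value_range)
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)            # cell count at or below each bin
    w1 = values.size - w0           # cell count above each bin
    s0 = np.cumsum(hist * centers)  # running sum of binned values
    valid = (w0 > 0) & (w1 > 0)     # both classes must be non-empty
    if not valid.any():
        raise ValueError("all values fall in one bin; no threshold exists")
    m0 = s0[valid] / w0[valid]
    m1 = (s0[-1] - s0[valid]) / w1[valid]
    between = w0[valid] * w1[valid] * (m0 - m1) ** 2
    return float(centers[valid][np.argmax(between)])


# Toy 3x4 land-percentage grid (hypothetical values).
land_pct = np.array(
    [[0, 0, 5, 60],
     [0, 10, 95, 100],
     [0, 40, 100, 100]],
    dtype=float,
)

# Approach 1: sweep thresholds in steps of 10 and inspect land coverage.
coverage = sweep_thresholds(land_pct, range(0, 101, 10))

# Approach 2: one method, model-specific threshold value.
otsu_t = otsu_threshold(land_pct)
otsu_mask = land_pct >= otsu_t

# The threshold currently in use, for comparison.
current_mask = land_pct >= 99
```

Comparing `coverage` across models would show how quickly land cells drop out as the threshold rises, while `otsu_mask` versus `current_mask` shows which cells a data-driven threshold would retain that the fixed value of 99 discards.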