Implement land/sea masking using model-specific land/sea area percentage

Joshdpaul commented 7 months ago

Working from the analysis in this notebook, we have determined that we need a more robust process for regridding the land-only and sea-only variables. There are a few unique "scenarios" that the models fall into with regard to treatment of NaN values in land/sea pixels, the general shapes of land/sea masks, and how the data in these different models behave when being regridded. A first draft of a working regridding function for land/sea-only variables is in this notebook.

As of 3/25/24, we are documenting this as a known issue which will need to be revisited when land/sea-only variables are required for computing CMIP6 indicators, or when we have more development time to devote to these specific variables.

Our main goals are:

Regardless of model, ensure that every regridded output has the exact same land mask (i.e., the coastline should appear in the same place in every regridded output for every model)
For any given grid coordinate, all regridded outputs should have either real data values or NaN (i.e., there should never be a case where a given grid coordinate has real data values in some models, but NaN values in others.)
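Both goals come down to every regridded output sharing the same NaN footprint, which can be checked directly. The following is a minimal sketch, assuming each regridded output is available as a 2-D NumPy array on the common target grid. The function name `masks_agree` and the toy arrays are hypothetical and are not part of the existing pipeline.

```python
import numpy as np


def masks_agree(regridded):
    """Return True if every array has NaN in exactly the same grid cells.

    `regridded` is a dict mapping a (hypothetical) model name to a 2-D array
    on the common target grid.
    """
    nan_masks = [np.isnan(arr) for arr in regridded.values()]
    reference = nan_masks[0]
    return all(np.array_equal(reference, mask) for mask in nan_masks[1:])


# Toy outputs: models A and B share a coastline, model C has data in a cell
# where the others have NaN.
model_a = np.array([[1.0, np.nan], [2.0, 3.0]])
model_b = np.array([[5.0, np.nan], [6.0, 7.0]])
model_c = np.array([[5.0, 4.0], [6.0, 7.0]])

consistent = masks_agree({"A": model_a, "B": model_b})
inconsistent = masks_agree({"A": model_a, "C": model_c})
print(consistent, inconsistent)
```

A check like this could run after regridding as a quality-control step, so that any model violating the goals is caught before the outputs are used downstream.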

We now know that we cannot use the exact same regridding pipeline as we use for variables with global coverage, because it produces outputs that do not satisfy the goals above. To revise our regridding process to work with land-only and sea-only variables:

Use of zero as a NaN value in some models requires the use of model-specific land area percentage masks; these data are found in the fixed frequency (fx or Ofx) data and will now need to be included in the CMIP6 transfers pipeline.
We cannot assume that fixed frequency datasets use the same grid as the land/sea variables within the same model; there is an additional intra-model regridding step that needs to align these grids in order to create the model-specific land area percentage masks.
Extrapolation is required to fill NaN values inland of the target grid land mask, if they exist.
Using low land-area percentage thresholds (e.g., 1%) to determine the model-specific land mask produces unpredictable results that may not satisfy the goals above. Using higher thresholds (e.g., 99%) produces much more consistent results. However, some kind of sensitivity analysis is needed to determine the best land area percentage to use in creating land masks. Especially with larger grid cells (100 km), we may lose quite a bit of coastal area by specifying a high threshold for masking. More work needs to be done on this point.
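The thresholding step described above can be sketched with plain NumPy. This is an illustration only, assuming the land area percentage field has already been aligned to the same grid as the variable (the intra-model regridding step). The function name `apply_land_mask`, the use of `>=` for the comparison, and the toy values are assumptions, not the pipeline's actual code.

```python
import numpy as np


def apply_land_mask(data, land_pct, threshold=99.0):
    """Keep data only where land area percentage meets the threshold.

    Cells below the threshold become NaN. Zeros in cells classified as land
    are kept, since there they may be real values rather than fill values.
    """
    is_land = land_pct >= threshold
    masked = np.where(is_land, data, np.nan)
    return masked, is_land


# Toy land-only variable where the model writes 0.0 over the ocean.
land_pct = np.array([[100.0, 60.0, 0.0],
                     [100.0, 99.5, 0.0]])
data = np.array([[0.0, 0.3, 0.0],
                 [0.4, 0.0, 0.0]])

masked, is_land = apply_land_mask(data, land_pct)
print(masked)
```

In this toy case the zeros at land cells survive, while the zeros over ocean and the partially land-covered cell (60%) become NaN. That partial cell is the coastal loss that a high threshold trades for consistency. A boolean array like `is_land` is also the kind of input that could serve as a source-grid mask for the regridder.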

Joshdpaul commented 7 months ago

An additional feature we would like to have in the regridded output for land/sea-only variables is some sort of flag indicating whether or not a given pixel was extrapolated. This would most likely be an additional variable in the regridded dataset that uses only True/False or 0/1 values to represent the extrapolated pixels. To compute this, we may have to regrid the data the "wrong way" (i.e., without using the extrapolation option in the xESMF.Regridder() function) and then compute the difference between the extrapolated regridded dataset and the un-extrapolated version.
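A minimal sketch of how such a flag might be derived, assuming the two regridded results (with and without extrapolation) are already available as NumPy arrays on the same grid. Rather than subtracting the two arrays, this version compares their NaN footprints, which avoids NaN arithmetic. The function name `extrapolation_flag` and the toy arrays are hypothetical.

```python
import numpy as np


def extrapolation_flag(without_extrap, with_extrap):
    """Return 1 where a value exists only in the extrapolated output, else 0."""
    filled = np.isnan(without_extrap) & ~np.isnan(with_extrap)
    return filled.astype(np.int8)


# Toy regridded outputs on the same target grid.
without_extrap = np.array([[1.0, np.nan],
                           [np.nan, np.nan]])
with_extrap = np.array([[1.0, 1.2],
                        [0.9, np.nan]])

flag = extrapolation_flag(without_extrap, with_extrap)
print(flag)
```

Cells that are NaN in both outputs (e.g., open ocean for a land-only variable) are flagged 0, so the flag marks only pixels that extrapolation actually filled.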

charparr commented 7 months ago

Josh and I discussed how to handle the variability in the land/ocean mask across models. The Regridder object in play accommodates masks for both the source and target grids and extrapolates accordingly - but the knob we can turn is the land percentage value used to threshold land vs. ocean grid cells. Currently the threshold is 99, which is a conservative and defensible value: we only include grid cells that are almost certain to be terrestrial. However, this approach may indiscriminately discard terrestrial variable data, especially at coarse grid resolutions. I suggested two potential approaches to better understand and/or enhance how we handle masking of the source and target datasets:

1.) Naively test many thresholds (e.g., in increments of 5 or 10 between 0 and 100). Examine the response of the resulting masks to understand how sensitive they are to the threshold and to identify the threshold that captures the most information. This is the method developed by Parr et al. to discriminate different snow classes based on snow depth raster data.

2.) Apply a known threshold algorithm (e.g., Otsu, Li, etc.) and use it as a way to have a single method, but a bespoke threshold value, for all the different models. Many of these are baked into scikit-image, see here for reference and illustration: https://scikit-image.org/docs/stable/auto_examples/applications/plot_thresholding_guide.html
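Approach 1 could be prototyped along these lines. This is a rough sketch, assuming a model's land area percentage field is available as a NumPy array. The function name `sweep_thresholds`, the `>=` comparison, and the toy values are all hypothetical.

```python
import numpy as np


def sweep_thresholds(land_pct, thresholds):
    """Count how many cells are classified as land at each threshold."""
    return {float(t): int((land_pct >= t).sum()) for t in thresholds}


# Toy land area percentages for ten grid cells.
land_pct = np.array([0, 0, 5, 20, 45, 80, 95, 100, 100, 100], dtype=float)

counts = sweep_thresholds(land_pct, np.arange(0, 101, 10))
for threshold, n_land in counts.items():
    print(f"threshold {threshold:5.1f}% -> {n_land} land cells")
```

Plotting the land-cell count (or the retained land area) against the threshold for each model would show where the mask is most sensitive. For approach 2, the fixed list of thresholds could be replaced by a data-driven value such as the one returned by scikit-image's `threshold_otsu`, which is not shown here.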

ua-snap / cmip6-utils

Implement land/sea masking using model-specific land/sea area percentage #48