pangeo-data / xESMF

Universal Regridder for Geospatial Data
http://xesmf.readthedocs.io/
MIT License
183 stars 32 forks source link

Clarification on xESMF treatment of masks and missing values #256

Open blimlim opened 1 year ago

blimlim commented 1 year ago

Hi everyone!

I have been regridding some global ocean data and was hoping for some clarification on how xESMF handles masks and missing values, as I've noticed some behaviour I don't quite understand. I've read through #22 of the old repository but didn't see this behaviour appear there.

I have data covering the globe on a lon/lat grid, which I want to interpolate to a coarser grid using bilinear interpolation. I only want to use the ocean points, so have tried a few different methods of masking out the land:

1. Supplying an input binary mask to the input grid in the xe.Regridder call: This results in the following: regrid_input_mask (Figure 1)

2. Setting all land values to nan, and regridding with no masks supplied to xe.Regridder From what I understand, any output grid points which are next to an nan on the input data will be set to nan. This results in regrid_input_nand (Figure 2)

Looking closer at the above two results, the first outputs values at more points, specifically along the western coasts. The difference between the two results is shown below: diff_input_masked_input_nand (Figure 3)

I was wondering if anyone knows how these extra values are calculated? In #22 of the old repository, it sounds like providing an input mask is equivalent to setting the inputs under the mask to 0, and then regridding. However, if I do this manually without a mask, I get a different result again:

3. Setting land points to 0, and regridding with no masks supplied to xe.Regridder regrid_input_zeroed (Figure 4) This results in soft edges at the boundaries, which makes sense, however this is quite different to the first plot.

The main thing I'm hoping to find out is how the extra values in Figure 1 compared to Figure 2 are calculated. If they are a result of an interpolation with 0 at the missing points, it might be best for me not not use them in case the values are artificially low.

Let me know if there's any data/code or more info I can provide that would be helpful. I tried to set up a simple artificial example but wasn't able to get the same behaviour with it (methods 1 and 2 gave the same results when I tried using some simple constructed data).

Thank you for your help with this!

raphaeldussin commented 1 year ago

To regrid to a coarser grid, I recommend you use the conservative_normed method, passing the masks for both input and output grids. without mask defined, land values (NaN or zeros) can bleed into the regridded field.

Thomas-Moore-Creative commented 1 year ago

Hi @raphaeldussin - I've just bumped into this issue thread. Thanks to @blimlim for posting and your answer.

WRT the above are their any good, clear examples on using the conservative_normed method? My experience is that it's possible some xESMF users are not as careful and thoughtful as @blimlim and just press ahead with bilinear and use the possibly problematic result? Worst case is errors in coastal cells are not picked up and used as good data?

One challenge might be understanding how to easily generate the "output" mask for the coarser grid of choice. Are there any examples we can point people to specifically on that?

The answer probably is "just go RTF manual" but if there are any specific pages / examples you could point to that would be helpful?

raphaeldussin commented 1 year ago

indeed the docs is a good starting point ;) https://xesmf.readthedocs.io/en/latest/notebooks/Masking.html

Also, conservative normed does a better job conserving statistics across resolutions.

blimlim commented 1 year ago

Thanks @raphaeldussin and @Thomas-Moore-Creative for your help with this!

Just hoping to confirm, to use the conservative normed regridding, will I need to also supply the corners of the grid cells? I'll try to obtain these.

I'll be using these files as ancillaries for an atmosphere model, and so am thinking it will be important for them to be periodic in longitude. I've read that the periodic parameter is forced to be False when using the conservative normed method. Would you know if this is still correct?

raphaeldussin commented 1 year ago

yes to both. there is an issue #28 opened about periodicity and conservative that is yet to be investigated. That might be something that might interest you @blimlim, probably not much code changes necessary here.

aulemahal commented 1 year ago

@blimlim If your data is on a regular grid (1D lat and 1D lon), then cf-xarray can guess the bounds for you. It is a dependency of xESMF so it could even be done automatically under-the-hood.

As for the periodicity issue, from my experience, as long as your bounds are periodic, the result should respect this periodicity. I mean, the west bounds of the westmost cell should be the exact same as the east bound of the eastmost cell. Dependending on the exact coordinates, there is a probability that guessed bounds will not be perfect, in which case model-provided bounds would be the best alternative.

blimlim commented 1 year ago

Thanks @raphaeldussin and @aulemahal for your help! I do have access to the grid bounds, so I'll have a go at the conservative normed regridding. At the moment I'm new to the field so still getting my head around everything, but if in the future the periodicity question hasn't yet been looked at, that would be interesting to investigate!

maresb commented 6 days ago

Hey, as a first-time user I was just in the same situation as @blimlim (needing to coarsely regrid SST data containing NaNs), and I feel like I spent a lot of time going down the wrong path before I found a simple and clean way of doing "the right thing".

Like @blimlim, I also started on https://github.com/JiaweiZhuang/xESMF/issues/22. That seems to rank high in search engine results, so I think if we could agree on some official advice, it would be great to crosspost there to ensure people discover the answer soon.

I also spent a lot of time trying to understand the [comparison of regridding algorithms notebook](https://github.com/pangeo-data/xESMF/blob/master/doc/notebooks/Compare_algorithms.ipynb which I think was not so helpful to me since the notebook doesn't address NaN values, and it fails to provide an example that distinguishes conservative and conservative_normed.

Some time later I discovered the masking notebook, and that focuses on this notion of a binary mask that should be added as a data variable based on the location of the NaN values of the input data. This is pretty messy, requiring some awkward boilerplate to set up by hand. After going through the steps, I was dismayed that the result was representing NaN values with 0.

Then hidden away near the end of the masking notebook I noticed this lovely "Adaptive masking" section. You simply add skipna=True to your regridder, and it fills in as many pixels as possible. In case you want to have a threshold for the amount of missing data, adjust na_thres (the default 1.0 means fill in everything possible, while 0.0 propagates all NaNs).

Thus I believe the standard recommendation should be:

To decrease resolution,

regridder = xesmf.Regridder(ds, ds_coarse_template, method="conservative")
ds_coarse = regridder(
    ds,
    skipna=True,
    na_thres=1.0,  # lowering this results in more NaNs in ds_coarse
)

And conservative should be preferred over bilinear when decreasing resolution because bilinear will throw away many of the input points. On the other hand, conservative considers all the overlapping input points and thus produces a more stable result with less noise.

To increase resolution,

regridder = xesmf.Regridder(ds, ds_fine_template, method="bilinear")
ds_fine = regridder(
    ds,
    skipna=True,
    na_thres=1.0,  # lowering this results in more NaNs in ds_fine
)

And bilinear is preferable to conservative for increasing resolution because it produces a smoother interpolation.

As far as I can tell, adaptive masking makes conservative_normed method obsolete since it corresponds to na_thres=1.0. Furthermore, it seems to me like the whole manual masking process described in the first ¾ of the masking notebook is obsolete.

My advice above so far ignores the patch method, my impression being that it's an advanced method for producing extra-smooth interpolation for when bilinear isn't good enough. I also ignore nearest_s2d and nearest_d2s which seem too primitive for most use cases.

Is my understanding at least somewhat correct? If so, would it be helpful to update the documentation to make this more modern advice more prominent? (And perhaps merging the masking and comparison notebooks?) Thanks!