ks905383 / xagg

Aggregating gridded data (xarray) to polygons
https://xagg.readthedocs.io/
GNU General Public License v3.0
79 stars 15 forks source link

How to handle missing data #78

Open jwyslmh opened 1 week ago

jwyslmh commented 1 week ago

Data sent to Kevin.zip Dear all, Thank you so much for your help!

Attached are the temperature data (downloaded from https://disc.gsfc.nasa.gov/datasets/M2I6NPANA_5.12.4/summary), provincial map shapefile of China (downloaded from https://www.resdc.cn/DOI/DOI.aspx?DOIID=122 ), and the aggregation python code.

The range of original temperature data is [235.44429,306.7428]. The mean is 286.174. More than half of the data are missing. Since the valid data are above 200, and based on the logic "That if a region is covered partially by pixels that are nans, that those are ignored in the aggregation calculation", I am wondering why 'xagg' produces results of 0 (kelvin) and values under 100 (as can be seen in the output 20231231.csv).

Thank you so much! With best wishes, Minhui

ks905383 commented 1 week ago

Hi Minhui,

Thanks for sharing the data that raised the issue you were running into.

I was able to replicate the issue you were running into, but with the following warning raised:

UserWarning: One or more of the pixels in variable T have nans in them in the dimensions time, lev. The code can currently only deal with pixels for which the *entire* pixel has nan values in all dimensions, however there is currently no  support for data in which pixels have only some nan values. The aggregation calculation is likely incorrect.

Here's what's happening:

When xagg calculates overlaps between pixels and polygons, it calculates one map between a certain grid and a certain set of polygons. This implicitly assumes that you can use the same geographic (lat/lon) weights for any observation put into xa.aggregate() - i.e., for any timestep, or any atmospheric layer.

In the data you're working with, since it's on pressure levels, a given location / pressure level / timestep may have data, but a different timestep may not (because the pressure level is below/above topography at a given time). In other words, the same weightmap doesn't apply if it assumes data at all locations that are never nan (since the normalization would be different between timesteps). This is what's causing the warning above to show up, and also the aggregation to be incorrect.

To "fix" this, xagg would need to recalculate the weightmap for every non-geographic coordinate if it recognizes that only some values in a given dimension have nans in the data. I'm a bit loath to do this, however, because in a sense, you're then showing an aggregation over different locations each timestep, but they're implicitly returned as if they were all the same aggregation. To be honest, I'm almost more leaning towards the other extreme, of just returning nans for the whole polygon if there's inconsistent data coverage within (or to allow that as an option in xa.set_options()). Let me know if anyone has thoughts!

In the meantime, here's a workaround if you do want to just have the weightmap recalculated with every timestep/pressure level's configuration of valid data:

df_out = []

# Process for each timestep, level individually
for time in ds.time.values:
    for lev in ds.lev.values:
        # Silencing output to avoid a long list of stdout text
        with xa.set_options(silent=True):
            # Using sel with [] keeps the time, lev columns in the resultant
            # output dataframe, making concatenation easier
            wm = xa.pixel_overlaps(ds.sel(time=[time],lev=[lev]),gdf,subset_bbox=False)
            agg = xa.aggregate(ds.sel(time=[time],lev=[lev]),wm)

        df_out.append(agg.to_dataframe())

# Concatenate into single df
df_out = pd.concat(df_out)

(One quick question, though - I'm not sure if this would be relevant to your work, but is there a specific reason you're using temperature on pressure levels instead of near-surface temperature, or a temperature a set pressure above the ground for what you're looking at? Since right now you're looking at temperature at a different altitude above ground at every location)