roocs / clisops

Climate Simulation Operations
https://clisops.readthedocs.io/en/latest/
Other
21 stars 9 forks source link

Support in subset_bbox for unstructured grids such as station data #161

Open huard opened 3 years ago

huard commented 3 years ago

Description

Point observations are usually formatted with a variable defined along region and time dimensions. Geographical coordinates (lat, lon) for stations are themselves defined along region. With clisops 0.6.3, this data model is wrongly interpreted as a curvilinear grid, and masks stations outside of the bounding box instead of removing them entirely from the region dimension.

aulemahal commented 3 years ago

Second issue relating to unstructured data: In this example, the 1D coord shared by lat and lon is location and is a list of city names. This makes subset_bbox fail:

from xclim.testing import open_dataset
from clisops.core.subset import subset_bbox

ds = open_dataset("ERA5/daily_surface_cancities_1990-1993.nc")
subset_bbox(ds, lat_bnds=[44, 46], lon_bnds=[-75, -60])

fails with

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-a0ff1095b7a5> in <module>
      1 ds2 = ds.copy()
      2 ds2['location'] = [1, 2, 3, 4, 5]
----> 3 subset_bbox(ds, lat_bnds=[44, 46], lon_bnds=[-75, -60])

~/.conda/envs/xclim/lib/python3.8/site-packages/clisops/core/subset.py in func_checker(*args, **kwargs)
    257                     kwargs[lon][kwargs[lon] < 0] -= 360
    258 
--> 259         return func(*args, **kwargs)
    260 
    261     return func_checker

~/.conda/envs/xclim/lib/python3.8/site-packages/clisops/core/subset.py in subset_bbox(da, lon_bnds, lat_bnds, start_date, end_date, first_level, last_level)
    861             try:
    862                 coords = da[d][ind[i]]
--> 863                 bnds = _check_desc_coords(
    864                     coord=da[d],
    865                     bounds=[coords.min().values, coords.max().values],

~/.conda/envs/xclim/lib/python3.8/site-packages/clisops/core/subset.py in _check_desc_coords(coord, bounds, dim)
    955 def _check_desc_coords(coord, bounds, dim):
    956     """If Dataset coordinates are descending reverse bounds."""
--> 957     if np.all(coord.diff(dim=dim) < 0) and len(coord) > 1:
    958         bounds = np.flip(bounds)
    959     return bounds

~/.conda/envs/xclim/lib/python3.8/site-packages/xarray/core/dataarray.py in diff(self, dim, n, label)
   3105         DataArray.differentiate
   3106         """
-> 3107         ds = self._to_temp_dataset().diff(n=n, dim=dim, label=label)
   3108         return self._from_temp_dataset(ds)
   3109 

~/.conda/envs/xclim/lib/python3.8/site-packages/xarray/core/dataset.py in diff(self, dim, n, label)
   5487             if dim in var.dims:
   5488                 if name in self.data_vars:
-> 5489                     variables[name] = var.isel(**kwargs_end) - var.isel(**kwargs_start)
   5490                 else:
   5491                     variables[name] = var.isel(**kwargs_new)

~/.conda/envs/xclim/lib/python3.8/site-packages/xarray/core/variable.py in func(self, other)
   2299             with np.errstate(all="ignore"):
   2300                 new_data = (
-> 2301                     f(self_data, other_data)
   2302                     if not reflexive
   2303                     else f(other_data, self_data)

TypeError: unsupported operand type(s) for -: 'str' and 'str'
aulemahal commented 3 years ago

I edited my comment above because the first part was no more relevant.

Here are details on the top issue:

MWE, same dataset as my comment above:

from xclim.testing import open_dataset
from clisops.core.subset import subset_bbox

ds = open_dataset("ERA5/daily_surface_cancities_1990-1993.nc")

# Trick to avoid error mentioned above.
ds2 = ds.copy() 
ds2['location'] = [1, 2, 3, 4, 5]

subset_bbox(ds2, lat_bnds=[45, 50], lon_bnds=[-125, -70])

Those lat and lon bounds are supposed to select locations 2 and 5. As expected, location 1 is dropped. However elements 3 and 4 are still in the output dataset, but with NaNs everywhere, masked instead of dropped.

I think this comes from the fact that the lat/lon bounds are translated to bounds on the shared dim (location), as would make sense with a curvilinear grid. May be a special case for unstructured grids would be nice (the case when lat and lon are 1D and share the same dimension).