pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Dropping variables when accessing remote datasets via pydap #8895

Closed Mikejmnez closed 2 months ago

Mikejmnez commented 7 months ago

Is your feature request related to a problem?

I ran into the following issue when trying to access a remote dataset. Here is the concrete example that reproduces the error.

from pydap.client import open_url
from pydap.cas.urs import setup_session
import xarray as xr
import numpy as np

username = "_UsernameHere_"
password = "_PasswordHere_"
filename = 'Daymet_Daily_V4R1.daymet_v4_daily_na_tmax_2010.nc'
hyrax_url = 'https://opendap.earthdata.nasa.gov/collections/C2532426483-ORNL_CLOUD/granules/'
url1 = hyrax_url + filename
session = setup_session(username, password, check_url=hyrax_url)  

ds = xr.open_dataset(url1, engine="pydap", session=session)

The last line returns an error:

ValueError: dimensions ('time',) must have the same length as the number of data dimensions, ndim=2

The issue involves the variable time_bnds. I know that because this works:

tmax_ds = open_url(url1, session=session)  # pydap dataset, used here only to list variable names
DS = []
for var in [var for var in tmax_ds.keys() if var not in ['time_bnds']]:
    DS.append(xr.open_dataset(url1+'?'+var, engine='pydap', session=session))
ds = xr.merge(DS)

I also tried passing decode_times=False, but the error persists. The for loop above works, but I think it is unnecessarily slow (~30 s).

I tried all this with recent versions of xarray (xarray.__version__ in [2024.2, 2024.3]).
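One possible way to avoid the per-variable loop is to request all wanted variables in a single URL. This is only a sketch: it assumes the server accepts a DAP2-style constraint expression with a comma-separated projection list (?var1,var2,...). build_projection_url is a hypothetical helper, not part of pydap or xarray, and the base URL below is a placeholder. Whether xarray's pydap engine accepts the resulting URL for this particular dataset is untested.

```python
def build_projection_url(base_url, variables, exclude=()):
    """Return base_url plus a DAP2-style projection: ?var1,var2,...

    Hypothetical helper for illustration; not part of pydap or xarray.
    """
    excluded = set(exclude)
    keep = [name for name in variables if name not in excluded]
    if not keep:
        raise ValueError("no variables left after exclusion")
    return base_url + "?" + ",".join(keep)


# Variable names as reported by pydap for this dataset, plus time_bnds.
names = ["y", "lon", "lat", "time", "x", "tmax",
         "lambert_conformal_conic", "yearday", "time_bnds"]

# Placeholder base URL (assumption); substitute the real granule URL.
url = build_projection_url("https://example.invalid/data.nc", names,
                           exclude=["time_bnds"])
print(url)
```

The resulting string could then be tried in place of url1 in xr.open_dataset(..., engine="pydap", session=session), which would replace the loop and merge with a single call.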

Describe the solution you'd like

I think it would be nice to be able to drop the variable I know I don't want. So something like this:

ds = xr.open_dataset(url1, drop_variables='time_bnds', engine="pydap", session=session)

and only create an xarray.Dataset with the variables I want. However, when I do that I still get the same error as before, which suggests that drop_variables is applied after the xarray.Dataset is created.

Describe alternatives you've considered

This is potentially a backend issue with pydap, which does not take a drop_variables option. But since dropping a variable is a one-liner in pydap and takes less than 1 millisecond, it would be a desirable feature.

For example, I can easily open the dataset and drop the variable with pydap, as shown below:

>>> dataset = open_url(url1, session=session)  # this works
>>> dataset[tuple([var for var in dataset.keys() if var not in ['time_bnds']])]  # this takes < 1 ms
<DatasetType with children 'y', 'lon', 'lat', 'time', 'x', 'tmax', 'lambert_conformal_conic', 'yearday'>

It looks like it would be an easy implementation on the backend, but at the same time I took a look at pydap_.py

https://github.com/pydata/xarray/blob/b80260781ee19bddee01ef09ac0da31ec12c5152/xarray/backends/pydap_.py#L129-L130

and I feel like it could also be implemented at the xarray level by allowing drop_variables, which is already an argument of xarray.open_dataset, to be passed to the PydapDataStore (I guess in both scenarios drop_variables would be passed).
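As a rough illustration of why the order matters, here is a toy model in plain Python. It is not xarray's or pydap's actual code: build_variable, open_toy_dataset, the drop_in_backend flag, and the metadata are all made up for this sketch. It only shows that a malformed variable breaks the open when dropping happens after construction, and is harmless when the names are filtered out first.

```python
def build_variable(name, dims, shape):
    """Toy stand-in for variable construction: dims must match the data rank."""
    if len(dims) != len(shape):
        raise ValueError(
            f"{name}: dimensions {dims} must have the same length as "
            f"the number of data dimensions, ndim={len(shape)}"
        )
    return {"name": name, "dims": dims, "shape": shape}


def open_toy_dataset(raw, drop_variables=(), drop_in_backend=False):
    """Build every variable in raw, dropping names before or after building."""
    drop = set(drop_variables)
    if drop_in_backend:
        # Proposed behaviour: dropped variables are never constructed.
        raw = {k: v for k, v in raw.items() if k not in drop}
    built = {k: build_variable(k, *v) for k, v in raw.items()}
    # Behaviour suggested by the error above: dropping after construction.
    return {k: v for k, v in built.items() if k not in drop}


# Made-up metadata: time_bnds reports one dimension name for 2-D data.
raw = {
    "time": (("time",), (365,)),
    "time_bnds": (("time",), (365, 2)),
}

try:
    open_toy_dataset(raw, drop_variables=["time_bnds"])
    late_drop_failed = False
except ValueError:
    late_drop_failed = True

ds_toy = open_toy_dataset(raw, drop_variables=["time_bnds"], drop_in_backend=True)
print(late_drop_failed, sorted(ds_toy))
```

In this toy model the late drop raises before the filter is ever reached, while the early drop returns a dataset containing only time.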

Any thoughts or suggestions? I can certainly lead this effort, as I will already be working on enabling the DAP4 implementation within pydap.

dcherian commented 6 months ago

Passing drop_variables down to the backend seems like a good idea but will take some effort to implement across all backends.

Do you know why it's only reporting one dimension name for a 2D variable?

Mikejmnez commented 6 months ago

> Passing drop_variables down to the backend seems like a good idea but will take some effort to implement across all backends.

Yeah, that sounds totally fair.

> Do you know why it's only reporting one dimension name for a 2D variable?

I think the problem is with time_bnds[time, nv], and particularly the dimension nv=[0,1]. nv is listed as a global dimension in the attributes, but it is not actually defined in the array. That is why dropping time_bnds also gets rid of the problem. Some of these older files frustratingly do that (this one is from 2010) and, from what I understand, because they are used for validation tests it is hard to change them.

FYI: one can always inspect the metadata of a file by appending .dmr or .html to the filename (for NASA files you may have to log in first via Earthdata): https://opendap.earthdata.nasa.gov/collections/C2532426483-ORNL_CLOUD/granules/Daymet_Daily_V4R1.daymet_v4_daily_na_tmax_2010.nc.dmr
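The suffix trick can be wrapped in a small helper. This is a minimal sketch: metadata_urls is a hypothetical name, not a pydap or xarray function, and it assumes that any constraint expression after "?" should be stripped before the suffix is appended.

```python
def metadata_urls(data_url):
    """Return the .dmr and .html metadata URLs for an OPeNDAP data URL.

    Hypothetical helper for illustration; assumes any "?..." constraint
    expression should be removed before appending the suffix.
    """
    base = data_url.split("?", 1)[0]
    return {"dmr": base + ".dmr", "html": base + ".html"}


hyrax_url = "https://opendap.earthdata.nasa.gov/collections/C2532426483-ORNL_CLOUD/granules/"
filename = "Daymet_Daily_V4R1.daymet_v4_daily_na_tmax_2010.nc"

urls = metadata_urls(hyrax_url + filename)
print(urls["dmr"])
print(urls["html"])
```

Opening either URL in a browser, after logging in where required, shows the variables and dimensions the server reports, which is where the time_bnds and nv declarations can be checked.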