intake / intake-esm

An intake plugin for parsing an Earth System Model (ESM) catalog and loading assets into xarray datasets.
https://intake-esm.readthedocs.io
Apache License 2.0
130 stars 42 forks source link

Cannot open coordinate variables from Intake-ESM datastores #660

Open dougiesquire opened 3 months ago

dougiesquire commented 3 months ago

Description

I'd like to include coordinate variables in my Intake-ESM datastore so that they can be searched for and opened like any other variable. However, Intake-ESM currently only allows opening the xarray data_vars from an asset. If I try to open a coordinate variable from my datastore I am returned an empty xarray Dataset.

The obvious fix is to edit the line in the link above to include any variable in ds, but there may be a reason I don't understand for why things are done how they are currently.

What I Did

To reproduce requires a multi-variable catalog that includes a coordinate variable in the lists of variables. E.g. one can edit the first row of tests/sample-catalogs/multi-variable-catalog.csv to include "TLAT" in the list of variables. Then:

import intake

cat = intake.open_esm_datastore(
    "tests/sample-catalogs/multi-variable-catalog.json",
    columns_with_iterables=["variable"]
)

cat.search(variable="TLAT").to_dask()

returns a Dataset that does not contain "TLAT"

Version information: output of intake_esm.show_versions()

INSTALLED VERSIONS ------------------ cftime: 1.6.3 dask: 2024.2.1 fastprogress: 1.0.3 fsspec: 2024.2.0 gcsfs: 2024.2.0 intake: 0.7.0 intake_esm: 2023.6.14.post31+g141425b netCDF4: 1.6.5 pandas: 2.2.1 requests: 2.31.0 s3fs: 2024.2.0 xarray: 2024.2.0 zarr: 2.17.0
andersy005 commented 3 months ago

👋🏽 @dougiesquire,

the primary reason why we are using ds.data_vars instead of ds.variables is that we don't want xarray to perform unnecessary and expensive checks on the coordinates when merging or concatenating datasets. In the past, merging and concatenating model data produced on different machines often failed due to roundoff errors in coordinates, which was a considerable issue. by using data_vars(), we were able to bypass these unnecessary coordinate comparisons (e.g. when concatenating two variables from same model but with compatible grids) in xarray. however, I'm not sure if this is still necessary.

So, I'm in favor of trying to relax this requirement to see what happens. another option would be to introduce a configurable setting that allows users to turn on the use of ds.variables in place of ds.data_vars. this latter approach is likely better if we aim to ensure backward compatibility.

dougiesquire commented 3 months ago

Hello and thanks @andersy005! I could well be misunderstanding, but I think using ds.variables would only change things for users who have included coordinate variables in their catalog (and presumably want to be able to open them). I don’t understand why there would be any performance implications: variables is constructed only from variables included in the catalog, so I don’t think any additional coordinates should be loaded unless they are explicitly asked for from the catalog. FWIW, changing to ds.variables doesn’t appear to make the tests run any slower.

If you agree, I advocate that we don’t add the additional complexity of making this change configurable.

mgrover1 commented 2 months ago

@dougiesquire - checking back in here, are you interested in making a pull request with the fix you mentioned?