intake / intake-esm

An intake plugin for parsing an Earth System Model (ESM) catalog and loading assets into xarray datasets.
https://intake-esm.readthedocs.io
Apache License 2.0

Support multiple variable files #112

Closed matt-long closed 4 years ago

matt-long commented 5 years ago

There is a widespread assumption in intake-esm that there is one variable per file. This precludes using the package with multi-variable files, such as those written directly by CESM.

Automated "collection" generation could be very complicated with multi-variable files. For instance, right now, we get variable names from the directory structure or file name. Opening each file to get a list of variables could be very time-consuming—and near impossible for remote resources like HPSS.
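
To illustrate why single-variable files make this cheap, here is a minimal sketch of reading the variable name from a file name without opening the file. The naming pattern and the example path are assumptions made for illustration; they are not taken from intake-esm's code.

```python
from pathlib import PurePosixPath


def variable_from_filename(path):
    """Return the variable name encoded in a single-variable file name.

    Assumes (hypothetically) names of the form
    <case>.<component>.<stream>.<variable>.<date_range>.nc.
    <case> may itself contain dots, so fields are counted from the right.
    """
    parts = PurePosixPath(path).name.split(".")
    if len(parts) < 4 or parts[-1] != "nc":
        raise ValueError(f"unexpected file name: {path}")
    return parts[-3]


# Hypothetical path, for illustration only.
example = "/data/ocn/b.e11.B20TRC5CNBDRD.f09_g16.001.pop.h.TEMP.192001-200512.nc"
variable = variable_from_filename(example)
```

A multi-variable file has no such field in its name, which is why the variable list would have to come from opening the file or from some external source.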

I suspect, however, that it's not too difficult to extend the code to use collections built with multi-variable files—though there may be limitations associated with concatenating many files into a single dataset. If we can imagine relying on external information to build collections, perhaps we can start addressing this issue by redesigning the collection structure and aggregate.py to accommodate multiple variables per file.

In #98, I commented:

Some collection columns are user-facing attributes of the dataset; the CESM-LE should have the same user-facing attributes regardless of the platform. Other collection columns include details about how we're accessing the data. We might consider separating these more explicitly and the structures we build to do this may alleviate a need for separate treatment.

Right now, we are using file_basename as a unique identifier. I think a first step would be to redesign the collection so that a definition specifies the list of attributes (i.e., for CESM, something like: experiment, component, stream, variable) that comprises a dataset granule; each granule can have a unique key and methods to get the associated files, be they multi-variable or not, local or remote.
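
A rough sketch of what such a granule definition might look like, using only the standard library. The names `GRANULE_ATTRS`, `Granule`, and `build_granules`, the row layout, and the file names are all hypothetical; the per-file variable lists are assumed to come from external information rather than from opening the files.

```python
from dataclasses import dataclass, field

# Hypothetical definition: the attributes that together identify one granule.
GRANULE_ATTRS = ("experiment", "component", "stream", "variable")


@dataclass
class Granule:
    key: str
    files: list = field(default_factory=list)


def build_granules(rows, attrs=GRANULE_ATTRS):
    """Group catalog rows into granules keyed by the chosen attributes.

    Each row is a dict holding the attribute columns, a "path", and a
    "variables" list. A multi-variable file is registered under every
    variable it contains, so it can belong to several granules.
    """
    granules = {}
    for row in rows:
        for variable in row["variables"]:
            values = {**row, "variable": variable}
            key = ".".join(str(values[a]) for a in attrs)
            granules.setdefault(key, Granule(key)).files.append(row["path"])
    return granules


# Hypothetical catalog rows: one multi-variable file, one single-variable file.
rows = [
    {"experiment": "20C", "component": "ocn", "stream": "pop.h",
     "variables": ["TEMP", "SALT"], "path": "multi_0001.nc"},
    {"experiment": "20C", "component": "ocn", "stream": "pop.h",
     "variables": ["TEMP"], "path": "temp_0002.nc"},
]
granules = build_granules(rows)
```

With this layout the user-facing key stays the same whether the files behind it are multi-variable or not; only the list of files attached to the granule differs.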

pbranson commented 5 years ago

Not sure if this helps, but I have successfully combined files with several variables using intake_esm in the function _delayed_stack_forecasts here https://bitbucket.csiro.au/users/bra467/repos/bluelink_intake_esm/browse/intake_bluelink.py

It is pretty long, as I tried to write a generic function for our specific model forecasts that deals with the following:

  1. All variables are in each file
  2. Forecasts that may be split into one time step per file or hold multiple time steps per file
  3. The overlap across forecasts
  4. Subsetting of the variables returned (data_vars parameter)
  5. Cropping of variables upon open using .sel with a dictionary (bbox parameter)
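
Items 4 and 5 can be sketched roughly as follows with xarray. This is not the code from the linked repository; `subset_and_crop` is a hypothetical helper, and the synthetic dataset, variable names, and coordinate names are assumptions made for illustration.

```python
import numpy as np
import xarray as xr


def subset_and_crop(ds, data_vars=None, bbox=None):
    """Keep only the requested variables and crop with label-based .sel.

    data_vars: list of variable names to keep (None keeps all).
    bbox: dict mapping coordinate names to slices (None skips cropping).
    """
    if data_vars is not None:
        ds = ds[list(data_vars)]
    if bbox is not None:
        ds = ds.sel(bbox)
    return ds


# Small synthetic dataset standing in for one multi-variable forecast file.
lat = np.arange(-40.0, -29.0, 1.0)  # -40 ... -30
lon = np.arange(110.0, 121.0, 1.0)  # 110 ... 120
shape = (lat.size, lon.size)
ds = xr.Dataset(
    {
        "temp": (("lat", "lon"), np.zeros(shape)),
        "salt": (("lat", "lon"), np.ones(shape)),
    },
    coords={"lat": lat, "lon": lon},
)

cropped = subset_and_crop(
    ds,
    data_vars=["temp"],
    bbox={"lat": slice(-35.0, -32.0), "lon": slice(112.0, 115.0)},
)
```

Label-based slices in `.sel` include both endpoints, so the crop above keeps four points along each axis.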

The disadvantage of this approach is that the catalog entries don't contain a list of the variables available in each file, so a user needs to open one file first to see what is in it.

aaronspring commented 4 years ago

I use intake-esm with the preprocess argument on multi-variable files to extract only one variable. But concatenating these larger files also takes more time.
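
A minimal sketch of that pattern with plain xarray, assuming small synthetic datasets in place of real multi-variable files. The helper names `make_preprocess` and `fake_file` and the variable names are hypothetical.

```python
import numpy as np
import xarray as xr


def make_preprocess(variable):
    """Return a callable that reduces a dataset to a single variable."""
    def preprocess(ds):
        return ds[[variable]]
    return preprocess


def fake_file(start):
    """Synthetic stand-in for one multi-variable file with two time steps."""
    time = np.arange(start, start + 2)
    return xr.Dataset(
        {"TEMP": ("time", np.zeros(2)), "SALT": ("time", np.ones(2))},
        coords={"time": time},
    )


preprocess = make_preprocess("TEMP")
# Apply the preprocess step to each "file", then concatenate along time.
pieces = [preprocess(fake_file(start)) for start in (0, 2, 4)]
combined = xr.concat(pieces, dim="time")
```

With files on disk, the same callable could be passed as `xr.open_mfdataset(paths, preprocess=preprocess)`, which applies it to each file before combining. Each file still has to be opened in full, which is consistent with the extra concatenation time noted above.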