matt-long closed this issue 4 years ago
Not sure if this helps, but I have successfully combined files with several variables using intake_esm, in the function _delayed_stack_forecasts here: https://bitbucket.csiro.au/users/bra467/repos/bluelink_intake_esm/browse/intake_bluelink.py
It is pretty long as I tried to write a generic function for our specific model forecasts that deals with
The disadvantage of this approach is that the catalog entries don't contain a list of the variables available in each file, so a user needs to open one file first to see what is in it.
I use intake-esm with the preprocess argument on multi-variable files to extract one variable only. But concatenating these larger files also takes more time.
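For anyone wanting to try the same idea, here is a minimal sketch of a preprocess function that keeps a single variable from each multi-variable file. It uses xarray's open_mfdataset directly rather than intake-esm, so how the callable is passed through intake-esm is not shown. The names keep_variable and open_single_variable, and the variables TEMP and SALT, are made up for illustration.

```python
from functools import partial

import numpy as np
import xarray as xr


def keep_variable(ds, varname):
    """Return a dataset holding only `varname` (plus its coordinates)."""
    if varname not in ds.data_vars:
        raise KeyError(f"{varname!r} not found; available: {sorted(ds.data_vars)}")
    # Indexing with a list returns a Dataset rather than a DataArray.
    return ds[[varname]]


def open_single_variable(paths, varname):
    """Hypothetical helper: concatenate multi-variable files, keeping one variable.

    The preprocess callable is applied to each file before concatenation,
    so the unwanted variables never enter the combined dataset.
    """
    return xr.open_mfdataset(
        paths,
        preprocess=partial(keep_variable, varname=varname),
        combine="by_coords",
    )


# In-memory stand-in for the contents of one multi-variable file.
ds = xr.Dataset(
    {
        "TEMP": (("time", "x"), np.zeros((2, 3))),
        "SALT": (("time", "x"), np.ones((2, 3))),
    },
    coords={"time": [0, 1]},
)
subset = keep_variable(ds, "TEMP")
```

Every file still has to be opened to apply the function, which is probably where the extra concatenation time comes from.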
There is a widespread assumption in intake-esm that there is one variable per file. This precludes using the package with multi-variable files, such as those written directly by CESM.
Automated "collection" generation could be very complicated with multi-variable files. For instance, right now, we get variable names from the directory structure or file name. Opening each file to get a list of variables could be very time-consuming—and near impossible for remote resources like HPSS.
I suspect, however, that it's not too difficult to extend the code to use collections built with multi-variable files—though there may be limitations associated with concatenating many files into a single dataset. If we can imagine relying on external information to build collections, perhaps we can start addressing this issue by redesigning the collection structure and aggregate.py to accommodate multiple variables per file.
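As a rough sketch of what "relying on external information" could look like: the variable list for each output stream is supplied by hand, and collection entries are built from file names alone, without opening any file. Everything here is an assumption for illustration (the STREAM_VARIABLES table, the build_entries function, and the file names); none of it is intake-esm's actual API.

```python
# Hypothetical external metadata: which variables each output stream writes.
STREAM_VARIABLES = {
    "pop.h": ["TEMP", "SALT"],
    "cam.h0": ["T", "PS"],
}


def build_entries(file_paths, stream_variables):
    """Build one collection entry per file using only the file name."""
    # Try longer stream names first so that a more specific name wins
    # over a shorter one that is a prefix of it.
    streams = sorted(stream_variables, key=len, reverse=True)
    entries = []
    for path in file_paths:
        stream = next((s for s in streams if f".{s}." in path), None)
        entries.append(
            {
                "path": path,
                "stream": stream,
                # Unknown streams get an empty variable list.
                "variables": list(stream_variables.get(stream, [])),
            }
        )
    return entries


entries = build_entries(
    ["case01.pop.h.0001-01.nc", "case01.cam.h0.0001-01.nc", "case01.other.nc"],
    STREAM_VARIABLES,
)
```

Because nothing is read from disk, this would also work for files on remote storage, at the cost of keeping the stream table in sync with the model configuration.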
In #98, I commented:
Right now, we are using file_basename as a unique identifier. I think a first step would be to redesign the collection so that a definition specifies the list of attributes (i.e., for CESM, something like: experiment, component, stream, variable) that comprises a dataset granule; each granule can have a unique key and methods to get the associated files, be they multi-variable or not, local or remote.
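To make the granule idea concrete, here is a small sketch under assumed names: GRANULE_ATTRS, Granule, build_granules, and the sample records are all hypothetical, not existing intake-esm code. The key is built from the attribute list, and a multi-variable file simply appears in the file list of several granules.

```python
from dataclasses import dataclass, field

# Attributes that together identify a dataset granule (assumed, per the CESM example).
GRANULE_ATTRS = ("experiment", "component", "stream", "variable")


@dataclass
class Granule:
    key: str
    files: list = field(default_factory=list)


def build_granules(records, attrs=GRANULE_ATTRS):
    """Group file records into granules keyed by the chosen attributes."""
    granules = {}
    for rec in records:
        # A multi-variable file contributes to one granule per variable.
        for var in rec["variables"]:
            values = {**rec, "variable": var}
            key = ".".join(str(values[a]) for a in attrs)
            granules.setdefault(key, Granule(key)).files.append(rec["path"])
    return granules


records = [
    {"experiment": "exp1", "component": "ocn", "stream": "pop.h",
     "variables": ["TEMP", "SALT"], "path": "exp1.pop.h.0001-01.nc"},
    {"experiment": "exp1", "component": "ocn", "stream": "pop.h",
     "variables": ["TEMP", "SALT"], "path": "exp1.pop.h.0001-02.nc"},
    {"experiment": "exp1", "component": "atm", "stream": "cam.h0",
     "variables": ["T"], "path": "exp1.cam.h0.T.0001.nc"},
]
granules = build_granules(records)
```

The path here is just a string, so the same structure would hold whether the file is local or remote; the methods for actually fetching files are left out of the sketch.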