intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License

caching netcdf from DODS #71

Closed aaronspring closed 4 years ago

aaronspring commented 4 years ago

I want to process forecasts from iridl.ldeo.columbia.edu/. There are many models and many variables, as forecasts or hindcasts, and they all follow the same URL pattern. The data is served over DODS, for example at http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/.CESM/.46LCESM1/.hindcast/.ua/dods

I built a catalog, and it works. Because I read the same data repeatedly, I want to use intake caching, but that fails.

SubX.yml:

# http://iridl.ldeo.columbia.edu/SOURCES/.Models/overview.html
plugins:
  source:
    - module: intake_xarray
sources:
  subX:
    description: SubX
    driver: netcdf
    metadata:
      url_origin: http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/
    #cache:
    #  - argkey: urlpath
    #    regex: ''
    #    type: file # tree # dir
    parameters:
      model:
        description: model
        type: str
        default: NCEP
        allowed: [CESM, ECCC, EMC, ESRL, GMAO, NCEP, NRL, RSMAS]
      subdataset:
        description: subdataset
        type: str
        default: CFSv2
        allowed: [
          30LCESM1, 46LCESM1, # CESM
          GEM, GEPS5, GEPS6, #ECCC
          GEFS, #EMC
          FIMr1p1, #ESRL
          CFSv2, #NCEP
          GEOS_V2p1, # GMAO
          NESM, #NRL
          CCSM4, #RSMAS
        ]
      cast:
        description: hindcast or forecast
        type: str
        default: hindcast
        allowed: [hindcast, forecast]
      variable:
        description: variable name
        type: str
        default: ts
        allowed: [ts, zg, va, ua, tas, rlut, pr, hfls, hfss, huss, mrso, psl, rad, ROMI, snc, stx, sty, swe, tasmax, tasmin, uas, vas, wap]
    args:
      urlpath: http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/.{{model}}/.{{subdataset}}/.{{cast}}/.{{variable}}/dods
      chunks: {'S': 30, 'L': 5}

Loading the catalog:

import dask.utils
import intake

obs = intake.open_catalog('SubX.yml')
list(obs)

ds = obs.subX.to_dask()
ds.dims
# Frozen(SortedKeysDict({'S': 27389, 'M': 1, 'X': 360, 'L': 44, 'Y': 181}))
dask.utils.format_bytes(ds.nbytes)
# 314 GB
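
To make the URL pattern concrete, here is a minimal sketch of how the four catalog parameters fill in the `urlpath` template. It uses plain `str.format` as a stand-in for the Jinja-style `{{...}}` substitution in the catalog; the names `URL_TEMPLATE` and `subx_url` are hypothetical and not part of intake.

```python
# Stand-in for the catalog's urlpath template (single braces for str.format).
URL_TEMPLATE = (
    "http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/"
    ".{model}/.{subdataset}/.{cast}/.{variable}/dods"
)


def subx_url(model="NCEP", subdataset="CFSv2", cast="hindcast", variable="ts"):
    """Build the DODS URL for one SubX dataset (defaults mirror the catalog)."""
    return URL_TEMPLATE.format(
        model=model, subdataset=subdataset, cast=cast, variable=variable
    )


# The CESM example URL from the top of this issue:
print(subx_url(model="CESM", subdataset="46LCESM1", variable="ua"))
```

With the catalog itself, the same override would presumably be written as `obs.subX(model='CESM', subdataset='46LCESM1', variable='ua').to_dask()`, assuming the installed intake version accepts user parameters as keyword arguments on the catalog entry.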

The goal would be to first select a subset along S or L and then cache only that subset.

When I uncomment the cache lines, it does not work. I browsed the docs for intake and intake-xarray, but apart from the examples I didn't find much information about how to use caching.

Question: Is the combination of DODS netcdf and caching even theoretically possible? If so, any suggestions for how to configure cache in the catalog?

I checked the code base of intake-xarray, and it seems that all the caching is inherited from intake. Still, I hoped to find an answer here, because I think the question is more closely tied to netcdf and DODS.

martindurant commented 4 years ago

I really don't know the mechanism of netcdf/dods (or opendap) - does xarray or its backend treat such a URL specially? Accessing your URL doesn't actually download any data, it just shows an HTML page about the data, so I guess something else is happening. Caching in Intake, whether the "old" version you are trying to use here or the new version in the fsspec layer, needs the actual URL of the data, so that you can download it and point to the local copy either by path or by file object.

aaronspring commented 4 years ago

Opening the URL directly with xarray works:

import xarray as xr

ds = xr.open_dataset('http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/.CESM/.46LCESM1/.hindcast/.ua/dods')
ds.dims
# Frozen(SortedKeysDict({'S': 887, 'M': 10, 'X': 360, 'L': 45, 'Y': 181, 'P': 2}))

But I don't understand why this works, because in the browser I also get only this HTML page.

OK, so without a URL pointing to an actual file (ending in .nc or similar), caching wouldn't work.
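
A minimal sketch of why the `/dods` URL can open in xarray even though a browser shows only HTML, assuming the server speaks DAP2 (OPeNDAP) in the usual way: a DAP2 client does not fetch the base URL itself but appends suffixes to it, `.dds` for the structure, `.das` for the attributes, and `.dods` plus a constraint expression for each slice of binary data. The helper name `dap2_urls` and the constraint `S[0:1:29]` are hypothetical illustrations, and whether this particular server answers exactly these URLs is an assumption.

```python
def dap2_urls(base_url, constraint=""):
    """Return the URLs a DAP2 client would typically request for base_url.

    constraint is a DAP2 constraint expression such as "S[0:1:29]"
    (variable name followed by [start:stride:stop] per dimension).
    """
    query = f"?{constraint}" if constraint else ""
    return {
        "dds": f"{base_url}.dds",            # dataset structure (variables, shapes)
        "das": f"{base_url}.das",            # attributes (units, long names, ...)
        "dods": f"{base_url}.dods{query}",   # binary data for the requested slice
    }


base = (
    "http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/"
    ".CESM/.46LCESM1/.hindcast/.ua/dods"
)
for kind, url in dap2_urls(base, "S[0:1:29]").items():
    print(kind, url)
```

Because the data arrives as many constrained requests rather than as one downloadable file, there is no single object for a file-based cache to copy.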

martindurant commented 4 years ago

You could maybe use .persist() or .export() as convenient ways to transform the data to zarr and save locally. Not what you were after...

aaronspring commented 4 years ago

I was naively hoping for intake to do that, but I can easily build my own caching system here.
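
One possible shape for such a hand-rolled cache, as a sketch only: subset along S or L first, write the subset to a local netCDF file, and reopen that file on later calls. The function names `cache_path` and `open_cached_subset`, the `subx_cache` directory, and the `opener` argument are all hypothetical; only `xarray.open_dataset`, `Dataset.isel`, and `Dataset.to_netcdf` are real xarray calls.

```python
import hashlib
import json
from pathlib import Path


def cache_path(url, indexers, cache_dir="subx_cache"):
    """Return a deterministic local file name for one (url, subset) pair."""
    # Slices are not JSON-serializable, so store them as [start, stop, step].
    plain = {
        dim: [idx.start, idx.stop, idx.step] if isinstance(idx, slice) else idx
        for dim, idx in indexers.items()
    }
    key = json.dumps({"url": url, "isel": plain}, sort_keys=True)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.nc"


def open_cached_subset(url, indexers, cache_dir="subx_cache", opener=None):
    """Open url, subset it with isel(**indexers), and keep a local netCDF copy."""
    if opener is None:
        import xarray as xr  # imported lazily so cache_path works without xarray

        opener = xr.open_dataset
    path = cache_path(url, indexers, cache_dir)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        # Opening is lazy; the selected range is read when the file is written.
        subset = opener(url).isel(**indexers)
        subset.to_netcdf(path)
    return opener(path)


# Example call (needs xarray and network access, so it is left commented out):
# ds = open_cached_subset(
#     "http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/"
#     ".CESM/.46LCESM1/.hindcast/.ua/dods",
#     {"S": slice(0, 30), "L": slice(0, 5)},
# )
```

The cache key hashes both the URL and the selection, so different subsets of the same dataset get separate files. Writing with `Dataset.to_zarr` instead of `to_netcdf` would follow the same pattern.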

martindurant commented 4 years ago

Probably someone at xarray can think about how to do this automatically.