intake / intake

Intake is a lightweight package for finding, investigating, loading and disseminating data.
https://intake.readthedocs.io/
BSD 2-Clause "Simplified" License

Remote NetCDF in Zip #846

Open pgierz opened 1 month ago

pgierz commented 1 month ago

Hello,

I'm having a bit of trouble finding an example in the handbook, so I thought I'd ask here.

I'm trying to implement opening a remote zip file containing NetCDF files. So far I have:

>>> import intake
>>> import fsspec
>>> import xarray as xr
>>> zipfs = fsspec.open("zip::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip", "r")
>>> zipfs.fs.glob("*/*nc")
['Supp/Green_Huber_eocene_tidal_dissipation_1x1.nc',
 'Supp/herold_etal_eocene_CAM4_BAM_aerosols.nc',
 'Supp/herold_etal_eocene_CAM4_BAM_optical_depth_1x1.nc',
 'Supp/herold_etal_eocene_biome_1x1.nc',
 'Supp/herold_etal_eocene_runoff_1x1.nc',
 'Supp/herold_etal_eocene_sewall_biomes_1x1.nc',
 'Supp/herold_etal_eocene_topo_1x1.nc',
 'Supp/herold_etal_stddev_subgrid_etopo1_to_eocene_1x1.nc']

That matches with what I get by hand:

❯ wget https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip 
--2024-09-18 08:50:14--  https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip
Resolving gmd.copernicus.org (gmd.copernicus.org)... 81.3.21.103
Connecting to gmd.copernicus.org (gmd.copernicus.org)|81.3.21.103|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 197731560 (189M) [application/zip]
Saving to: ‘gmd-7-2077-2014-supplement.zip’

gmd-7-2077-2014-supplement.zip             100%[=======================================================================================>] 188.57M  6.38MB/s    in 34s     

~ via 🅒 paul_sandbox took 34s 
❯ zipinfo gmd-7-2077-2014-supplement.zip  
Archive:  gmd-7-2077-2014-supplement.zip
Zip file size: 197731560 bytes, number of entries: 14
drwx---     2.0 fat        0 bx stor 14-Jul-11 09:50 Supp/
-rw-a--     2.0 fat   717606 bx defN 13-Dec-07 11:29 Supp/Fig.S1.pdf
-rw-a--     2.0 fat   148028 bx defN 13-Nov-26 05:51 Supp/Fig.S2.pdf
-rw-a--     2.0 fat  1039628 bx defN 13-Dec-07 08:46 Supp/Green_Huber_eocene_tidal_dissipation_1x1.nc
-rw-a--     2.0 fat    27875 bx defN 14-Jul-11 09:15 Supp/herold_etal_eocene_boundary_conditions_supplementary.pdf
-rw-a--     2.0 fat 224954636 bx defN 13-Dec-07 11:56 Supp/herold_etal_eocene_CAM4_BAM_aerosols.nc
-rw-a--     2.0 fat   262056 bx defN 14-Jul-08 16:39 Supp/herold_etal_eocene_CAM4_BAM_optical_depth_1x1.nc
-rw-a--     2.0 fat   778584 bx defN 14-Apr-28 23:52 Supp/herold_etal_eocene_runoff_1x1.nc
-rw-a--     2.0 fat   521208 bx defN 13-Oct-30 05:38 Supp/herold_etal_eocene_sewall_biomes_1x1.nc
-rw-a--     2.0 fat   262428 bx defN 14-Apr-18 00:46 Supp/herold_etal_eocene_topo_1x1.nc
-rw-a--     2.0 fat     7369 bx defN 14-Apr-23 05:53 Supp/herold_etal_stddev_subgrid_topo_regression.ncl
-rw-a--     2.0 fat  1041336 b- defN 14-Apr-22 00:08 Supp/herold_etal_stddev_subgrid_etopo1_to_eocene_1x1.nc
-rw-a--     2.0 fat  1299144 b- defN 14-Apr-17 01:09 Supp/herold_etal_eocene_biome_1x1.nc
-rw-a--     2.0 fat    17822 b- defX 14-Sep-16 08:14 supplement-cover-letter.pdf
14 files, 231077720 bytes uncompressed, 197729026 bytes compressed:  14.4%

If I now try to open one of these by hand:

>>> f = fsspec.open("zip::Supp/herold_etal_eocene_topo_1x1.nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip")
>>> xr.open_dataset(f)

I get a FileNotFoundError, since it is looking on the local disk. What am I doing wrong?

The URL is public, so this example should work for anyone. Ultimately I'd like to embed this into an intake catalogue written in YAML. Any hints there would be great :-)

Something like this? The below is just a guess, it doesn't work:

metadata:
    version: 1
    description: "DeepMIP Input Files Catalog"
    # Define datasets
sources:
  herold-2014:
    description: "NetCDF files from a remote Zip"
    driver: netcdf
    args:
      urlpath: zip::**/*nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip

martindurant commented 1 month ago

The URL should be:

"zip://**/*nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip"

However, you should be aware that the netcdf driver (h5netcdf/h5py) does a lot of seeking while reading a file, and this will work really badly over ZIP unless the contained files are small and fit into the read cache (5MB by default). In the case of big files within a remote ZIP, you will need to make some decisions about how to cache them.
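
For the caching decision mentioned above, one option is to add fsspec's `simplecache` layer to the URL chain. It downloads the whole archive (about 189M here) to local disk once, so the seeks then hit the local copy rather than the network. The following is a minimal sketch, assuming fsspec and xarray are installed; `chained_url`, `open_cached` and the temporary cache directory are illustrative names, not part of intake or fsspec:

```python
import tempfile

ARCHIVE = "https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip"


def chained_url(member, archive=ARCHIVE, cache="simplecache"):
    """Build 'zip://<member>::[<cache>::]<archive>' (hypothetical helper).

    The member path (or glob) goes after 'zip://'; each further layer
    of the chain is separated by '::'.
    """
    parts = [f"zip://{member}"]
    if cache:
        parts.append(cache)
    parts.append(archive)
    return "::".join(parts)


def open_cached(member, cache_dir=None):
    """Open one NetCDF member of the remote ZIP through a local disk cache."""
    # Imported lazily so the URL helper above works without these packages.
    import fsspec
    import xarray as xr

    cache_dir = cache_dir or tempfile.mkdtemp(prefix="zip-cache-")
    url = chained_url(member, cache="simplecache")
    # Keyword arguments named after a protocol are passed to that layer.
    with fsspec.open(url, simplecache={"cache_storage": cache_dir}) as f:
        # Load into memory before the file is closed at the end of the block.
        return xr.open_dataset(f).load()
```

Calling `open_cached("Supp/herold_etal_eocene_topo_1x1.nc")` would trigger the download. Passing a fixed `cache_dir` lets later sessions reuse the downloaded archive instead of fetching it again.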

pgierz commented 1 month ago

Thanks @martindurant! That helped. I'm still struggling with the YAML syntax for the catalog definition at the moment. So far I have:

version: 2
data:
  topo:
    user_parameters: {}
    description: "Specific NetCDF file from remote Zip"
    driver: netcdf
    args:
      urlpath: "zip://Supp/herold_etal_eocene_topo_1x1.nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip"
    metadata:
      file_in_zip: "Supp/herold_etal_eocene_topo_1x1.nc"
      origin: "Remote Zip"
  all_files:
    user_parameters: {}
    description: "Everything as a mfdataset"
    driver: netcdf
    args:
      urlpath: "zip://**/*.nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip"
    metadata: {}
user_parameters: {}
aliases: {}
metadata: {}
entries: {}

I'm trying to understand the handbook, but can't seem to make sense of it. I'm more than happy to contribute my example to readthedocs, if it helps other users.

Caching is a separate problem; I'll need to learn more about intake's internals to figure out what to do there. Ideally I'd like to cache to disk (but automatically, so the end-user doesn't necessarily see this).

martindurant commented 1 month ago

Hm, I have had a go and while the following xarray calls work:

In [152]: with fsspec.open("zip://Supp/herold_etal_eocene_topo_1x1.nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip") as f:
     ...:     ds = xr.open_dataset(f)

In [153]: with fsspec.open_files("zip://**/*.nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip") as ofs:
     ...:     ds = xr.open_mfdataset(ofs)
     ...:

just passing those URLs to xarray (which is what we do) does not work by itself. The code could make the same call, but the files need to be kept open, rather than being closed when the block ends.
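
One way to keep the files open past the end of a `with` block is to enter them on a `contextlib.ExitStack` and close that stack explicitly later. This is only a sketch of the idea, not what the intake implementation does; `KeptOpen` and `open_all_netcdf` are hypothetical names, and whether `open_mfdataset` can combine these particular files depends on the data:

```python
from contextlib import ExitStack


class KeptOpen:
    """Hold a set of context-managed files open until close() is called."""

    def __init__(self, open_files):
        self._stack = ExitStack()
        # Entering each context yields the file-like object and registers
        # its cleanup on the stack instead of running it immediately.
        self.files = [self._stack.enter_context(of) for of in open_files]

    def close(self):
        """Close every file that was opened through this holder."""
        self._stack.close()


def open_all_netcdf(url):
    """Open all files matching a chained URL; return (dataset, holder)."""
    # Imported lazily; assumes fsspec, xarray and dask are installed.
    import fsspec
    import xarray as xr

    holder = KeptOpen(fsspec.open_files(url))
    ds = xr.open_mfdataset(holder.files)
    # The caller must call holder.close() once finished with ds.
    return ds, holder
```

The caller keeps `holder` alive for as long as the dataset is in use and calls `holder.close()` afterwards.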

martindurant commented 1 month ago

I made some changes that might be useful, and the following works:

In [1]: import intake

In [2]: s = "zip://**/*nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip"

In [3]: data = intake.readers.NetCDF3("zip://Supp/herold_etal_eocene_topo_1x1.nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip")

In [4]: reader = intake.readers.XArrayDatasetReader(data)

In [5]: cat = intake.Catalog()

In [6]: cat["all_files"] = reader

In [7]: cat.to_dict()
Out[7]:
{'version': 2,
 'data': {'ec3a3039f33954ac': {'datatype': 'intake.readers.datatypes:NetCDF3',
   'kwargs': {'url': 'zip://Supp/herold_etal_eocene_topo_1x1.nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip',
    'storage_options': None},
   'metadata': {},
   'user_parameters': {}}},
 'aliases': {},
 'metadata': {},
 'user_parameters': {},
 'entries': {'all_files': {'reader': 'intake.readers.readers:XArrayDatasetReader',
   'kwargs': {'args': ['{data(ec3a3039f33954ac)}']},
   'output_instance': 'xarray:Dataset',
   'user_parameters': {},
   'metadata': {}}}}

(or .to_yaml_file to save it)
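
Saving and reloading could look roughly like the sketch below. It assumes an Intake 2 installation that includes the changes mentioned above and that `Catalog.from_yaml_file` is available there; the helper names, the entry name `topo` and the file name `catalog.yaml` are made up for illustration:

```python
TOPO_URL = (
    "zip://Supp/herold_etal_eocene_topo_1x1.nc::"
    "https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip"
)


def build_catalog(url=TOPO_URL, name="topo"):
    """Build a one-entry catalog, following the session above."""
    import intake  # imported lazily; assumes Intake 2

    data = intake.readers.NetCDF3(url)
    reader = intake.readers.XArrayDatasetReader(data)
    cat = intake.Catalog()
    cat[name] = reader
    return cat


def save_and_reload(cat, path="catalog.yaml"):
    """Write the catalog to YAML and read it back (assumed API)."""
    import intake

    cat.to_yaml_file(path)
    return intake.Catalog.from_yaml_file(path)
```

With that in place, `save_and_reload(build_catalog())["topo"].read()` would be expected to produce the same dataset as calling `reader.read()` directly.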

reader.read() returns:

<xarray.Dataset>
Dimensions:  (lat: 180, lon: 360)
Coordinates:
  * lat      (lat) float32 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
  * lon      (lon) float32 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
Data variables:
    topo     (lat, lon) float32 ...
Attributes:
    title:    Eocene topography and bathymetry
    desc:     Derived predominantly from the early Eocene topography of Markw...

martindurant commented 1 month ago

Implementation: https://github.com/intake/intake/pull/848