Open pgierz opened 1 month ago
The URL should be:
"
zip://**/*nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip
"
However, you should be aware that the netcdf driver (h5netcdf/h5py) does a lot of seeking while reading a file, and this will work really badly over ZIP unless the contained files are small and fit into the read cache (5MB by default). In the case of big files within a remote ZIP, you will need to make some decisions about how to cache them.
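One way to approach the caching decision mentioned above is fsspec's URL chaining, where a caching layer such as `simplecache` sits between the `zip://` layer and the remote URL so that the whole archive is downloaded once to local disk. The following is a minimal sketch under stated assumptions: `chained_zip_url` and `open_cached_member` are hypothetical helper names invented for illustration, and whether intake passes such a chained URL through unchanged is an assumption that has not been checked here.

```python
from typing import Optional


def chained_zip_url(member: str, remote_zip: str, cache: Optional[str] = None) -> str:
    """Build an fsspec chained URL for a member (or glob) inside a remote ZIP.

    `cache` is an optional fsspec caching protocol name such as
    "simplecache", placed between the zip layer and the remote URL.
    (Hypothetical helper, for illustration only.)
    """
    parts = [f"zip://{member}"]
    if cache:
        parts.append(cache)
    parts.append(remote_zip)
    return "::".join(parts)


def open_cached_member(url: str, cache_dir: str):
    """Return an fsspec OpenFile for a URL that contains a simplecache layer.

    Assumes fsspec and an HTTP backend are installed; the ZIP is stored
    under `cache_dir` on first access.
    """
    import fsspec  # imported lazily so the URL helper works without fsspec

    return fsspec.open(url, simplecache={"cache_storage": cache_dir})


REMOTE = "https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip"

single = chained_zip_url("Supp/herold_etal_eocene_topo_1x1.nc", REMOTE, cache="simplecache")
everything = chained_zip_url("**/*.nc", REMOTE)
print(single)
print(everything)
```

With the archive cached locally, the heavy seeking done by h5netcdf/h5py hits the disk copy instead of issuing many small HTTP range requests.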
Thanks @martindurant! That helped. I'm still struggling with the YAML syntax for the catalog definition at the moment. So far I have:
version: 2
data:
  topo:
    user_parameters: {}
    description: "Specific NetCDF file from remote Zip"
    driver: netcdf
    args:
      urlpath: "zip://Supp/herold_etal_eocene_topo_1x1.nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip"
    metadata:
      file_in_zip: "Supp/herold_etal_eocene_topo_1x1.nc"
      origin: "Remote Zip"
  all_files:
    user_parameters: {}
    description: "Everything as a mfdataset"
    driver: netcdf
    args:
      urlpath: "zip://**/*.nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip"
    metadata: {}
user_parameters: {}
aliases: {}
metadata: {}
entries: {}
I'm trying to understand the handbook, but can't seem to make sense of it. I'm more than happy to contribute my example to readthedocs, if it helps other users.
Caching is a separate problem; I'll need to learn more about intake's internals to figure out what to do there. Ideally I'd like to cache to disk, but automatically, so that the end user doesn't necessarily see it.
Hm, I have had a go and while the following xarray calls work:
In [152]: with fsspec.open("zip://Supp/herold_etal_eocene_topo_1x1.nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip") as f:
     ...:     ds = xr.open_dataset(f)

In [153]: with fsspec.open_files("zip://**/*.nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip") as ofs:
     ...:     ds = xr.open_mfdataset(ofs)
     ...:
just passing those URLs to xarray (which is what we do) does not work by itself. The code could make the same call, but the files need to be kept open, rather than being closed when the block ends.
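The point about keeping files open can be illustrated without the network. This is a minimal sketch using only the standard library: an in-memory ZIP stands in for the remote archive, and `LazyReader` is a hypothetical stand-in for a lazily loaded dataset, not part of xarray, fsspec, or intake. It is not the change made in intake, only a demonstration of why a handle closed at the end of a `with` block breaks later reads.

```python
import io
import zipfile
from contextlib import ExitStack

# Build a small in-memory ZIP standing in for the remote archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Supp/a.nc", b"payload")


class LazyReader:
    """Hypothetical stand-in for a lazily loaded dataset: reads only on demand."""

    def __init__(self, fileobj):
        self.fileobj = fileobj

    def load(self) -> bytes:
        return self.fileobj.read()


archive = zipfile.ZipFile(buf)

# Pattern 1: the member is closed when the block ends, so a later read fails.
with archive.open("Supp/a.nc") as f:
    lazy = LazyReader(f)
try:
    lazy.load()
    closed_read_failed = False
except ValueError:
    closed_read_failed = True

# Pattern 2: an ExitStack owns the member, so it stays open until stack.close().
stack = ExitStack()
f = stack.enter_context(archive.open("Supp/a.nc"))
lazy = LazyReader(f)
data = lazy.load()
stack.close()

print(closed_read_failed, data)
```

With fsspec the analogous move would be to call `.open()` on the object returned by `fsspec.open(url)` and close the resulting file explicitly once the dataset is no longer needed, instead of relying on a `with` block.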
I made some changes that might be useful, and the following works:
In [1]: import intake
In [2]: s = "zip://**/*nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip"
In [3]: data = intake.readers.NetCDF3("zip://Supp/herold_etal_eocene_topo_1x1.nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip")
In [4]: reader = intake.readers.XArrayDatasetReader(data)
In [5]: cat = intake.Catalog()
In [6]: cat["all_files"] = reader
In [7]: cat.to_dict()
Out[7]:
{'version': 2,
 'data': {'ec3a3039f33954ac': {'datatype': 'intake.readers.datatypes:NetCDF3',
   'kwargs': {'url': 'zip://Supp/herold_etal_eocene_topo_1x1.nc::https://gmd.copernicus.org/articles/7/2077/2014/gmd-7-2077-2014-supplement.zip',
    'storage_options': None},
   'metadata': {},
   'user_parameters': {}}},
 'aliases': {},
 'metadata': {},
 'user_parameters': {},
 'entries': {'all_files': {'reader': 'intake.readers.readers:XArrayDatasetReader',
   'kwargs': {'args': ['{data(ec3a3039f33954ac)}']},
   'output_instance': 'xarray:Dataset',
   'user_parameters': {},
   'metadata': {}}}}
(or .to_yaml_file to save it)
reader.read()
gets
<xarray.Dataset>
Dimensions:  (lat: 180, lon: 360)
Coordinates:
  * lat      (lat) float32 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
  * lon      (lon) float32 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
Data variables:
    topo     (lat, lon) float32 ...
Attributes:
    title:    Eocene topography and bathymetry
    desc:     Derived predominantly from the early Eocene topography of Markw...
Implementation: https://github.com/intake/intake/pull/848
Hello,
I'm having a bit of trouble finding an example in the handbook, so I thought I'd ask here.
I'm trying to implement opening a remote zip file containing NetCDF files. So far I have:
That matches with what I get by hand:
If I now try to open one of these by hand:
I get a FileNotFoundError, since it is looking on the local disk. What am I doing wrong?
The URL is public, so this example should work for anyone. Ultimately I'd like to embed this into an intake catalogue written in YAML. Any hints there would be great :-)
Something like this? The below is just a guess, it doesn't work: