Open rsignell-usgs opened 4 years ago
Not sure, but I think the h5netcdf engine is the only one that allows for file-like objects (so anything going through fsspec).
Okay @scottyhq, I tried setting engine='h5netcdf', but still got:
OSError: Seek only available in read mode
Thinking about this a little more, it's pretty clear why writing NetCDF to S3 would require seek mode.
I asked @martindurant about supporting seek for writing in fsspec
and he said that would be pretty hard. And in fact, the performance probably would be pretty terrible as lots of little writes would be required.
So maybe it's best just to write netcdf files locally and then push them to S3.
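That "write locally, then push" workflow can be sketched directly. This is a rough sketch, not code from the thread: the dataset is a toy one, and fsspec's "memory" filesystem stands in for S3 so the snippet runs without credentials (swap in `fsspec.filesystem("s3")` and a real bucket path).

```python
import os
import tempfile

import fsspec
import xarray as xr

# Toy dataset standing in for real data.
ds = xr.Dataset({"a": ("x", [1.0, 2.0, 3.0])})

# 1. Write locally, where the netCDF library is free to seek around.
tmp = tempfile.NamedTemporaryFile(suffix=".nc", delete=False)
tmp.close()
ds.to_netcdf(tmp.name)

# 2. Push the finished file to the object store in a single upload.
#    The "memory" filesystem stands in for "s3" here so this runs
#    without credentials; the bucket path is made up.
fs = fsspec.filesystem("memory")
fs.put(tmp.name, "/mybucket/foo2.nc")
os.remove(tmp.name)
```

This avoids seeks against the object store entirely: all the netCDF library's header rewrites happen on the local disk, and the upload is one forward-only stream.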
And to facilitate that, @martindurant merged a PR yesterday to enable simplecache for writing in fsspec, so after doing:
pip install git+https://github.com/intake/filesystem_spec.git
in my environment, this now works:
import xarray as xr
import fsspec
ds = xr.open_dataset('http://geoport.usgs.esipfed.org/thredds/dodsC'
                     '/silt/usgs/Projects/stellwagen/CF-1.6/BUZZ_BAY/2651-A.cdf')

outfile = fsspec.open('simplecache::s3://chs-pangeo-data-bucket/rsignell/foo2.nc',
                      mode='wb', s3=dict(profile='default'))
with outfile as f:
    ds.to_netcdf(f)
(Here I'm telling fsspec to use the AWS credentials in my "default" profile.)
Thanks Martin!!!
I think we should add some documentation on this stuff.
We have "cloud storage buckets" under zarr (https://xarray.pydata.org/en/stable/io.html#cloud-storage-buckets), so maybe a similar section under netCDF?
The write feature for simplecache isn't released yet, of course.
It would be interesting if someone could subclass file and write locally with h5netcdf to see what kind of seeks it does. Is it popping back to some file header to update array sizes? Presumably it would need a fixed-size header to do that. Parquet and other cloud formats put the metadata in the footer exactly for this reason, so you only write once you know everything, and you only ever move forward in the file.
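A quick way to check this without h5netcdf is to wrap a file object and record its seeks. The wrapper and toy writer below are entirely made up for illustration: the writer mimics a format that patches a fixed-size header after the payload is written, which is exactly the backwards-seeking pattern an object store can't support. In principle you could hand the same wrapper to `ds.to_netcdf(...)` to observe where h5netcdf actually seeks.

```python
import io

class SeekLogger(io.BytesIO):
    """In-memory file that records every seek() it receives."""
    def __init__(self):
        super().__init__()
        self.seeks = []

    def seek(self, pos, whence=0):
        self.seeks.append((pos, whence))
        return super().seek(pos, whence)

def write_with_header(f, payload):
    """Toy writer (not a real format): placeholder header, payload, then patch."""
    f.write((0).to_bytes(4, "big"))   # 4-byte placeholder length
    f.write(payload)
    f.seek(0)                          # pop back to the header...
    f.write(len(payload).to_bytes(4, "big"))  # ...and patch in the real length

buf = SeekLogger()
write_with_header(buf, b"data" * 3)
print(buf.seeks)  # [(0, 0)] -- the backwards seek a pure object store can't do
```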
@martindurant, I asked @ajelenak offline and he reminded me that:
File metadata are dispersed throughout an HDF5 [and NetCDF4] file in order to support writing and modifying array sizes at any time of execution
Looking forward to simplecache:: for writing in fsspec=0.7.5!
I’ve run into this as well. It’s not pretty, but my usual workaround is to write to a local temporary file and then upload with fsspec. I can never remember exactly which netCDF engine to use...
I'm closing this; the recommended approach for writing NetCDF to object storage is to write locally, then push.
Is there any reliable way to write an xr.Dataset object as a netCDF file in 2023? I tried using the above approach with fsspec, but I keep getting an OSError: [Errno 29] Seek only available in read mode exception.
I can confirm that something like the following does work, basically automating the "write local and then push" workflow:
import xarray as xr
import fsspec
ds = xr.open_dataset('http://geoport.usgs.esipfed.org/thredds/dodsC'
                     '/silt/usgs/Projects/stellwagen/CF-1.6/BUZZ_BAY/2651-A.cdf')

outfile = fsspec.open('simplecache::gcs://mdtemp/foo2.nc', mode='wb')
with outfile as f:
    ds.to_netcdf(f)
Unfortunately, directly writing to the remote file without a local cached file is not supported, because HDF5 does not write in a linear way.
Thanks, this actually worked for me. It seems that initializing an S3 filesystem beforehand using fs = fsspec.filesystem("s3", ...) and then using it as a context manager via with fs.open(...) as out: data.to_netcdf(out) is what caused the failure.
Would you mind writing out long-hand the version that worked and the version that didn't?
What didn't work:

f = fsspec.filesystem("s3", anon=False)
with f.open("some_bucket/some_remote_destination.nc", mode="wb") as ff:
    xr.open_dataset("some_local_file.nc").to_netcdf(ff)

This results in an OSError: [Errno 29] Seek only available in read mode exception.
Changing the above to

with fsspec.open("simplecache::s3://some_bucket/some_remote_destination.nc", mode="wb") as ff:
    xr.open_dataset("some_local_file.nc").to_netcdf(ff)

fixed it.
How would you go about reading this file once it is saved in S3? I'm currently getting an error:
ValueError: b'CDF\x02\x00\x00\x00\x00' is not the signature of a valid netCDF4 file
Maybe it is netCDF3? xarray is supposed to be able to determine the file type
with fsspec.open("s3://some_bucket/some_remote_destination.nc", mode="rb") as ff:
    ds = xr.open_dataset(ff)
but maybe play with the engine= argument.
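As an aside, the byte string in that ValueError is the file's magic number, which identifies the flavor on its own. A hypothetical helper (the function name is made up) mapping magic bytes to an engine suggestion:

```python
def sniff_netcdf_format(first_bytes):
    """Guess the netCDF flavor from a file's leading magic bytes."""
    if first_bytes.startswith(b"CDF\x01"):
        return "netCDF3 classic -> engine='scipy'"
    if first_bytes.startswith(b"CDF\x02"):
        return "netCDF3 64-bit offset -> engine='scipy'"
    if first_bytes.startswith(b"\x89HDF\r\n\x1a\n"):
        return "netCDF4/HDF5 -> engine='h5netcdf' or 'netcdf4'"
    return "unknown"

# The bytes from the error above identify a netCDF3 64-bit offset file:
print(sniff_netcdf_format(b"CDF\x02\x00\x00\x00\x00"))
```

So the error means a netCDF3 file was written, and a netCDF4-only engine tried to read it.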
Thanks, I tried to make sure it was engine='h5netcdf' when saving, but I'm not sure this worked.
@peterdudfield Could this be an issue with how you wrote out your NetCDF file? When I write to requester pays buckets, my approach using your variables would look like this and incorporates instructions from fsspec for remote write caching:
url = "simplecache::s3://file_path/to/file.nc"

# the important part is using s3={"profile": "your_aws_profile"}
with fsspec.open(url, mode="wb", s3={"profile": "default"}) as ff:
    daymet_sel.to_netcdf(ff)
I haven't had any read issues with files saved this way.
It could be; here are some running notes of mine - https://github.com/openclimatefix/MetOfficeDataHub/issues/65
The method is the same:

with fsspec.open("simplecache::s3://nowcasting-nwp-development/data/test.netcdf", mode="wb") as f:
    dataset.to_netcdf(f, engine='h5netcdf')
I never needed to specify an engine when writing, you only need it when reading the file. I use the engine="scipy" one for reading.
I use the engine="scipy" one for reading.
This is netCDF3, in that case. If that's fine for you, no problem.
What do you mean, this is netcdf3?
using engine="scipy" worked - thank you
scipy only reads/writes netCDF2/3 (https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.netcdf_file.html), which is a very different and simpler format than netCDF4. The latter uses HDF5 as a container, with h5netcdf as the xarray engine. I guess "to_netcdf" is ambiguous.
Based on the docs:

The default format is NETCDF4 if you are saving a file to disk and have the netCDF4-python library available. Otherwise, xarray falls back to using scipy to write netCDF files and defaults to the NETCDF3_64BIT format (scipy does not support netCDF4).

It appears the scipy engine is safe if one does not need to be bothered with specifying engines. By the way, what are the limitations of the netcdf3 standard vs netcdf4?
what are the limitations of the netcdf3 standard vs netcdf4
No compression, encoding or chunking except for the one "append" dimension.
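One way to see this in practice: when xarray writes through scipy, the result is a netCDF3 file, recognizable by its leading "CDF" magic bytes. A small sketch with a toy dataset, relying on the fact that `to_netcdf()` with no target path returns the serialized file as bytes:

```python
import xarray as xr

# Toy dataset for illustration.
ds = xr.Dataset({"a": ("x", [1.0, 2.0, 3.0])})

# With no target path, to_netcdf returns the file contents as bytes.
buf = ds.to_netcdf(engine="scipy")
print(buf[:3])  # b'CDF' -> a netCDF3 file: no compression, encoding or chunking
```

A netCDF4 file would instead start with the HDF5 signature (b'\x89HDF\r\n\x1a\n') and could carry per-variable compression and chunking.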
I am currently running into this on gcs.
Minimal reproducer (running on the LEAP hub with pangeo/pangeo-notebook:2024.05.21):
import xarray as xr
import dask.array as dsa
import os
import fsspec

target_prefix = "gs://leap-scratch/jbusecke/huggingface_test"
ds_split = xr.DataArray(
    dsa.random.random([20, 11, 30, 10], chunks='auto'),
    dims=['x', 'y', 'z', 'time'],
).to_dataset(name='data_a')

filename = 'simplecache::gs://leap-scratch/jbusecke/some_file.nc'

# Gives: ValueError: I/O operation on closed file
outfile = fsspec.open(filename, mode='wb')
with outfile as f:
    ds_split.to_netcdf(f, engine='netcdf4')  # same with 'h5netcdf'

# Also gives: ValueError: I/O operation on closed file
with fsspec.open(filename, mode='wb') as f:
    ds_split.to_netcdf(f, engine='netcdf4')  # same with 'h5netcdf'
The error I am getting here seems different from previous posters?
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[42], line 14
     12 # # Gives: ValueError: I/O operation on closed file
     13 outfile = fsspec.open(filename, mode='wb')
---> 14 with outfile as f:
     15     ds_split.to_netcdf(f, engine='netcdf4') # same with 'h5netcdf'
     17 # Gives I/O operation on closed file

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/core.py:134, in OpenFile.__exit__(self, *args)
    133 def __exit__(self, *args):
--> 134     self.close()

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/core.py:154, in OpenFile.close(self)
    152 if "r" not in self.mode and not f.closed:
    153     f.flush()
--> 154 f.close()
    155 self.fobjects.clear()

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/implementations/cached.py:915, in LocalTempFile.close(self)
    914 def close(self):
--> 915     self.size = self.fh.tell()
    916     if self.closed:
    917         return

ValueError: I/O operation on closed file
Grateful for any advice on how to fix this.
Happy to add something like this (once it's fixed) to the docs as @dcherian suggested above. It seems like an important piece that is missing right now to make it easier to work with cloud storage.
I can confirm that the following change in fsspec works
--- a/fsspec/implementations/cached.py
+++ b/fsspec/implementations/cached.py
@@ -901,7 +901,7 @@ class LocalTempFile:
         self.close()

     def close(self):
-        self.size = self.fh.tell()
+        # self.size = self.fh.tell()
         if self.closed:
             return
Can anyone think of a reason why a local cached file should update its size at the time of upload?
🤷‍♂️
This is the solution I have arrived at to avoid using scipy engine (tested with gcsfs):
import fsspec
import xarray as xr
from upath import UPath

path = UPath("gs://some-bucket/some-file.nc")

# read netcdf
with path.open("rb") as f:
    ds = xr.load_dataset(f, engine="h5netcdf")

# write netcdf
with fsspec.open(f"simplecache::{path.as_uri()}", mode="wb") as f:
    ds.to_netcdf(f.name, engine="h5netcdf")
... the fact that f.name gives you the temporary file name on disk from the fsspec.open context is not obvious, but seems to work 🤷 As a bonus, the same code (inefficiently) works if the file is local as well (e.g. path = UPath("/some/local/file.nc")).
For me, the "I/O operation on a closed file" error seems to be related to reading with open_dataset (lazy) instead of load_dataset (eager)... so no chunking with this solution, but for my purposes here that doesn't matter. If it did, I'd use Zarr 😅
I'm trying to write a netcdf file directly from xarray to S3 object storage. I'm wondering why the scipy engine is getting used instead of the specified netcdf4 engine.

Code sample:

which produces:
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.14.138-114.102.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4

xarray: 0.15.1
pandas: 1.0.3
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: installed
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.1.1.2
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: 2.4.0
bottleneck: None
dask: 2.14.0
distributed: 2.14.0
matplotlib: 3.2.1
cartopy: 0.17.0
seaborn: None
numbagg: None
setuptools: 46.1.3.post20200325
pip: 20.1
conda: None
pytest: 5.4.1
IPython: 7.13.0
sphinx: None