pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Document writing netcdf from xarray directly to S3 #4122

Open · rsignell-usgs opened this issue 4 years ago

rsignell-usgs commented 4 years ago

I'm trying to write a netcdf file directly from xarray to S3 object storage. I'm wondering:

  1. Why writing NetCDF files requires a "seek"
  2. Why the scipy engine is getting used instead of the specified netcdf4 engine.
  3. If there are nice workarounds (besides writing the NetCDF file locally, then using the AWS CLI to transfer to S3)

Code sample:

import fsspec
import xarray as xr

ds = xr.open_dataset('http://geoport.usgs.esipfed.org/thredds/dodsC'
                     '/silt/usgs/Projects/stellwagen/CF-1.6/BUZZ_BAY/2651-A.cdf')

outfile = fsspec.open('s3://chs-pangeo-data-bucket/rsignell/test.nc', 
                      mode='wb', profile='default')

with outfile as f:
    ds.to_netcdf(f, engine='netcdf4')

which produces:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-3-024939f31fe4> in <module>
      2                       mode='wb', profile='default')
      3 with outfile as f:
----> 4     ds.to_netcdf(f, engine='netcdf4')

/srv/conda/envs/pangeo/lib/python3.7/site-packages/xarray/core/dataset.py in to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   1552             unlimited_dims=unlimited_dims,
   1553             compute=compute,
-> 1554             invalid_netcdf=invalid_netcdf,
   1555         )
   1556 

/srv/conda/envs/pangeo/lib/python3.7/site-packages/xarray/backends/api.py in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1102     finally:
   1103         if not multifile and compute:
-> 1104             store.close()
   1105 
   1106     if not compute:

/srv/conda/envs/pangeo/lib/python3.7/site-packages/xarray/backends/scipy_.py in close(self)
    221 
    222     def close(self):
--> 223         self._manager.close()

/srv/conda/envs/pangeo/lib/python3.7/site-packages/xarray/backends/file_manager.py in close(***failed resolving arguments***)
    331     def close(self, needs_lock=True):
    332         del needs_lock  # ignored
--> 333         self._value.close()

/srv/conda/envs/pangeo/lib/python3.7/site-packages/scipy/io/netcdf.py in close(self)
    297         if hasattr(self, 'fp') and not self.fp.closed:
    298             try:
--> 299                 self.flush()
    300             finally:
    301                 self.variables = OrderedDict()

/srv/conda/envs/pangeo/lib/python3.7/site-packages/scipy/io/netcdf.py in flush(self)
    407         """
    408         if hasattr(self, 'mode') and self.mode in 'wa':
--> 409             self._write()
    410     sync = flush
    411 

/srv/conda/envs/pangeo/lib/python3.7/site-packages/scipy/io/netcdf.py in _write(self)
    411 
    412     def _write(self):
--> 413         self.fp.seek(0)
    414         self.fp.write(b'CDF')
    415         self.fp.write(array(self.version_byte, '>b').tostring())

/srv/conda/envs/pangeo/lib/python3.7/site-packages/fsspec/spec.py in seek(self, loc, whence)
   1122         loc = int(loc)
   1123         if not self.mode == "rb":
-> 1124             raise OSError("Seek only available in read mode")
   1125         if whence == 0:
   1126             nloc = loc

OSError: Seek only available in read mode

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.14.138-114.102.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4

xarray: 0.15.1
pandas: 1.0.3
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: installed
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.1.1.2
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.1.3
cfgrib: None
iris: 2.4.0
bottleneck: None
dask: 2.14.0
distributed: 2.14.0
matplotlib: 3.2.1
cartopy: 0.17.0
seaborn: None
numbagg: None
setuptools: 46.1.3.post20200325
pip: 20.1
conda: None
pytest: 5.4.1
IPython: 7.13.0
sphinx: None

scottyhq commented 4 years ago

Not sure, but I think the h5netcdf engine is the only one that allows for file-like objects (so anything going through fsspec)

rsignell-usgs commented 4 years ago

Okay @scottyhq, I tried setting engine='h5netcdf', but still got:

OSError: Seek only available in read mode

Thinking about this a little more, it's pretty clear why writing NetCDF to S3 would require seek mode.

I asked @martindurant about supporting seek for writing in fsspec and he said that would be pretty hard. And in fact, the performance probably would be pretty terrible as lots of little writes would be required.

So maybe it's best just to write netcdf files locally and then push them to S3.

And to facilitate that, @martindurant merged a PR yesterday to enable simplecache for writing in fsspec, so after doing:

pip install git+https://github.com/intake/filesystem_spec.git

in my environment, this now works:

import xarray as xr
import fsspec

ds = xr.open_dataset('http://geoport.usgs.esipfed.org/thredds/dodsC'
                        '/silt/usgs/Projects/stellwagen/CF-1.6/BUZZ_BAY/2651-A.cdf')

outfile = fsspec.open('simplecache::s3://chs-pangeo-data-bucket/rsignell/foo2.nc', 
                      mode='wb', s3=dict(profile='default'))
with outfile as f:
    ds.to_netcdf(f)

(Here I'm telling fsspec to use the AWS credentials in my "default" profile)

Thanks Martin!!!

dcherian commented 4 years ago

I think we should add some documentation on this stuff.

We have "cloud storage buckets" under zarr( https://xarray.pydata.org/en/stable/io.html#cloud-storage-buckets) so maybe a similar section under netCDF?

martindurant commented 4 years ago

The write feature for simplecache isn't released yet, of course.

It would be interesting if someone could subclass a file object and write locally with h5netcdf to see what kind of seeks it does. Is it popping back to some file header to update array sizes? Presumably it would need a fixed-size header to do that. Parquet and other cloud formats have the metadata at the footer exactly for this reason, so you only write once you know everything and you only ever move forward in the file.
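
For reference, a rough, untested sketch of that experiment (nothing from this thread; the file name is just a scratch path): wrap a local file so every seek() is recorded, write through it with the h5netcdf engine, and inspect the offsets afterwards.

import io
import xarray as xr

class SeekLoggingFile(io.FileIO):
    """Local file that records every seek target, to show how non-linear the writes are."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seeks = []

    def seek(self, offset, whence=0):
        self.seeks.append((offset, whence))
        return super().seek(offset, whence)

ds = xr.Dataset({"x": ("t", [1.0, 2.0, 3.0])})
with SeekLoggingFile("seek_test.nc", "w+") as f:  # scratch file, purely illustrative
    ds.to_netcdf(f, engine="h5netcdf")
print(f.seeks)  # backward seeks here would indicate metadata scattered through the file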

rsignell-usgs commented 4 years ago

@martindurant, I asked @ajelenak offline and he reminded me that:

File metadata are dispersed throughout an HDF5 [and NetCDF4] file in order to support writing and modifying array sizes at any time of execution

Looking forward to simplecache:: for writing in fsspec=0.7.5!

nbren12 commented 4 years ago

I’ve run into this as well. It’s not pretty, but my usual workaround is to write to a local temporary file and then upload with fsspec. I can never remember exactly which netCDF engine to use...
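
For reference, a minimal sketch of that workaround (the bucket name, prefix, and AWS profile below are placeholders): write the netCDF file to a local temporary path first, then upload the finished file with fsspec.

import tempfile

import fsspec
import xarray as xr

ds = xr.Dataset({"x": ("t", [1.0, 2.0, 3.0])})

with tempfile.TemporaryDirectory() as tmpdir:
    local_path = f"{tmpdir}/output.nc"
    ds.to_netcdf(local_path)  # netCDF4 by default when the netCDF4 library is installed
    fs = fsspec.filesystem("s3", profile="default")
    fs.put(local_path, "some-bucket/some-prefix/output.nc")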

rsignell-usgs commented 3 years ago

I'm closing this; the recommended approach for writing NetCDF to object storage is to write locally, then push.

zoj613 commented 1 year ago

Is there any reliable way to write an xr.Dataset object as a netcdf file in 2023? I tried using the above approach with fsspec but I keep getting an OSError: [Errno 29] Seek only available in read mode exception.

martindurant commented 1 year ago

I can confirm that something like the following does work, basically automating the "write local and then push" workflow:

import xarray as xr
import fsspec
ds = xr.open_dataset('http://geoport.usgs.esipfed.org/thredds/dodsC'
                        '/silt/usgs/Projects/stellwagen/CF-1.6/BUZZ_BAY/2651-A.cdf')
outfile = fsspec.open('simplecache::gcs://mdtemp/foo2.nc',
                      mode='wb')
with outfile as f:
        ds.to_netcdf(f)

Unfortunately, directly writing to the remote file without a local cached file is not supported, because HDF5 does not write in a linear way.

zoj613 commented 1 year ago

Thanks, this actually worked for me. It seems as though initializing an S3 filesystem beforehand via fs = fsspec.filesystem("s3", ...) and using it as a context manager via with fs.open(...) as out: data.to_netcdf(out) caused the failure.

martindurant commented 1 year ago

Would you mind writing out long-hand the version that worked and the version that didn't?

zoj613 commented 1 year ago

What didn't work:

f = fsspec.filesystem("s3", anon=False)
with f.open("some_bucket/some_remote_destination.nc", mode="wb") as ff:
    xr.open_dataset("some_local_file.nc").to_netcdf(ff)

This results in an OSError: [Errno 29] Seek only available in read mode exception.

Changing the above to

with fsspec.open("simplecache::s3://some_bucket/some_remote_destination.nc", mode="wb") as ff:
    xr.open_dataset("some_local_file.nc").to_netcdf(ff)

fixed it.

peterdudfield commented 1 year ago


How would you go about reading this file once it is saved in S3?

I'm currently getting an error:

ValueError: b'CDF\x02\x00\x00\x00\x00' is not the signature of a valid netCDF4 file

martindurant commented 1 year ago

Maybe it is netCDF3? xarray is supposed to be able to determine the file type

with fsspec.open("s3://some_bucket/some_remote_destination.nc", mode="rb") as ff:
    ds = xr.open_dataset(ff)

but maybe play with the engine= argument.

peterdudfield commented 1 year ago

Thanks, I tried to make sure it was engine=h5netcdf when saving, but I'm not sure this worked.

alaws-USGS commented 1 year ago

@peterdudfield Could this be an issue with how you wrote out your NetCDF file? When I write to requester-pays buckets, my approach (using your variables) looks like this; it incorporates instructions from fsspec for remote write caching:

url = "simplecache::s3://file_path/to/file.nc"
with fsspec.open(url, 
                 mode="wb", 
                 s3={"profile": "default"}) as ff: # the important part is using s3={"profile":"your_aws_profile"}
    daymet_sel.to_netcdf(ff)

I haven't had any read issues with files saved this way.

peterdudfield commented 1 year ago

It could be. Here are some running notes of mine: https://github.com/openclimatefix/MetOfficeDataHub/issues/65

The method I'm using is:

with fsspec.open("simplecache::s3://nowcasting-nwp-development/data/test.netcdf", mode="wb") as f:
    dataset.to_netcdf(f, engine='h5netcdf')

zoj613 commented 1 year ago

I never needed to specify an engine when writing, you only need it when reading the file. I use the engine="scipy" one for reading.

martindurant commented 1 year ago

I use the engine="scipy" one for reading.

This is netCDF3, in that case. If that's fine for you, no problem.

peterdudfield commented 1 year ago

What do you mean this is netcdf3?

peterdudfield commented 1 year ago

I never needed to specify an engine when writing, you only need it when reading the file. I use the engine="scipy" one for reading.

using engine="scipy" worked - thank you

martindurant commented 1 year ago

scipy only reads/writes netcdf2/3 (https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.netcdf_file.html), which is a very different and simpler format than netcdf4. The latter uses HDF5 as a container, and h5netcdf as the xarray engine. I guess "to_netcdf" is ambiguous.
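
A small illustration of that distinction (file names are placeholders, and this is only a sketch): making engine= and format= explicit removes the ambiguity.

import xarray as xr

ds = xr.Dataset({"x": ("t", [1.0, 2.0, 3.0])})

# classic netCDF3 via scipy -- what the scipy fallback writes
ds.to_netcdf("classic.nc", engine="scipy", format="NETCDF3_64BIT")

# netCDF4 (HDF5 container) via h5netcdf -- not readable with engine="scipy"
ds.to_netcdf("modern.nc", engine="h5netcdf", format="NETCDF4")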

zoj613 commented 1 year ago

Based on the docs

The default format is NETCDF4 if you are saving a file to disk and have the netCDF4-python library available. Otherwise, xarray falls back to using scipy to write netCDF files and defaults to the NETCDF3_64BIT format (scipy does not support netCDF4).

It appears the scipy engine is safe if one does not want to be bothered with specifying engines. By the way, what are the limitations of the netCDF3 standard vs netCDF4?

martindurant commented 1 year ago

what are the limitations of the netcdf3 standard vs netcdf4

No compression, encoding or chunking except for the one "append" dimension.
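
To make that concrete, a hedged sketch (local file names are placeholders): per-variable compression and chunking are accepted by the netCDF4/HDF5 writers but rejected by the scipy (netCDF3) writer.

import numpy as np
import xarray as xr

ds = xr.Dataset({"data": (("x", "y"), np.random.rand(100, 100))})

# netCDF4/HDF5: per-variable zlib compression and chunking via encoding
ds.to_netcdf(
    "compressed.nc",
    engine="h5netcdf",
    encoding={"data": {"zlib": True, "complevel": 4, "chunksizes": (50, 50)}},
)

# netCDF3 via scipy: the same encoding is rejected, since the format has no
# notion of compression or chunking
try:
    ds.to_netcdf("classic.nc", engine="scipy", encoding={"data": {"zlib": True}})
except ValueError as err:
    print(err)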

jbusecke commented 5 months ago

I am currently running into this on gcs.

Minimal reproducer (running on the LEAP hub with pangeo/pangeo-notebook:2024.05.21):

import xarray as xr
import dask.array as dsa
import os
import fsspec

target_prefix = "gs://leap-scratch/jbusecke/huggingface_test"

ds_split = xr.DataArray(dsa.random.random([20, 11, 30, 10], chunks='auto'), dims=['x', 'y', 'z', 'time']).to_dataset(name='data_a')

filename = 'simplecache::gs://leap-scratch/jbusecke/some_file.nc'

# # Gives: ValueError: I/O operation on closed file
outfile = fsspec.open(filename, mode='wb')
with outfile as f:
    ds_split.to_netcdf(f, engine='netcdf4') # same with 'h5netcdf'

# Gives I/O operation on closed file
with fsspec.open(filename, mode='wb') as f:
    ds_split.to_netcdf(f, engine='netcdf4') # same with 'h5netcdf'

The error I am getting here seems different from what previous posters reported?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[42], line 14
     12 # # Gives: ValueError: I/O operation on closed file
     13 outfile = fsspec.open(filename, mode='wb')
---> 14 with outfile as f:
     15     ds_split.to_netcdf(f, engine='netcdf4') # same with 'h5netcdf'
     17 # Gives I/O operation on closed file

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/core.py:134, in OpenFile.__exit__(self, *args)
    133 def __exit__(self, *args):
--> 134     self.close()

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/core.py:154, in OpenFile.close(self)
    152     if "r" not in self.mode and not f.closed:
    153         f.flush()
--> 154     f.close()
    155 self.fobjects.clear()

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/implementations/cached.py:915, in LocalTempFile.close(self)
    914 def close(self):
--> 915     self.size = self.fh.tell()
    916     if self.closed:
    917         return

ValueError: I/O operation on closed file

Grateful for any advice on how to fix this.

Happy to add something like this (once it's fixed) to the docs, as @dcherian suggested above. Seems like an important piece that is missing right now to make it easier to work with cloud storage.

martindurant commented 5 months ago

I can confirm that the following change in fsspec works

--- a/fsspec/implementations/cached.py
+++ b/fsspec/implementations/cached.py
@@ -901,7 +901,7 @@ class LocalTempFile:
         self.close()

     def close(self):
-        self.size = self.fh.tell()
+        # self.size = self.fh.tell()
         if self.closed:
             return

Can anyone think of a reason why a local cached file should update its size at the time of upload?

jbusecke commented 4 months ago

🤷‍♂️

ollie-bell commented 1 month ago

This is the solution I have arrived at to avoid using the scipy engine (tested with gcsfs):

import fsspec
import xarray as xr
from upath import UPath

path = UPath("gs://some-bucket/some-file.nc")

# read netcdf
with path.open("rb") as f:
    ds = xr.load_dataset(f, engine="h5netcdf")

# write netcdf
with fsspec.open(f"simplecache::{path.as_uri()}", mode="wb") as f:
    ds.to_netcdf(f.name, engine="h5netcdf")

... the fact that f.name gives you the temporary file name on disk from the fsspec.open context is not obvious, but seems to work 🤷 As a bonus, the same code (inefficiently) works if the file is local as well (e.g. path = UPath("/some/local/file.nc")).

For me, the "I/O operation on a closed file" error seems to be related to reading with open_dataset (lazy) instead of load_dataset (eager)... so there's no chunking with this solution, but for my purposes here that doesn't matter. If it did, I'd use Zarr 😅