h5py / h5py

HDF5 for Python -- The h5py package is a Pythonic interface to the HDF5 binary data format.
http://www.h5py.org
BSD 3-Clause "New" or "Revised" License

Writing new file/appending to file in s3 using s3fs fails #1860

Closed shteren1 closed 3 years ago

shteren1 commented 3 years ago

Hi, reading existing files from S3 storage with s3fs works like a charm (replace 'ab' with 'rb' and 'a' with 'r' in the example below), but trying to write new files or append to existing files fails.

Tested package builds from conda (latest packages currently available):

s3fs 0.5.2 pyhd8ed1ab_0 (conda-forge)
h5py 3.1.0 nompi_py37h1e651dc_100 (defaults)

s3fs supports writing and appending to files in S3 for JSON, CSV, and other text files.
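For instance, a plain sequential write through s3fs works, since no seeking is involved (a minimal sketch; the bucket and key names are placeholders):

import s3fs

fs = s3fs.S3FileSystem()
# Sequential write of a small CSV file; nothing here needs to seek.
with fs.open("s3://your-bucket/example.csv", "w") as f:
    f.write("a,b,c\n1,2,3\n")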

Consider the following example:

import s3fs
import h5py

key = "s3://your-bucket/your_key.h5"
fs = s3fs.S3FileSystem()
with fs.open(key, 'ab') as h5file:
    with h5py.File(h5file, 'a') as h5f:
        print(list(h5f.keys()))

This fails with: ValueError: Invalid value of 'fileobj' argument; must equal to file-like object if specified, raised on line 407 in h5py/_hl/files.py. That check doesn't make much sense: it's basically testing the s3file object against itself and returning False. When I open the same file with 'rb' instead of 'ab' in s3fs and 'r' in h5py.File, the same test returns True.

I tried meddling with the h5py code and commented out this check, but it still fails later with: AttributeError: 'S3File' object has no attribute 'seek'

Thanks, Yotam Stern.

takluyver commented 3 years ago

I don't know how those errors appear, but I suspect this isn't going to work. s3fs is built on fsspec, which says that files are only seekable in read mode, but HDF5 needs to seek in all modes.
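A quick way to see the difference (a minimal sketch, assuming fsspec's usual buffered-file behaviour; the bucket and keys are placeholders):

import s3fs

fs = s3fs.S3FileSystem()

# Read mode: the file object supports seeking.
with fs.open("s3://your-bucket/your_key.h5", "rb") as f:
    print(f.seekable())  # expected: True

# Write mode: fsspec buffered files are not seekable, so HDF5's
# file driver can't jump back to update metadata as it writes.
with fs.open("s3://your-bucket/scratch.bin", "wb") as f:
    print(f.seekable())  # expected: False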

The no attribute 'seek' error is similar to #1434 and #1530. I thought we had fixed that. Are you sure you're using h5py 3.1? Check with h5py.version.info.
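For example:

import h5py

# Prints the h5py, HDF5 and Python build details for this environment.
print(h5py.version.info)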

HDF5's "file drivers" API is not really well suited to data in object stores like S3. HDF5 assumes that it's fairly cheap to jump around in a file and read/write small amounts of data, whereas object stores are built around storing and retrieving a whole 'blob' at once. The HDF Group has been trying to address this, first with the HSDS system, and then with the virtual object layer (VOL), which is new in HDF5 1.12.

takluyver commented 3 years ago

Closing as I don't think there's any way to fix this. HDF5 expects that files are seekable, and s3fs files are only seekable in read mode.
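For reference, the read-only path from the original report does work, because read-mode s3fs files are seekable (same placeholder bucket and key as above):

import s3fs
import h5py

fs = s3fs.S3FileSystem()
# Opening with 'rb' / 'r' succeeds: read-mode s3fs files support seek().
with fs.open("s3://your-bucket/your_key.h5", "rb") as h5file:
    with h5py.File(h5file, "r") as h5f:
        print(list(h5f.keys()))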