telegraphic / hickle

a HDF5-based python pickle replacement
http://telegraphic.github.io/hickle/

Dumping to io.BufferedReader Fails #144

Closed vladfi1 closed 3 years ago

vladfi1 commented 3 years ago

I am trying to dump to in-memory bytes so that I can then compress these bytes with zlib before writing to disk.

raw = io.BytesIO()
writer = io.BufferedWriter(raw)
hickle.dump(obj, writer)

The last line raises AttributeError: '_io.BytesIO' object has no attribute 'name'.

hernot commented 3 years ago

I think the error comes from these lines in the file_opener function:

if isinstance(f, (io.TextIOWrapper, io.BufferedWriter)):
    filename, mode = f.name, f.mode
    f.close()
    mode = mode.replace('b', '')
    h5f = h5.File(filename, mode)

Instead of passing the file-like object to h5py as is, or at least specifying the driver as 'fileobj', as in

h5f = h5.File(f, mode, driver='fileobj')

the function tries to read the file name and mode from the passed-in TextIOWrapper or BufferedWriter (any other file-like objects are ignored), then closes the file and opens a new HDF5 file with that file name and mode. Apart from that, only an already opened h5py.File object or a plain file path string is accepted; anything else raises an exception. Perhaps somebody should look into this when finalizing the next minor or major release: h5py from 2.10 on is definitely capable of handling file-like objects as long as they provide read, seek, tell and write methods (see h5py.File).
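The reported AttributeError can be reproduced without hickle or h5py. The following is a minimal sketch using only the standard library, assuming the f.name lookup in the snippet quoted above is the first thing to fail; the helper name name_and_mode is made up for illustration and is not part of hickle.

```python
import io


def name_and_mode(f):
    """Mimic the attribute lookups of the quoted snippet (hypothetical helper)."""
    return f.name, f.mode


raw = io.BytesIO()
writer = io.BufferedWriter(raw)

try:
    name_and_mode(writer)
    error = None
except AttributeError as exc:
    # BufferedWriter forwards .name to the wrapped raw stream,
    # and an in-memory BytesIO has no file name to report.
    error = str(exc)

print(error)
```

Any approach that needs a file name cannot work for a purely in-memory buffer, which is why passing the object itself to h5py is the alternative discussed here.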

1313e commented 3 years ago

Oh, I guess I missed this one.

I am not entirely sure why that does not work, as we do test for it, but I will have a look. However, h5py was recently updated to 3.0.0, which brought a ton of changes (which are also incompatible with hickle), so I am planning to do a pass over the entire package to account for that. I will add this issue to that list, but it may take a while before it is fixed.

hernot commented 3 years ago

@1313e @telegraphic just in case it is of any interest or help to you: being a bit bored while waiting for @telegraphic to decide on pull request #138, I tried a proof of concept for handling file and file-like objects as supported by h5py. The results of this trial and error can be found in the detached concept_memp_compact_expand branch of my hickle fork.

Yes, it is very duck-typed and Pythonic, but if you look at h5py.File, its init method just checks, when a file or file-like object is passed, for the existence of the 'read' and 'seek' attributes.
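A duck-typing check of this kind can be sketched in a few lines. This is an illustration only, not h5py's actual code; the helper name looks_file_like is hypothetical. It also shows why the mere presence of an attribute says little about whether the operation is supported.

```python
import io


def looks_file_like(obj):
    """Return True if obj has both a read and a seek attribute (hypothetical helper)."""
    return hasattr(obj, "read") and hasattr(obj, "seek")


in_memory = looks_file_like(io.BytesIO())          # binary in-memory buffer
path_string = looks_file_like("/tmp/somefile.h5")  # a plain str has neither attribute

# A BufferedWriter inherits a read attribute from its base class,
# so it passes the attribute check although it cannot actually be read.
writer = io.BufferedWriter(io.BytesIO())
wrapped = looks_file_like(writer)
wrapped_readable = writer.readable()
```

A stricter check would call readable(), writable() and seekable() where they exist, which is the kind of extra validation effort discussed in this thread.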

hernot commented 3 years ago

The same here: this would be included in my finalize and cleanup pull request after #138 and the upcoming ones for #139 and #145.

1313e commented 3 years ago

After looking into it, I realize that this is not an error. hickle can solely be used to dump to HDF5-files. A BufferedWriter is not an HDF5-file, so hickle cannot dump to it.

vladfi1 commented 3 years ago

Are there plans to support writing to in-memory bytes rather than files?

1313e commented 3 years ago

Not at the moment, no.

telegraphic commented 3 years ago

@vladfi1 just to chime in here: hickle is indeed designed specifically for dumping to HDF5 files, and uses h5py as its API -- which doesn't support BufferedWriter. If you really wanted a HDF5 file in memory, you could try setting up a ramdisk? However I think there are probably better solutions out there for in-memory data storage...

hernot commented 3 years ago

@telegraphic @1313e not quite true: according to the documentation, h5py 2.10 and onward supports any file-like object that can read and write binary data and is seekable, and io.BytesIO fulfills exactly that; an example can be found in the h5py manual. So the question is rather whether it is worth the effort to add all the checks required to verify that a passed-in file-like object conforms to h5py's requirements. If not, remove support for file-like objects and Python file handles from hickle and support only filename strings and h5py.File objects.

@vladfi1 why would you need io.BufferedWriter? io.BytesIO is already an io.BufferedIOBase type object (see the Python IO manual), just like io.BufferedWriter and io.BufferedReader, and is thus already buffered. So simply replace io.BufferedWriter with h5py.File to make your example work.

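raw = io.BytesIO()  # create an instance, not a reference to the class
writer = h5py.File(raw, 'w')  # open the in-memory buffer for writing
hickle.dump(obj, writer, mode='w')
writer.close()  # make sure everything is flushed into raw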

and on read:

reader = h5py.File(raw, 'r')  # reopen the same buffer read-only
obj = hickle.load(reader)  # load takes the file object, not the target object

So you see, there is no need for io.BufferedWriter at all; in other words, h5py.File acts as the wrapping writer and reader.
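For the original goal of compressing the in-memory result with zlib, the bytes can be taken from the buffer once the h5py.File wrapping it has been closed. The following is a sketch under assumptions: both helper names are hypothetical, and placeholder bytes stand in for the HDF5 content so that the snippet runs with the standard library alone.

```python
import io
import zlib


def compress_buffer(raw):
    """Compress the full contents of an in-memory buffer (hypothetical helper)."""
    return zlib.compress(raw.getvalue())


def decompress_to_buffer(blob):
    """Rebuild a seekable in-memory buffer from compressed bytes (hypothetical helper)."""
    return io.BytesIO(zlib.decompress(blob))


# Placeholder payload; in practice raw would hold what h5py wrote into it.
raw = io.BytesIO(b"placeholder for HDF5 file content" * 10)
blob = compress_buffer(raw)
restored = decompress_to_buffer(blob)
```

The restored buffer is again a seekable binary file-like object, so it could be handed to h5py.File for reading in the same way as raw above.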

1313e commented 3 years ago

@hernot It is still true what we are saying. It does not matter if h5py allows writing to other filetypes, hickle does not support it.

hernot commented 3 years ago

Yes, you are right, bad wording on my side. What I wanted to say is that if hickle does not support it, it should not allow file objects and file-like objects to be passed at all. The way it is done now is broken and goes against expectations when passing file objects, with the consequence that this will not remain the only issue related to strange or broken support of file and file-like objects.

An example:

fid = open('/tmp/somefile.h5','w+b')
writer = io.BufferedWriter(fid)
hickle.dump(obj,writer)
fid.flush()
fid.seek(0)
somesocket.write(fid.read())

But that does not work, as hickle just takes the filename, closes the original file or file-like object, and replaces the underlying file on disk with a completely new file of the same name. The new file gets the default access rights for files owned by the process running hickle, not necessarily those of the original file. This does not make sense to me at all. Why should I first open a file that is never used? Or, even worse, when reading an already written hickle file, the file is deleted. When I open a file beforehand, I want hickle to place the HDF5 content in exactly that file and nowhere else, and wrapping it in 'io.BufferedWriter' or 'io.TextIOWrapper' should not make any difference.

So, and this is what I meant: either decide to properly support file and file-like objects, e.g. from hickle >= 5.0 on, and make the effort to fix it by then, or decide not to support file and file-like objects at all beyond indirect support through passed-in h5py.File objects. In the latter case, remove support completely and allow only filename strings and h5py.File objects to be passed.
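The effect on the caller's own handle can be simulated with the standard library. This is only a sketch of the close-and-reopen-by-name pattern described in this thread: reopen_by_name is a hypothetical stand-in, and plain open() takes the place of h5py.File, so it does not reproduce hickle's exact behavior.

```python
import io
import os
import tempfile


def reopen_by_name(f):
    """Take name and mode, close the handle, reopen by name (hypothetical stand-in)."""
    filename, mode = f.name, f.mode
    f.close()
    # Plain open() stands in for h5py.File here.
    return open(filename, mode)


fd, path = tempfile.mkstemp(suffix=".h5")
os.close(fd)

fid = open(path, "w+b")
writer = io.BufferedWriter(fid)
replacement = reopen_by_name(writer)

# Closing the wrapper also closed the caller's own handle.
caller_handle_closed = fid.closed
try:
    fid.seek(0)
    seek_error = None
except ValueError as exc:
    seek_error = str(exc)

replacement.close()
os.remove(path)
```

After the reopen, the handle the caller created is closed, so later calls such as fid.flush() or fid.seek(0) from the example above can no longer succeed.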