lucianopaz / compress_pickle

Standard python pickle, thinly wrapped with standard compression libraries
MIT License

Fails deserializing multiple objects #24

Open zacps opened 3 years ago

zacps commented 3 years ago

Pickle is capable of storing multiple objects in the same file, because each dump is self-contained.

Hence the following:

from pickle import dump, load
with open('test.gz', 'wb') as f:
    dump(1, f)
    dump(2, f)
with open('test.gz', 'rb') as f:
    print(load(f))
    print(load(f))

Returns

1
2
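(As an aside, when the number of stored objects isn't known in advance, a common pattern — a sketch, not part of the original report — is to keep loading until pickle signals the end of the stream:)

```python
import pickle

def load_all(f):
    # Yield every pickled object in the file; pickle raises
    # EOFError once the stream is exhausted.
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return
```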

But with this library:

from compress_pickle import dump, load
with open('test.gz', 'wb') as f:
    dump(1, f, compression='gzip')
    dump(2, f, compression='gzip')
with open('test.gz', 'rb') as f:
    print(load(f, compression='gzip'))
    print(load(f, compression='gzip'))

It returns

1
\compress_pickle\compress_pickle.py in load(path, compression, mode, fix_imports, encoding, errors, buffers, arcname, set_default_extension, unhandled_extensions, **kwargs)
    334     else:
    335         try:
--> 336             output = pickle.load(  # type: ignore
    337                 io_stream,
    338                 encoding=encoding,

EOFError: Ran out of input
lucianopaz commented 3 years ago

Oh, that would be nice to have. I think that some of the compression packages, like zip, will make this difficult to support. Anyway, thanks for reporting @zacps! I'll see what I can do about this functionality when I get some time.
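(As a possible interim workaround using only the standard library, not compress_pickle's API: a gzip file object can be written to and read from incrementally, so several plain-pickle dumps into one gzip stream behave just like the uncompressed case.)

```python
import gzip
import pickle

# Write several pickles into a single gzip stream.
with gzip.open("test.gz", "wb") as f:
    pickle.dump(1, f)
    pickle.dump(2, f)

# Read them back in order; each pickle frame is self-delimiting.
with gzip.open("test.gz", "rb") as f:
    print(pickle.load(f))  # 1
    print(pickle.load(f))  # 2
```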

zacps commented 3 years ago

With zip I think you could do it by writing each call as an individual file in the zip archive. It's possible to add files to a zip without rewriting anything other than the central directory.

My approach would be to check whether the current file object is seekable; if it is and the cursor position is non-zero, assume we're in the middle of an archive. At that point, read the central directory to find where the next file should start, write the new entry there, then write the central directory back in.
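(This per-member approach can be sketched with the standard library's zipfile, which already handles the central-directory bookkeeping in append mode. The member naming scheme here is hypothetical, not compress_pickle's:)

```python
import pickle
import zipfile

def zip_dump(obj, path):
    # Append mode adds a new member and rewrites only the
    # central directory; the file is created if it doesn't exist.
    with zipfile.ZipFile(path, mode="a") as zf:
        name = f"obj_{len(zf.namelist())}.pkl"
        zf.writestr(name, pickle.dumps(obj))

def zip_load_all(path):
    # Each archive member holds one pickled object.
    with zipfile.ZipFile(path) as zf:
        return [pickle.loads(zf.read(name)) for name in zf.namelist()]
```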

At load:

eode commented 2 years ago

Maybe from an API standpoint, considering that some compression algorithms will support multiple objects and some won't, you could simply add an allow_multiple=True parameter wherever compression is specified (thus giving the library a chance to reject an unsupported compression-algorithm/allow_multiple pair).
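(A minimal sketch of that validation idea — the function signature, the set of supported compressions, and the names are all hypothetical, not compress_pickle's actual internals:)

```python
# Hypothetical: compressions whose streams can hold several pickle frames.
MULTI_OBJECT_COMPRESSIONS = {None, "gzip", "bz2", "lzma"}

def dump(obj, f, compression=None, allow_multiple=False, **kwargs):
    # Reject unsupported compression/allow_multiple pairs up front.
    if allow_multiple and compression not in MULTI_OBJECT_COMPRESSIONS:
        raise ValueError(
            f"compression {compression!r} does not support multiple objects"
        )
    ...  # delegate to the existing dump logic
```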