rickardp / splitstream

Continuous object splitter for C and Python
Apache License 2.0

[Python] Unable to read from gzip stream #2

Closed. viernullvier closed this issue 4 years ago

viernullvier commented 9 years ago

The splitstream module doesn't work with gzip file streams. I haven't been able to trace the cause: no exception is raised, and splitfile simply yields nothing.

Workaround: read the entire file into a StringIO stream. Wrapping the gzip file in a BufferedReader doesn't help either; it seems to pass the unprocessed gzip data to splitstream.

Tested with Python 2.7.10 on OS X

from splitstream import splitfile
from cStringIO import StringIO
from io import BufferedReader
import gzip

with gzip.open("file.gz", 'rb') as f:
    for obj in splitfile(f, "json"):
        print obj  # will never get called
    f.seek(0)
    for obj in splitfile(BufferedReader(f), "json"):
        print obj  # returns garbage
    f.seek(0)
    for obj in splitfile(StringIO(f.read()), "json"):
        print obj  # works as intended

rickardp commented 8 years ago

This was actually a design choice: if the file object has a fileno, it is assumed that the descriptor can be read directly instead of calling into Python for each chunk.

However, it seems that gzip file objects also expose a fileno, which is not how I understand the docs. I could add an option to disable this optimization, but I would prefer a canonical way of determining whether it is safe to use the fileno.
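To illustrate the behaviour described above, here is a minimal Python 3 sketch using only the standard library (not splitstream itself). It assumes that reading the descriptor returned by fileno() directly is roughly what the optimization does; the helper name `raw_and_decoded` is made up for this example.

```python
import gzip
import os
import tempfile


def raw_and_decoded(payload):
    """Return (bytes read straight from the descriptor, bytes read through gzip)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file.gz")
        with gzip.open(path, "wb") as out:
            out.write(payload)

        # Reading via the descriptor bypasses the gzip layer entirely.
        with gzip.open(path, "rb") as f:
            raw = os.read(f.fileno(), 2)

        # Reading via the file object goes through decompression.
        with gzip.open(path, "rb") as f:
            decoded = f.read()

    return raw, decoded


payload = b'{"a": 1}{"b": 2}'
raw, decoded = raw_and_decoded(payload)
print(raw)      # first two bytes of the compressed file (the gzip magic number)
print(decoded)  # the original JSON text
```

The descriptor belongs to the underlying compressed file, so a consumer that reads from it directly sees gzip data rather than the decompressed objects.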

A workaround that does not require reading the entire file into a string is to wrap the file object in an object that does not expose the fileno() method, e.g.:

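class Wrapper(object):
    """Expose only read(), so there is no fileno() to read from directly."""
    def __init__(self, f):
        self.__f = f
    def read(self, *n):
        # Delegate to the wrapped file so gzip decompression still happens.
        return self.__f.read(*n)

# f is the gzip file object from the report above
for obj in splitfile(Wrapper(f), "json"):
    print obj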
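The same idea can be checked without splitstream installed. The following Python 3 sketch is only an illustration: `ReadOnlyWrapper` mirrors the wrapper above, and `read_in_chunks` is a hypothetical stand-in for any consumer that calls read() repeatedly. It makes no claim about how splitfile reads internally.

```python
import gzip
import io


class ReadOnlyWrapper:
    """Expose only read(), hiding fileno() from the consumer."""

    def __init__(self, f):
        self._f = f

    def read(self, *n):
        # Delegate to the wrapped gzip object, which decompresses on the fly.
        return self._f.read(*n)


def read_in_chunks(stream, chunk_size=4):
    """Hypothetical consumer: pull data through read() until it is exhausted."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


payload = b'{"a": 1}{"b": 2}'
gz = gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(payload)))
wrapped = ReadOnlyWrapper(gz)

has_fileno = hasattr(wrapped, "fileno")  # the wrapper does not forward fileno()
data = read_in_chunks(wrapped)           # decompressed bytes, read chunk by chunk
print(has_fileno, data)
```

Because the wrapper forwards only read(), the data is decompressed incrementally and the whole file never has to be held in memory at once.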

rickardp commented 4 years ago

Fixed in PR #10, now that I have finally migrated to GitHub Actions.