Added passing byte stream to FileInputStream

vstinner / hachoir

Hachoir is a Python library to view and edit a binary stream field by field

http://hachoir.readthedocs.io/

GNU General Public License v2.0

604 stars 70 forks source link

Added passing byte stream to FileInputStream #55

Closed allerter closed 4 years ago

allerter commented 4 years ago

Telethon uses hachoir to extract metadata from audio and video files. But I read that it only does so for files on the local disk since hachoir doesn't create parsers from byte arrays and byte streams. So I added the ability to create streams from the passed filename, so the parsers could be created from byte array and streams. Two things:

The context manager closes the file when it exits, so it also closes BytesIO files. But I don't know if that's the expected outcome, or the file's expected to be closed even if it's a BytesIO object.
I only tried passing the parser to extractMetadata, and that worked without a hitch. But I don't know if that'd be the case for other functions too. But since the stream gets constructed correctly, I don't see why there would be a problem. And all the tests succeed, so that must be good.

I'd love to hear your thoughts on this.

nneonneo commented 4 years ago

My impression is that FileInputStream should take a file as input, and InputIOStream would be more appropriate for a byte array (wrapped in an io.BytesIO).

allerter commented 4 years ago

You're right. Would it be more appropriate if I checked the filename inside createParser? Something like this:

def createParser(filename, real_filename=None, tags=None):
    """
    Create a parser from a file or returns None on error.

    Options:
    - file (unicode|bytes|io.IOBase): Input file name, byte array
        or a byte io.IOBase stream  ;
    - real_filename (str|unicode): Real file name.
    """
    if not tags:
        tags = []
    if isinstance(filename, str):
        stream = FileInputStream(filename, real_filename, tags=tags)
    else:
        inputio = (filename if isinstance(filename, io.BytesIO)
                   else io.BytesIO(filename))
        filename = real_filename or getattr(inputio, 'name', '')
        source = "file:" + filename
        stream = InputIOStream(inputio, source=source, tags=tags)
    return guessParser(stream)

nneonneo commented 4 years ago

While I agree it's useful to be able to create parsers from a byte array or byte stream, this is already possible with ~2 lines of code:

stream = InputIOStream(BytesIO(data), ...)
parser = guessParser(stream)

so the decision of what kind of stream to use can be pushed up one level to the caller (e.g. telethon, etc.).

It may be useful to allow FileInputStream and createParser to take a file object as well as a filename, in a manner analogous to what gzip.open does from the stdlib; however, I'd worry that letting it take a byte array too would be potentially confusing (open and gzip.open, for example, treat byte strings as raw filenames).

So - maybe the best approach would be to adapt FileInputStream to take a file object as filename, but not a byte array? This would at least give it some symmetry with similar constructs in the stdlib. (In the future we might decide, for instance, that bytes inputs should be considered raw filenames).

allerter commented 4 years ago

What you said makes sense, so I removed handling byte arrays. But should I close the PR anyway in favor of having the caller decide what stream to use?

vstinner commented 4 years ago

Merged, thanks!