vstinner / hachoir

Hachoir is a Python library to view and edit a binary stream field by field
http://hachoir.readthedocs.io/
GNU General Public License v2.0
604 stars 70 forks source link

Added passing byte stream to FileInputStream #55

Closed allerter closed 4 years ago

allerter commented 4 years ago

Telethon uses hachoir to extract metadata from audio and video files. But I read that it only does so for files on the local disk since hachoir doesn't create parsers from byte arrays and byte streams. So I added the ability to create streams from the passed filename, so the parsers could be created from byte array and streams. Two things:

I'd love to hear your thoughts on this.

nneonneo commented 4 years ago

My impression is that FileInputStream should take a file as input, and InputIOStream would be more appropriate for a byte array (wrapped in an io.BytesIO).

allerter commented 4 years ago

You're right. Would it be more appropriate if I checked the filename inside createParser? Something like this:

def createParser(filename, real_filename=None, tags=None):
    """
    Create a parser from a file or returns None on error.

    Options:
    - file (unicode|bytes|io.IOBase): Input file name, byte array
        or a byte io.IOBase stream  ;
    - real_filename (str|unicode): Real file name.
    """
    if not tags:
        tags = []
    if isinstance(filename, str):
        stream = FileInputStream(filename, real_filename, tags=tags)
    else:
        inputio = (filename if isinstance(filename, io.BytesIO)
                   else io.BytesIO(filename))
        filename = real_filename or getattr(inputio, 'name', '')
        source = "file:" + filename
        stream = InputIOStream(inputio, source=source, tags=tags)
    return guessParser(stream)
nneonneo commented 4 years ago

While I agree it's useful to be able to create parsers from a byte array or byte stream, this is already possible with ~2 lines of code:

stream = InputIOStream(BytesIO(data), ...)
parser = guessParser(stream)

so the decision of what kind of stream to use can be pushed up one level to the caller (e.g. telethon, etc.).

It may be useful to allow FileInputStream and createParser to take a file object as well as a filename, in a manner analogous to what gzip.open does from the stdlib; however, I'd worry that letting it take a byte array too would be potentially confusing (open and gzip.open, for example, treat byte strings as raw filenames).

So - maybe the best approach would be to adapt FileInputStream to take a file object as filename, but not a byte array? This would at least give it some symmetry with similar constructs in the stdlib. (In the future we might decide, for instance, that bytes inputs should be considered raw filenames).

allerter commented 4 years ago

What you said makes sense, so I removed handling byte arrays. But should I close the PR anyway in favor of having the caller decide what stream to use?

vstinner commented 4 years ago

Merged, thanks!