Also accept file-like objects

JulienPalard commented 4 years ago

Bonjour,

Tried myself at accepting file-like objects (in my usage I work on HTTP responses body, so I'm happy if I don't have to store them on disk).

I tried to keep the diff small, wrote a few tests, and to avoid backward incompatibilities, at the cost of some inconsistencies, like keeping the variable name "filename", just in case someone once typed .parse(filename=my_file).

A few details are different between parsing from a file and parsing from a buffer, besides the filename, like:

(Pdb) p opened_using_filename.tracks[1].to_data()["encoding_settings"]
'cabac=1 / ref=3 / deblock=1:0:0 / analyse=0x3:0x113 / me=hex / subme=7 / psy=1 / psy_rd=1.00:0.00 / mixed_ref=1 / me_range=16 / chroma_me=1 / trellis=1 / 8x8dct=1 / cqm=0 / deadzone=21,11 / fast_pskip=1 / chroma_qp_offset=-2 / threads=72 / lookahead_threads=8 / sliced_threads=0 / nr=0 / decimate=1 / interlaced=0 / bluray_compat=0 / stitchable=1 / constrained_intra=0 / bframes=3 / b_pyramid=2 / b_adapt=1 / b_bias=0 / direct=1 / weightb=1 / open_gop=0 / weightp=2 / keyint=infinite / keyint_min=23 / scenecut=40 / intra_refresh=0 / rc_lookahead=40 / rc=crf / mbtree=1 / crf=20.0 / qcomp=0.60 / qpmin=5 / qpmax=69 / qpstep=4 / vbv_maxrate=4950 / vbv_bufsize=13500 / crf_max=0.0 / nal_hrd=none / filler=0 / ip_ratio=1.40 / aq=1:1.00'
(Pdb) p opened_using_file_obj.tracks[1].to_data()["encoding_settings"]
*** KeyError: 'encoding_settings'

I'm not fan of the new state of _parse_filename.

It's hard to resist to the temptation to guess more (between URL and windows path I mean), but the Zen of Python tells me to resist.

Maybe we should move the open ... pass ... (which is here to raise FileNotFoundError instead of RuntimeError) near An error occured while opening, to avoid opening for nothing in case the file exists, and we could use the same test as MediaInfo, which is a bit hard to read for me (elses not aligned properly?) but looks like "If contains ://, try curl, else try file", something like:

if lib.MediaInfo_Open(handle, filename) == 0:
    lib.MediaInfo_Close(handle)
    lib.MediaInfo_Delete(handle)
    if "://" not in filename:
        with open(filename): pass
    raise RuntimeError("An error occured while opening {}"
                       " with libmediainfo".format(filename))

This would greatly simplify _parse_filename (mostly remove it in fact, we would just keep the conversion to string), it would also drop "should_be_openable", which I'm not proud of.

Now I cross my fingers to see if the CI passes with all those Python version :)

JeromeMartinez commented 4 years ago

in my usage I work on HTTP responses body

Not sure I understand well, but MediaInfo supports well URLs, what is missing natively?

A few details are different between parsing from a file and parsing from a buffer, besides the filename

You didn't implement the full "by buffer" API. See upstream example, especially the "GoTo" part.

sbraz commented 4 years ago

@JeromeMartinez does that also work with streams of an unknown size?

JeromeMartinez commented 4 years ago

does that also work with streams of an unknown size?

As much as possible e.g. for MPEG-TS there will be no seek to the end (it is unknown...) so no duration (computed by end timestamps at the end of the file).

Note: unknown file size has not been extensively tested and may have weird behaviors, it would be considered as bug by me but low priority.

JulienPalard commented 4 years ago

in my usage I work on HTTP responses body

Not sure I understand well, but MediaInfo supports well URLs, what is missing natively?

I'm using pytest, I'm testing an asyncio library working on medias, ... some code may spoke more:

def width_height(media_bytes): # My other PRs are to try to write this more cleanly
    with NamedTemporaryFile(suffix=".jpg") as f:
        f.write(media_bytes)
        f.seek(0)
        info = MediaInfo.parse(f.name)
    assert info.tracks
    return info.tracks[1].width, info.tracks[1].height

async def test_image_resized(cli, tmpdir):
    resp = await cli.get("/images/resized/lenna.jpg")
    width, height = width_height(await resp.read())
    assert resp.status == 200
    assert width == 1020
    assert height > 10  # depends on the width, only the width is fixed in /resized/

so no, I can't use your embedded curl library to do so, I even bet it's not crafting the HTTP request (as text as we understand), it's probably bypassing directly to the server code. But thanks for caring :)

JulienPalard commented 4 years ago

Performance comparisons on a 1.4GB file

With my ugly .read() it takes around 10s
By mmaping the file it takes around 10s too (most spent inside MediaInfo_Open_Buffer, don't know why)
By name it takes 0.17s
By using the buffer protocol 0.17s

So yes I'll go fo the buffer protocol.

JulienPalard commented 4 years ago

Still had no time to check for the speed parameter, have to eat now.

Given a filename, the demo.py give me 193 lines, versus 78 lines if I modify it to give an opened file, so please don't merge, I need to study this first.

(Also I bet tests may not pass on all tested Python versions, I'm not really fluent in Python 2.7)

sbraz commented 4 years ago

No worries, I will look into the Python 2 failures later if there are any.

JulienPalard commented 4 years ago

Even with parse_speed=1, which is the maximum, I'm getting way less data:

158 key-value-pairs by filename (.1s to run)
66 key-value-pairs by opened file (.1s to run)
69 key-value-pairs by opened file with parse_speed=1 (5s to run)

@JeromeMartinez is it a known behavior? I have hard times finding documentation about it, any link?

JulienPalard commented 4 years ago

As expected I'm having a Python 2.7 issue, where a str can't be converted to a str (the irony), which I can't reproduce on my machine at the moment.

The error is:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 42: ordinal not in range(128)

Strange fact: \xe9 is latin1, but the file is in UTF-8, so it should be \xc3\xa9

But even with the correct UTF-8 sequence it would fail as it requires for values to be in range(128).

JeromeMartinez commented 4 years ago

@JeromeMartinez is it a known behavior?

The "filename" parsing internally uses the "opened file" parsing. I provide support for my own example only, so please demonstrate an issue with the "by buffer" example I linked.

with parse_speed=1 (5s to run)

Please use this option only for what it is intended for (= parsing the whole file for having more info), it is out of scope here.

sbraz commented 4 years ago

Please use this option only for what it is intended for (= parsing the whole file for having more info), it is out of scope here.

That's my fault, I thought maybe the default parse speed for buffers was different :)

JulienPalard commented 4 years ago

I provide support for my own example only, so please demonstrate an issue with the "by buffer" example I linked.

I can understand, and obviously I found my error while producing a minimal reproducible example, sorry for the noise.

It's fixed now, I forgot to call the Finalizer.

JulienPalard commented 4 years ago

@sbraz There's still an issue with Python 2.7, looks like appveyor may be lying, it may not be filename = 'C:\projects\pymediainfo\tests\data\accentué.txt' that can't be converted to string using str(filename), which make no sense, it's probably filename = u'C:\projects\pymediainfo\tests\data\accentué.txt', but it's only guessing, and I don't see why filename could be transformed from a str object to a unicode object, can't reproduce it locally.

sbraz commented 4 years ago

looks like appveyor may be lying

Not sure why GitHub doesn't show Travis builds (I'll look into it) but the issue is not related to AppVeyor, see https://travis-ci.org/github/sbraz/pymediainfo/

I can reproduce it anywhere with Python 2. If you can't find what causes it, I'll look into it later this week.

JulienPalard commented 4 years ago

OK, I can reproduce it using Python 2.7 (yes I finally created a venv), and yes pytest is lying in saying it's filename = '/home/mdk/clones/pymediainfo/tests/data/accentué.txt', in fact it's u'/home/mdk/clones/pymediainfo/tests/data/accentu\xe9.txt' (according to pdb). I hope I can fix this.

sbraz commented 4 years ago

I would have to check what the old code did (I'm not a Python 2 expert either, can't wait for it to really die :smile:), but this works:

diff --git a/pymediainfo/__init__.py b/pymediainfo/__init__.py
index 0d95961..75db2b6 100644
--- a/pymediainfo/__init__.py
+++ b/pymediainfo/__init__.py
@@ -163,6 +163,8 @@ class MediaInfo(object):
     def _normalize_filename(filename):
         if hasattr(os, "PathLike") and isinstance(filename, os.PathLike):
             return os.fspath(filename)
+        elif isinstance(filename, unicode):
+            return filename
         return str(filename)
     @staticmethod
     def _get_library(library_file=None):

Oh, and Travis is back.

sbraz commented 4 years ago

As for pytest "lying", this is because I enabled unicode_literals to make the code easier to write. Otherwise I have to use the u prefix for non-ASCII strings.

JulienPalard commented 4 years ago

        elif isinstance(filename, unicode):
+            return filename

Yes I bet it work, too, I fallbacked on the old code for safety, which was just returning the unmodified parameter in this case.

JulienPalard commented 4 years ago

I'm not a Python 2 expert either, can't wait for it to really die

Python 2 is officially EOL (no bug fix, no security fix), and according to the logs on https://docs.python.org it is ~dead, I bet most of the traffic is from bots and people like us maintaining compatibility. I would advise to drop Python 2 compatibility in your next major release, like many project did, this still allow Python 2 users to use old releases.

sbraz commented 4 years ago

Yeah I know upstream is dead but there are still distros shipping it and I'm sure there are people using it. I'm not going to go as far as to wait until Red Hat 7 is EOL but I want to wait a little while (maybe 2021 or when Travis or AppVeyor remove it).

The fact that someone asked for Python 2.6 support in 2016 makes me want to be cautious, I don't want to drop 2.7 too early and see some pull request to add it back, it's not such a huge burden to maintain in general :)

JulienPalard commented 4 years ago

Damned, tests no longer pass on win32 and I don't have a Windows machine at home to test it :,-(

Looks like there's a difference between:

python3 demo.py tests/data/sample.mp4

and the same demo.py with this patch:

@@ -21,7 +21,8 @@ def print_frame(text):

 def process(fname):
-    media_info = MediaInfo.parse(fname)
+    with open(fname, 'rb') as f:
+        media_info = MediaInfo.parse(f)
     for track in media_info.tracks:
         print_frame(track.track_type)
         pprint(track.to_data())

There's an expected diff in the General section (file related), but the tests are spotting a difference between the video tracks.

sbraz commented 4 years ago

You can do

       self.assertEqual(with_file.tracks[1].to_data(), with_filename.tracks[1].to_data())

and pytest will tell you what is different, it's more verbose with dicts. Maybe some missing attributes (weird that it only happens on Windows, we'll have to ask Jérôme).

JulienPalard commented 4 years ago

Looks like in one side we have:

         'encoded_library_name': 'x264',
         'encoded_library_version': 'core 144 r11 40bb568',
         'encoding_settings': 'cabac=1 / ref=3 / deblock=1:0:0 / analyse=0x3:0x113 / '
                              'me=hex / subme=7 / psy=1 / psy_rd=1.00:0.00 / '
                              'mixed_ref=1 / me_range=16 / chroma_me=1 / trellis=1 / '
                              '8x8dct=1 / cqm=0 / deadzone=21,11 / fast_pskip=1 / '
                              'chroma_qp_offset=-2 / threads=72 / lookahead_threads=8 '
                              '/ sliced_threads=0 / nr=0 / decimate=1 / interlaced=0 / '
                              'bluray_compat=0 / stitchable=1 / constrained_intra=0 / '
                              'bframes=3 / b_pyramid=2 / b_adapt=1 / b_bias=0 / '
                              'direct=1 / weightb=1 / open_gop=0 / weightp=2 / '
                              'keyint=infinite / keyint_min=23 / scenecut=40 / '
                              'intra_refresh=0 / rc_lookahead=40 / rc=crf / mbtree=1 / '
                              'crf=20.0 / qcomp=0.60 / qpmin=5 / qpmax=69 / qpstep=4 / '
                              'vbv_maxrate=4950 / vbv_bufsize=13500 / crf_max=0.0 / '
                              'nal_hrd=none / filler=0 / ip_ratio=1.40 / aq=1:1.00',

but not on the other side, only on win32.

JulienPalard commented 4 years ago

I'll look at it later™, maybe trying on a VM, as I don't have any Windows at home to test, let alone a 32 bits one.

sbraz commented 4 years ago

@JulienPalard can you please try updating to the latest MI version in AppVeyor and Travis configs to see if that changes anything?

sbraz commented 4 years ago

@JulienPalard can you try rebasing on master to see if the latest MediaInfo update fixes the Win32 problem? This PR adds really nice features, I might make a Win32 VM at some point if we still need to debug this weird difference.

JulienPalard commented 4 years ago

Rebased and took a fresh view on it. It was simply a bad type (size_t instead of ulonglong), and boom it works on Windows 32 bits.

(So it it the day I did Python 2.6 compat (on another project) and Win 32 compat, I have to sleep now).

JulienPalard commented 4 years ago

There's a little overhead in jumping back and forth in the Python interpreter, but nothing that compares to my first "big read" implementation taking 10 s :P

With a 1.4GB file it take around 16ms by name and 29 ms by giving a file-like object, and with a 5MB file it takes around 2 ms by name and 8 ms by giving a file-like object. I did not took the time to fine-tune my system for benchmarks, I just had to verify the values were not around 10 s one last time...

sbraz commented 4 years ago

Wow, well done! Thanks a lot for your work, I'll look into merging this. I might add more tests (comparing other samples) just in case.

JulienPalard commented 4 years ago

Added some tests using files from https://github.com/mathiasbynens/small.

By searching if by any chance you used any parametrization thing in unittest (which I rarely use :p), I found you use pytest's parametrization, so I used it too, hope it's OK.

sbraz commented 4 years ago

I had started something similar:

# after defining test_files at the beginning of the file
@pytest.mark.parametrize("test_file", test_files)
def test_filelike_returns_the_same(test_file):
    filename = os.path.join(data_dir, test_file)
    mi_from_filename = MediaInfo.parse(filename)
    with open(filename, "rb") as f:
        mi_from_file = MediaInfo.parse(f)
    for i, t in enumerate(mi_from_filename.tracks):
        # The General track will differ, typically not giving the filename.
        if t.track_type != "General":
            assert mi_from_file.tracks[i] == mi_from_filename.tracks[i]

I will add the examples you provided but I'll keep the explicit list of files, I don't feel like using unpredictable wildcards.

Have you seen the 3 comments I added to your code?

JulienPalard commented 4 years ago

I had started something similar:

# after defining test_files at the beginning of the file
@pytest.mark.parametrize("test_file", test_files)
def test_filelike_returns_the_same(test_file):
    filename = os.path.join(data_dir, test_file)
    mi_from_filename = MediaInfo.parse(filename)
    with open(filename, "rb") as f:
        mi_from_file = MediaInfo.parse(f)
    for i, t in enumerate(mi_from_filename.tracks):
        # The General track will differ, typically not giving the filename.
        if t.track_type != "General":
            assert mi_from_file.tracks[i] == mi_from_filename.tracks[i]

This won't raise if len(mi_from_file) > len(mi_from_filename), which should not happen so I don't really care, both are good.

I don't feel like using unpredictable wildcards.

As you wish.

Have you seen the 3 comments I added to your code?

I see 4 resolved reviews from July 3, and nothing more, did I missed something?

sbraz commented 4 years ago

This won't raise if len(mi_from_file) > len(mi_from_filename), which should not happen so I don't really care, both are good.

I can make it raise in that case too, good point.

sbraz commented 4 years ago

@JulienPalard I've updated my changes: https://github.com/sbraz/pymediainfo/compare/file-like?expand=1

I think this is more or less ready to merge. I see you changed the exception handling to: if "b" not in getattr(filename, "mode", "b"):.

In which cases doesn't filename have a mode attribute? Old Pythons?

JulienPalard commented 4 years ago

In which cases doesn't filename have a mode attribute? Old Pythons?

I had in mind file-like objects that are not actually open files (io.BytesIO, ...), as long as those objects do expose what's absolutely needed (seek, tell, read (all accepting the right paramters and returning the right type)) it should work. io.BytesIO typically don't have a .mode.

I've updated my changes

Looks good to me!

(Do you know you can even git remote add JulienPalard git@github.com:JulienPalard/pymediainfo, then push to my branch, so your commit appear in this PR? A bit intrusive maybe, but I find it usefull from times to times)

sbraz commented 4 years ago

I had in mind file-like objects that are not actually open files (io.BytesIO, ...), as long as those objects do expose what's absolutely needed (seek, tell, read (all accepting the right paramters and returning the right type)) it should work. io.BytesIO typically don't have a .mode.

I'm adding a comment to mention that.

(Do you know you can even git remote add JulienPalard git@github.com:JulienPalard/pymediainfo, then push to my branch, so your commit appear in this PR? A bit intrusive maybe, but I find it usefull from times to times)

I used git fetch origin pull/96/head:pr_filelike to fetch your PR but I didn't want to push to it since it would make it harder to compare your changes to mine.

I knew I could push to someone's PR, I just didn't remember how ^^

Does GitHub always allow that or only if you select "Allow edits from maintainers"?

JulienPalard commented 4 years ago

I didn't want to push to it since it would make it harder to compare your changes to mine.

Make sense. When I do it, I don't alter commits, I just add commits which also make the comparison easy.

Does GitHub always allow that or only if you select "Allow edits from maintainers"?

I think it's related to this Allow edits but never checked :p

Don't hesitate if you need anything else on this PR.

JulienPalard commented 4 years ago

(Maybe, if you accept this PR, don't merge it, but squash-and-merge it, as a commit added a mp4 file then another commit removed it, we don't need a useless mp4 in the git history (ok it's a small one but yet)).

sbraz commented 3 years ago

Thanks a lot for this Julien :)

sbraz / pymediainfo

Also accept file-like objects #96