Closed JulienPalard closed 3 years ago
in my usage I work on HTTP response bodies
Not sure I understand well, but MediaInfo supports URLs well; what is missing natively?
A few details are different between parsing from a file and parsing from a buffer, besides the filename
You didn't implement the full "by buffer" API. See the upstream example, especially the "GoTo" part.
@JeromeMartinez does that also work with streams of an unknown size?
does that also work with streams of an unknown size?
As much as possible, e.g. for MPEG-TS there will be no seek to the end (it is unknown...), so no duration (computed from end timestamps at the end of the file).
Note: unknown file size has not been extensively tested and may have weird behaviors; I would consider that a bug, but a low-priority one.
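The "by buffer" loop with its "GoTo" part can be sketched as follows. The method names mirror the MediaInfoDLL exports (MediaInfo_Open_Buffer_Init, MediaInfo_Open_Buffer_Continue, MediaInfo_Open_Buffer_Continue_GoTo_Get, MediaInfo_Open_Buffer_Finalize); the status bit 0x08 ("finalized") and the -1 "no seek requested" convention follow the MediaInfoLib buffer API. This is a schematic simulation with a toy stand-in parser, not a real ctypes binding:

```python
CHUNK = 64 * 1024

def parse_by_buffer(mi, stream, size):
    # Feed the stream to the parser chunk by chunk, honoring "GoTo"
    # seek requests, as in the upstream "by buffer" example.
    mi.open_buffer_init(size, 0)
    while True:
        data = stream.read(CHUNK)
        status = mi.open_buffer_continue(data)
        if status & 0x08:  # bit 3: the parser is done ("finalized")
            break
        goto = mi.open_buffer_continue_goto_get()
        if goto != -1:  # the parser asks to jump, e.g. to the tail for duration
            stream.seek(goto)
            mi.open_buffer_init(size, goto)
        elif not data:  # EOF and no pending seek request
            break
    mi.open_buffer_finalize()

class FakeParser:
    """Toy stand-in for libmediainfo: jumps once to the last 16 bytes, then finishes."""
    def __init__(self, size):
        self.size = size
        self.offset = 0
        self.jumped = False
        self.finalized = False
    def open_buffer_init(self, size, offset):
        self.offset = offset
    def open_buffer_continue(self, data):
        self.offset += len(data)
        done = self.jumped and self.offset >= self.size
        return 0x08 if done else 0x00
    def open_buffer_continue_goto_get(self):
        if not self.jumped:
            self.jumped = True
            return self.size - 16  # simulate "seek to the end" for a footer
        return -1
    def open_buffer_finalize(self):
        self.finalized = True
```

Note that the width of the GoTo value matters: as found later in this thread, declaring it as size_t instead of ulonglong breaks on 32-bit Windows.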
in my usage I work on HTTP response bodies
Not sure I understand well, but MediaInfo supports URLs well; what is missing natively?
I'm using pytest, I'm testing an asyncio library working on media files, ... some code may say more:
def width_height(media_bytes):  # My other PRs are to try to write this more cleanly
    with NamedTemporaryFile(suffix=".jpg") as f:
        f.write(media_bytes)
        f.seek(0)
        info = MediaInfo.parse(f.name)
        assert info.tracks
        return info.tracks[1].width, info.tracks[1].height
async def test_image_resized(cli, tmpdir):
    resp = await cli.get("/images/resized/lenna.jpg")
    width, height = width_height(await resp.read())
    assert resp.status == 200
    assert width == 1020
    assert height > 10  # depends on the width, only the width is fixed in /resized/
so no, I can't use your embedded curl library to do so; I'd even bet it's not crafting an actual HTTP request (as text, as we understand it), it's probably going directly to the server code. But thanks for caring :)
Performance comparisons on a 1.4GB file: with a full .read() it takes around 10s; with MediaInfo_Open_Buffer it is way faster (don't know why). So yes, I'll go for the buffer protocol. Still had no time to check the speed parameter, have to eat now.
Given a filename, demo.py gives me 193 lines, versus 78 lines if I modify it to give an opened file, so please don't merge, I need to study this first.
(Also I bet tests may not pass on all tested Python versions, I'm not really fluent in Python 2.7)
No worries, I will look into the Python 2 failures later if there are any.
Even with parse_speed=1, which is the maximum, I'm getting way less data with parse_speed=1 (5s to run). @JeromeMartinez is this a known behavior? I'm having a hard time finding documentation about it, any link?
As expected I'm having a Python 2.7 issue, where a str can't be converted to a str (the irony), which I can't reproduce on my machine at the moment.
The error is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 42: ordinal not in range(128)
Strange fact: \xe9 is latin1, but the file is in UTF-8, so it should be \xc3\xa9
But even with the correct UTF-8 sequence it would fail, as the ascii codec requires values to be in range(128).
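The byte sequences involved can be checked directly in Python 3 (same encodings, without the Python 2 str/unicode confusion):

```python
# "é" is U+00E9: one byte 0xE9 in latin-1, two bytes 0xC3 0xA9 in UTF-8.
assert "accentué".encode("latin-1") == b"accentu\xe9"
assert "accentué".encode("utf-8") == b"accentu\xc3\xa9"

# The ASCII codec rejects it entirely: values must be in range(128).
try:
    "accentué".encode("ascii")
except UnicodeEncodeError as exc:
    assert "ordinal not in range(128)" in str(exc)
```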
@JeromeMartinez is it a known behavior?
The "filename" parsing internally uses the "opened file" parsing. I provide support for my own example only, so please demonstrate an issue with the "by buffer" example I linked.
with parse_speed=1 (5s to run)
Please use this option only for what it is intended for (= parsing the whole file for having more info), it is out of scope here.
Please use this option only for what it is intended for (= parsing the whole file for having more info), it is out of scope here.
That's my fault, I thought maybe the default parse speed for buffers was different :)
I provide support for my own example only, so please demonstrate an issue with the "by buffer" example I linked.
I can understand, and obviously I found my error while producing a minimal reproducible example, sorry for the noise.
It's fixed now, I forgot to call the Finalizer.
@sbraz There's still an issue with Python 2.7. Looks like AppVeyor may be lying: it may not be filename = 'C:\projects\pymediainfo\tests\data\accentué.txt' that can't be converted to a string using str(filename), which makes no sense; it's probably filename = u'C:\projects\pymediainfo\tests\data\accentué.txt', but that's only guessing, and I don't see why filename could be transformed from a str object to a unicode object. Can't reproduce it locally.
looks like appveyor may be lying
Not sure why GitHub doesn't show Travis builds (I'll look into it) but the issue is not related to AppVeyor, see https://travis-ci.org/github/sbraz/pymediainfo/
I can reproduce it anywhere with Python 2. If you can't find what causes it, I'll look into it later this week.
OK, I can reproduce it using Python 2.7 (yes, I finally created a venv), and yes, pytest is lying in saying it's filename = '/home/mdk/clones/pymediainfo/tests/data/accentué.txt'; in fact it's u'/home/mdk/clones/pymediainfo/tests/data/accentu\xe9.txt' (according to pdb). I hope I can fix this.
I would have to check what the old code did (I'm not a Python 2 expert either, can't wait for it to really die :smile:), but this works:
diff --git a/pymediainfo/__init__.py b/pymediainfo/__init__.py
index 0d95961..75db2b6 100644
--- a/pymediainfo/__init__.py
+++ b/pymediainfo/__init__.py
@@ -163,6 +163,8 @@ class MediaInfo(object):
     def _normalize_filename(filename):
         if hasattr(os, "PathLike") and isinstance(filename, os.PathLike):
             return os.fspath(filename)
+        elif isinstance(filename, unicode):
+            return filename
         return str(filename)
 
     @staticmethod
     def _get_library(library_file=None):
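On Python 3 the helper reduces to the os.PathLike branch plus str(); here is a minimal standalone sketch (a hypothetical free function mirroring _normalize_filename, without the Python 2 unicode branch):

```python
import os
from pathlib import Path

def normalize_filename(filename):
    # Accept os.PathLike objects (Python 3.6+, e.g. pathlib.Path) via
    # os.fspath(), and fall back to str() for anything else.
    if hasattr(os, "PathLike") and isinstance(filename, os.PathLike):
        return os.fspath(filename)
    return str(filename)
```

The hasattr check keeps the function importable on interpreters that predate os.PathLike.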
Oh, and Travis is back.
As for pytest "lying", this is because I enabled unicode_literals to make the code easier to write. Otherwise I have to use the u prefix for non-ASCII strings.
+ elif isinstance(filename, unicode):
+     return filename
Yes, I bet it works too; I fell back on the old code for safety, which just returned the unmodified parameter in this case.
I'm not a Python 2 expert either, can't wait for it to really die
Python 2 is officially EOL (no bug fixes, no security fixes), and according to the logs on https://docs.python.org it is ~dead; I bet most of the traffic is from bots and people like us maintaining compatibility. I would advise dropping Python 2 compatibility in your next major release, like many projects did; this still allows Python 2 users to use old releases.
Yeah I know upstream is dead but there are still distros shipping it and I'm sure there are people using it. I'm not going to go as far as to wait until Red Hat 7 is EOL but I want to wait a little while (maybe 2021 or when Travis or AppVeyor remove it).
The fact that someone asked for Python 2.6 support in 2016 makes me want to be cautious; I don't want to drop 2.7 too early and see a pull request to add it back. It's not such a huge burden to maintain in general :)
Damned, tests no longer pass on win32 and I don't have a Windows machine at home to test it :,-(
Looks like there's a difference between:
python3 demo.py tests/data/sample.mp4
and the same demo.py
with this patch:
@@ -21,7 +21,8 @@ def print_frame(text):
 def process(fname):
-    media_info = MediaInfo.parse(fname)
+    with open(fname, 'rb') as f:
+        media_info = MediaInfo.parse(f)
     for track in media_info.tracks:
         print_frame(track.track_type)
         pprint(track.to_data())
There's an expected diff in the General section (file related), but the tests are spotting a difference between the video tracks.
You can do self.assertEqual(with_file.tracks[1].to_data(), with_filename.tracks[1].to_data()) and pytest will tell you what is different; it's more verbose with dicts. Maybe some missing attributes (weird that it only happens on Windows, we'll have to ask Jérôme).
Looks like on one side we have:
'encoded_library_name': 'x264',
'encoded_library_version': 'core 144 r11 40bb568',
'encoding_settings': 'cabac=1 / ref=3 / deblock=1:0:0 / analyse=0x3:0x113 / '
'me=hex / subme=7 / psy=1 / psy_rd=1.00:0.00 / '
'mixed_ref=1 / me_range=16 / chroma_me=1 / trellis=1 / '
'8x8dct=1 / cqm=0 / deadzone=21,11 / fast_pskip=1 / '
'chroma_qp_offset=-2 / threads=72 / lookahead_threads=8 '
'/ sliced_threads=0 / nr=0 / decimate=1 / interlaced=0 / '
'bluray_compat=0 / stitchable=1 / constrained_intra=0 / '
'bframes=3 / b_pyramid=2 / b_adapt=1 / b_bias=0 / '
'direct=1 / weightb=1 / open_gop=0 / weightp=2 / '
'keyint=infinite / keyint_min=23 / scenecut=40 / '
'intra_refresh=0 / rc_lookahead=40 / rc=crf / mbtree=1 / '
'crf=20.0 / qcomp=0.60 / qpmin=5 / qpmax=69 / qpstep=4 / '
'vbv_maxrate=4950 / vbv_bufsize=13500 / crf_max=0.0 / '
'nal_hrd=none / filler=0 / ip_ratio=1.40 / aq=1:1.00',
but not on the other side, only on win32.
I'll look at it later™, maybe trying on a VM, as I don't have any Windows at home to test, let alone a 32 bits one.
@JulienPalard can you please try updating to the latest MI version in AppVeyor and Travis configs to see if that changes anything?
@JulienPalard can you try rebasing on master to see if the latest MediaInfo update fixes the Win32 problem? This PR adds really nice features, I might make a Win32 VM at some point if we still need to debug this weird difference.
Rebased and took a fresh look at it. It was simply a bad type (size_t instead of ulonglong), and boom, it works on Windows 32 bits.
(So it is the day I did Python 2.6 compat (on another project) and Win32 compat; I have to sleep now.)
There's a little overhead in jumping back and forth into the Python interpreter, but nothing that compares to my first "big read" implementation taking 10 s :P
With a 1.4 GB file it takes around 16 ms by name and 29 ms by giving a file-like object, and with a 5 MB file it takes around 2 ms by name and 8 ms by giving a file-like object. I did not take the time to fine-tune my system for benchmarks, I just had to verify the values were not around 10 s one last time...
Wow, well done! Thanks a lot for your work, I'll look into merging this. I might add more tests (comparing other samples) just in case.
Added some tests using files from https://github.com/mathiasbynens/small.
While searching whether by any chance you used some parametrization feature of unittest (which I rarely use :p), I found you use pytest's parametrization, so I used it too, hope that's OK.
I had started something similar:
# after defining test_files at the beginning of the file
@pytest.mark.parametrize("test_file", test_files)
def test_filelike_returns_the_same(test_file):
    filename = os.path.join(data_dir, test_file)
    mi_from_filename = MediaInfo.parse(filename)
    with open(filename, "rb") as f:
        mi_from_file = MediaInfo.parse(f)
    for i, t in enumerate(mi_from_filename.tracks):
        # The General track will differ, typically not giving the filename.
        if t.track_type != "General":
            assert mi_from_file.tracks[i] == mi_from_filename.tracks[i]
I will add the examples you provided but I'll keep the explicit list of files, I don't feel like using unpredictable wildcards.
Have you seen the 3 comments I added to your code?
I had started something similar:
# after defining test_files at the beginning of the file
@pytest.mark.parametrize("test_file", test_files)
def test_filelike_returns_the_same(test_file):
    filename = os.path.join(data_dir, test_file)
    mi_from_filename = MediaInfo.parse(filename)
    with open(filename, "rb") as f:
        mi_from_file = MediaInfo.parse(f)
    for i, t in enumerate(mi_from_filename.tracks):
        # The General track will differ, typically not giving the filename.
        if t.track_type != "General":
            assert mi_from_file.tracks[i] == mi_from_filename.tracks[i]
This won't raise if len(mi_from_file) > len(mi_from_filename), which should not happen so I don't really care, both are good.
I don't feel like using unpredictable wildcards.
As you wish.
Have you seen the 3 comments I added to your code?
I see 4 resolved reviews from July 3, and nothing more, did I miss something?
This won't raise if len(mi_from_file) > len(mi_from_filename), which should not happen so I don't really care, both are good.
I can make it raise in that case too, good point.
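One way to make it raise on extra tracks too is to compare lengths first, then compare pairwise. A sketch with a hypothetical assert_same_tracks helper (track objects only need a track_type attribute and equality):

```python
def assert_same_tracks(tracks_a, tracks_b):
    # Comparing lengths first makes the test fail when one side has
    # extra tracks; the pairwise comparison then skips the General
    # track, whose file-related fields legitimately differ.
    assert len(tracks_a) == len(tracks_b)
    for t_a, t_b in zip(tracks_a, tracks_b):
        if t_a.track_type != "General":
            assert t_a == t_b
```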
@JulienPalard I've updated my changes: https://github.com/sbraz/pymediainfo/compare/file-like?expand=1
I think this is more or less ready to merge. I see you changed the exception handling to if "b" not in getattr(filename, "mode", "b").
In which cases doesn't filename have a mode attribute? Old Pythons?
In which cases doesn't filename have a mode attribute? Old Pythons?
I had in mind file-like objects that are not actually open files (io.BytesIO, ...); as long as those objects expose what's absolutely needed (seek, tell, read, all accepting the right parameters and returning the right type) it should work. io.BytesIO objects typically don't have a .mode.
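io.BytesIO illustrates this: it has seek/tell/read but no .mode, and the getattr default of "b" makes the binary check pass for such mode-less objects. A quick check, using only the stdlib:

```python
import io

buf = io.BytesIO(b"\x00\x01\x02\x03")
assert not hasattr(buf, "mode")

# The mode check falls back to "b" for objects without a .mode,
# so mode-less file-likes are treated as binary and accepted.
assert "b" in getattr(buf, "mode", "b")

# The duck-typed minimum: seek, tell and read behaving like a binary file.
buf.seek(2)
assert buf.tell() == 2
assert buf.read(2) == b"\x02\x03"
```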
I've updated my changes
Looks good to me!
(Do you know you can even git remote add JulienPalard git@github.com:JulienPalard/pymediainfo, then push to my branch, so your commits appear in this PR? A bit intrusive maybe, but I find it useful from time to time.)
I had in mind file-like objects that are not actually open files (io.BytesIO, ...); as long as those objects expose what's absolutely needed (seek, tell, read, all accepting the right parameters and returning the right type) it should work. io.BytesIO objects typically don't have a .mode.
I'm adding a comment to mention that.
(Do you know you can even git remote add JulienPalard git@github.com:JulienPalard/pymediainfo, then push to my branch, so your commits appear in this PR? A bit intrusive maybe, but I find it useful from time to time.)
I used git fetch origin pull/96/head:pr_filelike to fetch your PR, but I didn't want to push to it since it would make it harder to compare your changes to mine.
I knew I could push to someone's PR, I just didn't remember how ^^
Does GitHub always allow that or only if you select "Allow edits from maintainers"?
I didn't want to push to it since it would make it harder to compare your changes to mine.
Makes sense. When I do it, I don't alter commits, I just add commits, which also makes the comparison easy.
Does GitHub always allow that or only if you select "Allow edits from maintainers"?
I think it's related to that "Allow edits" checkbox, but I never checked :p
Don't hesitate if you need anything else on this PR.
(Maybe, if you accept this PR, don't merge it but squash-and-merge it, as a commit added an mp4 file and another commit removed it; we don't need a useless mp4 in the git history (OK, it's a small one, but still).)
Thanks a lot for this Julien :)
Bonjour,
Tried myself at accepting file-like objects (in my usage I work on HTTP response bodies, so I'm happy if I don't have to store them on disk).
I tried to keep the diff small, wrote a few tests, and tried to avoid backward incompatibilities, at the cost of some inconsistencies, like keeping the parameter name "filename", just in case someone once typed .parse(filename=my_file). A few details are different between parsing from a file and parsing from a buffer, besides the filename, like:
I'm not a fan of the new state of _parse_filename. It's hard to resist the temptation to guess more (between URL and Windows path, I mean), but the Zen of Python tells me to resist.
Maybe we should move the open ... pass ... (which is here to raise FileNotFoundError instead of RuntimeError) near "An error occured while opening", to avoid opening for nothing in case the file exists, and we could use the same test as MediaInfo, which is a bit hard to read for me (elses not aligned properly?) but looks like "if it contains ://, try curl, else try file", something like:

This would greatly simplify _parse_filename (mostly remove it in fact, we would just keep the conversion to string); it would also drop "should_be_openable", which I'm not proud of. Now I cross my fingers to see if the CI passes with all those Python versions :)