sbraz / pymediainfo

A Python wrapper around the MediaInfo library
https://pymediainfo.readthedocs.org/

xml reading is outdated #90

Closed srirams closed 4 years ago

srirams commented 4 years ago

it seems the new xml format includes a namespace as well....

https://github.com/sbraz/pymediainfo/blob/7eccc4b15a9d6a7e1ec29b1e9d9d3195bbf4d852/pymediainfo/__init__.py#L158

changing the above to:

        ns = '{https://mediaarea.net/mediainfo}'
        if xml_dom.tag == "File":
            xpath = "track"
        elif xml_dom.tag == f"{ns}MediaInfo":
            ET.register_namespace('', ns)
            xpath = f"{ns}media/{ns}track"
        else:
            xpath = "File/track"

https://github.com/sbraz/pymediainfo/blob/7eccc4b15a9d6a7e1ec29b1e9d9d3195bbf4d852/pymediainfo/__init__.py#L67

adding:

             ns = '{https://mediaarea.net/mediainfo}'
             if node_name.startswith(ns):
                 node_name = node_name[len(ns):]

But IMHO the better path may be to drop the XML output and use JSON instead. I ran into this problem because I was getting an error parsing a file (xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 29, column 11) caused by some encoding issues, even though the XML seems to be fine. As a workaround, I was reading the XML, converting it to ASCII and feeding it back in to MediaInfo.
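For readers who want to try the namespace handling described above outside of pymediainfo, here is a minimal standalone sketch. The sample document, the `ref` value and the helper names `strip_ns` and `track_formats` are made up for illustration; only the namespace URI comes from the thread. Note that `ET.register_namespace` only influences how ElementTree serializes output, so the sketch does not need it for reading.

```python
import xml.etree.ElementTree as ET

# Namespace URI used by MediaInfo's newer XML output (taken from the thread).
NS_URI = "https://mediaarea.net/mediainfo"

# Hypothetical, heavily shortened sample document.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<MediaInfo xmlns="https://mediaarea.net/mediainfo" version="2.0">
<media ref="example.wmv">
<track type="General"><Format>Windows Media</Format></track>
<track type="Video"><Format>VC-1</Format></track>
</media>
</MediaInfo>"""


def strip_ns(tag):
    """Turn '{uri}name' into 'name'; leave un-namespaced tags alone."""
    return tag.split("}", 1)[1] if tag.startswith("{") else tag


def track_formats(xml_bytes):
    """Return (track type, Format text) pairs from a namespaced document."""
    root = ET.fromstring(xml_bytes)
    ns = {"mi": NS_URI}  # the prefix is local to these queries
    pairs = []
    for track in root.findall("mi:media/mi:track", ns):
        fmt = track.find("mi:Format", ns)
        pairs.append((track.get("type"), fmt.text if fmt is not None else None))
    return pairs


print(strip_ns(ET.fromstring(SAMPLE).tag))
print(track_formats(SAMPLE))
```

Passing a prefix map to `find`/`findall` avoids building `{uri}tag` strings by hand, at the cost of having to repeat the prefix in every path segment.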

sbraz commented 4 years ago

Hi,

it seems the new xml format includes a namespace as well....

What new format? Is this something in MediaInfoLib's git repo? Version 20.03 works fine, can you show me a way to reproduce this? If you mean --output=XML as opposed to --output=OLDXML, I am aware of it but I can't migrate to the new output without breaking the track structure. I don't plan on changing the way tracks are formatted until the library itself drops support for OLDXML. In the meantime, there might be ways to get the kind of output you want by passing extra parameters (see my next answer).

If you're using the MediaInfo.parse method, you should not even notice that the XML output method was renamed.

use json instead

The JSON output is also quite different so I can't use it without breaking everything either. You can get a JSON str if you set output="JSON". Please check out the documentation and let me know if it helps.
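A rough sketch of what consuming the JSON string might look like. `MediaInfo.parse(..., output="JSON")` is the call mentioned above; the shape of the JSON (a top-level `media` object holding a `track` list with `@type` keys) is an assumption based on typical MediaInfo output and should be checked against your library version. The helper names and the sample string are hypothetical.

```python
import json

# Hypothetical sample mimicking the assumed shape of MediaInfo's JSON output.
SAMPLE_JSON = (
    '{"media": {"@ref": "example.wmv", "track": ['
    '{"@type": "General", "Format": "Windows Media"}, '
    '{"@type": "Video", "Format": "VC-1"}]}}'
)


def tracks_from_json(json_text):
    """Return the list of track dicts, or an empty list if none are present."""
    data = json.loads(json_text)
    media = data.get("media") or {}
    return media.get("track", [])


def tracks_from_file(path):
    """Ask pymediainfo for JSON output and extract the tracks from it."""
    # Imported lazily so the rest of the sketch runs without pymediainfo.
    from pymediainfo import MediaInfo

    return tracks_from_json(MediaInfo.parse(path, output="JSON"))


for track in tracks_from_json(SAMPLE_JSON):
    print(track.get("@type"), track.get("Format"))
```

Unlike the default return value of `MediaInfo.parse`, this yields plain dicts rather than Track objects, which is the structural difference the reply above refers to.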

because of some encoding issues

There is an encoding_errors parameter for that. Can you upload a file that exhibits the issue? I remember someone requesting that parameter but I don't remember having a test file. Such a bug should be reported to MediaInfo's upstream and I can take care of that.
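To illustrate what such a parameter can and cannot do: as far as the pymediainfo documentation describes it (treat this as an assumption and check your installed version), `encoding_errors` is handed to `str.encode()` as its `errors` argument before the XML is parsed. The sketch below uses only the standard library; `encode_for_parser` is a hypothetical stand-in, not pymediainfo's code.

```python
def encode_for_parser(text, encoding_errors="strict"):
    """Hypothetical stand-in for an encode-before-parse step."""
    return text.encode("utf-8", encoding_errors)


# A lone surrogate cannot be encoded to UTF-8 in strict mode.
lone_surrogate = "\udc80"
try:
    encode_for_parser(lone_surrogate)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With errors="replace" the unencodable character becomes "?".
replaced = encode_for_parser(lone_surrogate, "replace")

# U+FFFE is a noncharacter but still encodable, so it passes through as is.
noncharacter = encode_for_parser("\ufffe", "replace")

print(strict_failed, replaced, noncharacter)
```

Since U+FFFE survives the encoding step unchanged, an error handler of this kind would not be expected to remove it, which would be consistent with a parse error persisting even with `encoding_errors="replace"`.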

srirams commented 4 years ago

oops, I didn't realize there was an OLDXML. I was taking the --output=XML and trying to load it back in.

this is the file I'm having problems with:

<?xml version="1.0" encoding="UTF-8"?>
<MediaInfo
    xmlns="https://mediaarea.net/mediainfo"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="https://mediaarea.net/mediainfo https://mediaarea.net/mediainfo/mediainfo_2_0.xsd"
    version="2.0">
<creatingLibrary version="20.03" url="https://mediaarea.net/MediaInfo">MediaInfoLib</creatingLibrary>
<media>
<track type="General">
<VideoCount>1</VideoCount>
<AudioCount>1</AudioCount>
<FileExtension>wmv</FileExtension>
<Format>Windows Media</Format>
<FileSize>306608820</FileSize>
<Duration>473.307</Duration>
<OverallBitRate>5182409</OverallBitRate>
<OverallBitRate_Maximum>5140448</OverallBitRate_Maximum>
<FrameRate>29.970</FrameRate>
<FrameCount>14185</FrameCount>
<StreamSize>3219033</StreamSize>
<HeaderSize>1046</HeaderSize>
<DataSize>306604706</DataSize>
<Encoded_Date>UTC 2011-11-14 19:26:36.000</Encoded_Date>
<File_Created_Date>UTC 2019-03-24 01:27:12.464</File_Created_Date>
<File_Created_Date_Local>2019-03-23 20:27:12.464</File_Created_Date_Local>
<File_Modified_Date>UTC 2011-12-30 09:54:16.000</File_Modified_Date>
<File_Modified_Date_Local>2011-12-30 04:54:16.000</File_Modified_Date_Local>
<Copyright>￾꤀ 刀漀渀 䠀愀爀爀椀猀</Copyright>
</track>
<track type="Video">
<StreamOrder>0</StreamOrder>
<ID>1</ID>
<Format>VC-1</Format>
<Format_Profile>Main</Format_Profile>
<CodecID>WMV3</CodecID>
<Duration>473.307</Duration>
<BitRate>5000000</BitRate>
<Width>1920</Width>
<Height>1080</Height>
<PixelAspectRatio>1.000</PixelAspectRatio>
<DisplayAspectRatio>1.778</DisplayAspectRatio>
<FrameRate>29.970</FrameRate>
<FrameCount>14185</FrameCount>
<ColorSpace>YUV</ColorSpace>
<ChromaSubsampling>4:2:0</ChromaSubsampling>
<BitDepth>8</BitDepth>
<ScanType>Progressive</ScanType>
<Compression_Mode>Lossy</Compression_Mode>
<StreamSize>295816875</StreamSize>
<extra>
<Duration_Source>General_Duration</Duration_Source>
</extra>
</track>
<track type="Audio">
<StreamOrder>1</StreamOrder>
<ID>2</ID>
<Format>WMA</Format>
<Format_Version>2</Format_Version>
<CodecID>161</CodecID>
<Duration>473.307</Duration>
<BitRate>128000</BitRate>
<Channels>2</Channels>
<SamplingRate>44100</SamplingRate>
<SamplingCount>20872839</SamplingCount>
<BitDepth>16</BitDepth>
<StreamSize>7572912</StreamSize>
<StreamSize_Proportion>0.02470</StreamSize_Proportion>
<extra>
<Duration_Source>General_Duration</Duration_Source>
</extra>
</track>
</media>
</MediaInfo>

edit: I'm reading the file with:

xml = pymediainfo.MediaInfo.parse(file_path, encoding_errors="replace", output="OLDXML")

sbraz commented 4 years ago

I was taking the --output=XML and trying to load it back in.

Any reason why you were not using the built-in parse method? It's faster and more portable (no need for the mediainfo binary, you just need the library and it is bundled in the Windows/OSX wheels).

this is the file I'm having problems with:

Ah, I see. I need the file itself to create an issue though. Can you attach it please (just the first few KiB should be enough)? I wonder if JSON output is broken as well.

srirams commented 4 years ago

Sorry, I should have been clearer. I'm using the built-in parse and storing the JSON from to_json. When I ran into the problem with this file, I tried to work around it by using output="XML" from parse, converting to ASCII and loading it back in. Unfortunately I didn't realize I should have used "OLDXML" instead :).

I've included a sample file below:

sample.zip

I think it's xml.etree.ElementTree that doesn't like the Unicode in the <Copyright> field, although it seems to be valid XML.

sbraz commented 4 years ago

Apparently it is invalid XML:

$ mediainfo --output=OLDXML sample.wmv  | xmllint --format -
-:12: parser error : Char 0xFFFE out of allowed range
<Copyright>￾꤀ 刀漀渀 䠀愀爀爀椀猀</Copyright>
           ^
-:12: parser error : PCDATA invalid Char value 65534
<Copyright>￾꤀ 刀漀渀 䠀愀爀爀椀猀</Copyright>
           ^

The first character is not a valid Unicode character according to Wikipedia:

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a "non character" that should never appear in the text.
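The xmllint errors above follow from the XML 1.0 `Char` production, which excludes U+FFFE and U+FFFF. Below is a minimal sketch of how one might locate or replace such characters before handing text to `xml.etree.ElementTree`. The helper names are hypothetical and not part of pymediainfo, and replacing characters discards information, so this is a workaround rather than a fix.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def find_invalid_xml_chars(text):
    """Return (index, code point) pairs for characters XML 1.0 forbids."""
    return [(m.start(), ord(m.group())) for m in _INVALID_XML_CHARS.finditer(text)]


def replace_invalid_xml_chars(text, replacement="\ufffd"):
    """Substitute forbidden characters, by default with U+FFFD."""
    return _INVALID_XML_CHARS.sub(replacement, text)


# Shortened version of the problematic element from the thread.
snippet = "<Copyright>\ufffe\ua900</Copyright>"

print(find_invalid_xml_chars(snippet))
cleaned = replace_invalid_xml_chars(snippet)
print(ET.fromstring(cleaned).text)
```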

This can mean two things:

Apparently other MediaInfo-based libraries have had similar issues in the past: https://github.com/mhor/php-mediainfo/issues/92

Do you know if there is any reason why this file's copyright contains Chinese characters?

kartouch01 commented 4 years ago

I don't know if this is the same problem or not, but in my environment the XML output is missing some parameters such as the codec value, even though they are shown in the text format with the -f option:


~/Music$ mediainfo 1335588995-tennis_prog_pal_h264.ts -f
General
Count : 330
Count of stream of this kind : 1
Kind of stream : General
Kind of stream : General
Stream identifier : 0
ID : 1
ID : 1 (0x1)
Count of video streams : 1
Count of audio streams : 1
Video_Format_List : AVC
Video_Format_WithHint_List : AVC
Codecs Video : AVC
Audio_Format_List : MPEG Audio
Audio_Format_WithHint_List : MPEG Audio
Audio codecs : MPEG-1 Audio layer 2
Complete name : 1335588995-tennis_prog_pal_h264.ts
File name : 1335588995-tennis_prog_pal_h264
File extension : ts
Format : MPEG-TS
Format : MPEG-TS
Format/Extensions usually used : ts m2t m2s m4t m4s tmf ts tp trp ty
Commercial name : MPEG-TS
Internet media type : video/MP2T
Codec : MPEG-TS
Codec : MPEG-TS
Codec/Extensions usually used : ts m2t m2s m4t m4s tmf ts tp trp ty
File size : 9860224
File size : 9.40 MiB
File size : 9 MiB
File size : 9.4 MiB
File size : 9.40 MiB
File size : 9.403 MiB
Duration : 35766.031250
Duration : 35 s 766 ms
Duration : 35 s 766 ms
Duration : 35 s 766 ms
Duration : 00:00:35.766
Duration : 00:00:34:01
Duration : 00:00:35.766 (00:00:34:01)
Overall bit rate mode : CBR
Overall bit rate mode : Constant
Overall bit rate : 2204864
Overall bit rate : 2 205 kb/s
Frame rate : 25.000
Frame rate : 25.000 FPS
Frame count : 851
Stream size : 943626
Stream size : 922 KiB (10%)
Stream size : 922 KiB
Stream size : 922 KiB
Stream size : 922 KiB
Stream size : 921.5 KiB
Stream size : 922 KiB (10%)
Proportion of this stream : 0.09570
File last modification date : UTC 2014-10-07 12:25:25
File last modification date (local) : 2014-10-07 14:25:25
OverallBitRate_Precision_Min : 2204833
OverallBitRate_Precision_Max : 2204895

Video
Count : 342
Count of stream of this kind : 1
Kind of stream : Video
Kind of stream : Video
Stream identifier : 0
StreamOrder : 0-0
ID : 289
ID : 289 (0x121)
Menu ID : 1
Menu ID : 1 (0x1)
Format : AVC
Format/Info : Advanced Video Codec
Format/Url : http://developers.videolan.org/x264.html
Commercial name : AVC
Format profile : High@L3
Format settings : 1 Ref Frames
Format settings, CABAC : No
Format settings, CABAC : No
Format settings, ReFrames : 1
Format settings, ReFrames : 1 frame
Internet media type : video/H264
Codec ID : 27
Codec : AVC
Codec : AVC
Codec/Family : AVC
Codec/Info : Advanced Video Codec
Codec/Url : http://developers.videolan.org/x264.html
Codec profile : High@L3
Codec settings : 1 Ref Frames
Codec settings, CABAC : No
Codec_Settings_RefFrames : 1
Duration : 34040
Duration : 34 s 40 ms
Duration : 34 s 40 ms
Duration : 34 s 40 ms
Duration : 00:00:34.040
Duration : 00:00:34:01
Duration : 00:00:34.040 (00:00:34:01)
Bit rate mode : CBR
Bit rate mode : Constant
Bit rate : 2000000
Bit rate : 2 000 kb/s
Width : 720
Width : 720 pixels
Height : 576
Height : 576 pixels
Sampled_Width : 720
Sampled_Height : 576
Pixel aspect ratio : 1.067
Display aspect ratio : 1.333
Display aspect ratio : 4:3
Frame rate : 25.000
Frame rate : 25.000 FPS
Frame count : 851
Standard : PAL
Resolution : 8
Resolution : 8 bits
Colorimetry : 4:2:0
Color space : YUV
Chroma subsampling : 4:2:0
Chroma subsampling : 4:2:0
Bit depth : 8
Bit depth : 8 bits
Scan type : Progressive
Scan type : Progressive
Interlacement : PPF
Interlacement : Progressive
Bits/(Pixel*Frame) : 0.193
Delay : 2104.067
Delay : 2 s 104 ms
Delay : 2 s 104 ms
Delay : 2 s 104 ms
Delay : 00:00:02.104
Delay, origin : Container
Delay, origin : Container
Stream size : 8780662
Stream size : 8.37 MiB (89%)
Stream size : 8 MiB
Stream size : 8.4 MiB
Stream size : 8.37 MiB
Stream size : 8.374 MiB
Stream size : 8.37 MiB (89%)
Proportion of this stream : 0.89051
Buffer size : 4000768

Audio
Count : 275
Count of stream of this kind : 1
Kind of stream : Audio
Kind of stream : Audio
Stream identifier : 0
StreamOrder : 0-1
ID : 297
ID : 297 (0x129)
Menu ID : 1
Menu ID : 1 (0x1)
Format : MPEG Audio
Commercial name : MPEG Audio
Format version : Version 1
Format profile : Layer 2
Internet media type : audio/mpeg
Codec ID : 3
Codec : MPA1L2
Codec : MPEG-1 Audio layer 2
Duration : 33984
Duration : 33 s 984 ms
Duration : 33 s 984 ms
Duration : 33 s 984 ms
Duration : 00:00:33.984
Duration : 00:00:33:20
Duration : 00:00:33.984 (00:00:33:20)
Bit rate mode : CBR
Bit rate mode : Constant
Bit rate : 32000
Bit rate : 32.0 kb/s
Channel(s) : 1
Channel(s) : 1 channel
Samples per frame : 1152
Sampling rate : 32000
Sampling rate : 32.0 kHz
Samples count : 1087488
Frame rate : 27.778
Frame rate : 27.778 FPS (1152 SPF)
Frame count : 944
Compression mode : Lossy
Compression mode : Lossy
Delay : 2000.378
Delay : 2 s 0 ms
Delay : 2 s 0 ms
Delay : 2 s 0 ms
Delay : 00:00:02.000
Delay, origin : Container
Delay, origin : Container
Delay relative to video : -104
Delay relative to video : -104 ms
Delay relative to video : -104 ms
Delay relative to video : -104 ms
Delay relative to video : -00:00:00.104
Video0 delay : -104
Video0 delay : -104 ms
Video0 delay : -104 ms
Video0 delay : -104 ms
Video0 delay : -00:00:00.104
Stream size : 135936
Stream size : 133 KiB (1%)
Stream size : 133 KiB
Stream size : 133 KiB
Stream size : 133 KiB
Stream size : 132.8 KiB
Stream size : 133 KiB (1%)
Proportion of this stream : 0.01379


XML Output:

mediainfo --Output=XML 1335588995-tennis_prog_pal_h264.ts -f

Mediainfo (1).txt

Is this related to XML?

sbraz commented 4 years ago

codec value

You must be running an old version of the library for your mediainfo CLI. Codec was replaced with Format a long time ago.

Also the attached MediaInfo XML output is not for 1335588995-tennis_prog_pal_h264.ts so I can't really compare it to the non-XML version. I doubt there is a bug here, but if there is, you need to report it to MediaInfo directly.

JeromeMartinez commented 4 years ago

Apparently it is invalid XML [...] BOM [...]

I am fixing that, but the fix will only be in the newest version of the lib.

sbraz commented 4 years ago

Hi @JeromeMartinez, thanks! Will this just remove the BOM or also change the endianness? I still don't know if those Chinese characters are valid or if they are some kind of glitch due to the wrong endianness being used.

JeromeMartinez commented 4 years ago

Will this just remove the BOM or also change the endianness?

Well, I read too quickly and didn't catch the wrong byte order (UTF-16BE instead of the expected UTF-16LE in WM files).

I added a commit for reordering bytes if such an issue appears. For the example, it makes more sense (Copyright becomes "© Ron Harris").
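The effect of the byte-order mix-up can be reproduced with the standard library alone. The sketch below is only an illustration of the encoding arithmetic, not the MediaInfoLib fix; `swap_utf16_byte_order` is a hypothetical helper, and the assumption is that the intended value starts with a BOM followed by "© Ron Harris".

```python
BOM = "\ufeff"


def swap_utf16_byte_order(text):
    """Reinterpret text as if each UTF-16 code unit had its two bytes swapped.

    Raises UnicodeDecodeError if a swapped unit lands in the surrogate range.
    """
    return text.encode("utf-16-be").decode("utf-16-le")


intended = BOM + "© Ron Harris"

# Reading the code units with the wrong endianness turns the BOM (U+FEFF)
# into the noncharacter U+FFFE and the Latin letters into CJK ideographs.
garbled = swap_utf16_byte_order(intended)
print([hex(ord(ch)) for ch in garbled[:4]])

# Swapping again restores the text; the leading BOM is then dropped.
repaired = swap_utf16_byte_order(garbled).lstrip(BOM)
print(repaired)
```

Here "©" (U+00A9) becomes U+A900 and "R" (U+0052) becomes U+5200, which matches the characters seen in the Copyright field earlier in the thread.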

sbraz commented 4 years ago

For the example, it makes more sense (Copyright becomes "© Ron Harris").

Thanks, it looks more sensible indeed!

sbraz commented 4 years ago

Closing this since it is a MediaInfo issue.