openpreserve / jpylyzer

JP2 (JPEG 2000 Part 1) validator and properties extractor. Jpylyzer was specifically created to check that a JP2 file really conforms to the format's specifications. Additionally jpylyzer is able to extract technical characteristics.
http://jpylyzer.openpreservation.org/

Unexpected behaviour running tests on openjpeg-data #182

Closed bitsgalore closed 2 years ago

bitsgalore commented 2 years ago

Running a modified version of the jpylyzer test corpus tests on openjpeg-data gives some weird results. See the script below (for simplicity, input is restricted to the input/conformance dir):

#! /usr/bin/env python3

import os
import glob
import pytest
from lxml import etree

from jpylyzer import config
from jpylyzer.jpylyzer import checkOneFile
from jpylyzer.jpylyzer import checkFiles

# Directory that contains this script
SCRIPT_DIR = os.path.dirname(os.path.realpath(__file__))

# Root dir of jpylyzer repo
JPYLYZER_DIR = os.path.split(os.path.split(SCRIPT_DIR)[0])[0]

# XSD file (path resolved from SCRIPT_DIR)
xsdFile = os.path.join(JPYLYZER_DIR, "xsd/jpylyzer-v-2-1.xsd")

# Directory with test files
testFilesDir = "/home/johan/openjpeg-data/input/conformance"

# All files in test files dir
testFiles = glob.glob(testFilesDir + "/**", recursive=True)
testFiles = [f for f in testFiles if os.path.isfile(f)]

# Run checkFiles on each test file; a test fails if an exception is raised
@pytest.mark.parametrize('input', testFiles)
def test_xml_onefile(input):
    config.VALIDATION_FORMAT = "jp2"
    filesIn = [input]
    checkFiles(False, True, filesIn)

Problem 1: namespace issue

The majority of files fail with the error below:

jpylyzer/jpylyzer.py:764: in checkFiles
    writeElement(xmlElement, out)
jpylyzer/jpylyzer.py:683: in writeElement
    xmlPretty = minidom.parseString(xmlOut).toprettyxml('    ')
/usr/lib/python3.8/xml/dom/minidom.py:1969: in parseString
    return expatbuilder.parseString(string)
/usr/lib/python3.8/xml/dom/expatbuilder.py:925: in parseString
    return builder.parseString(string)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <xml.dom.expatbuilder.ExpatBuilderNS object at 0x7f72ca86ca00>
string = '<file xmlns:ns0="http://www.jpeg.org/jpx/1.0/xml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http:/...ION>\n\t</ns0:EVENT>\n</ns0:CONTENT_DESCRIPTION></xmlBox><compressionRatio>1.86</compressionRatio></properties></file>'

    def parseString(self, string):
        """Parse a document from a string, returning the document node."""
        parser = self.getParser()
        try:
>           parser.Parse(string, True)
E           xml.parsers.expat.ExpatError: duplicate attribute: line 1, column 156

/usr/lib/python3.8/xml/dom/expatbuilder.py:223: ExpatError

Going by the error description, the problem is a duplicate attribute in the XML. By adding a debug statement just before the toprettyxml call, I was able to narrow this down for "file8.jp2" to a duplicate declaration of the xmlns:xsi namespace. First it is declared in the root element:

<jpylyzer xmlns="http://openpreservation.org/ns/jpylyzer/v2/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://openpreservation.org/ns/jpylyzer/v2/ http://jpylyzer.openpreservation.org/jpylyzer-v-2-1.xsd">

and then again in the file element:

<file xmlns:ns0="http://www.jpeg.org/jpx/1.0/xml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://openpreservation.org/ns/jpylyzer/v2/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://openpreservation.org/ns/jpylyzer/v2/ http://jpylyzer.openpreservation.org/jpylyzer-v-2-1.xsd">

This happens because the namespace declaration in the root element is inserted using simple text manipulation, so the XML writer that processes the xmlElement for each file is unaware of it:

        xmlHead += "xmlns:xsi=\"" + XSI_NS_STRING + "\" "
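
To illustrate the mechanism, here is a minimal sketch (not jpylyzer's actual code). It assumes an ElementTree-style writer that declares xmlns:xsi on its own as soon as an element carries an xsi: attribute, and a hypothetical text splice that adds the same declaration a second time. The element, the example.org URLs and the is_well_formed helper are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom
from xml.parsers.expat import ExpatError

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def is_well_formed(xml_text):
    """Return True if minidom can parse xml_text, False on an expat error."""
    try:
        minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False


# The writer sees an attribute in the xsi namespace and declares the
# xmlns:xsi prefix by itself when serializing.
elt = ET.Element("file")
elt.set("{%s}schemaLocation" % XSI_NS,
        "http://example.org/ns http://example.org/schema.xsd")
serialized = ET.tostring(elt, encoding="unicode")

# Hypothetical text splice: the same declaration is added again as a
# plain string, without the writer knowing about it.
spliced = serialized.replace("<file", '<file xmlns:xsi="%s"' % XSI_NS, 1)

print(is_well_formed(serialized))
print(is_well_formed(spliced))
```

The spliced string carries the xmlns:xsi declaration twice on one element, which is what expat rejects as a duplicate attribute.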

Oddly, I'm not able to reproduce this error outside of PyTest (running jpylyzer directly on the affected files does not result in any issues, and the XML is valid).

BUT I don't think this can cause the failure in PyTest!

Problem 2: missing files in output

Also tried this:

python3 ~/jpylyzer/cli.py ~/openjpeg-data/input/conformance/* > conf-all.xml

The resulting XML contains output for 57 files, but the directory contains 59 files.

Needs further investigation.

Update: this seems to be a bug in my text editor; a double-check with xmllint confirmed that the number of file elements is in fact identical to the number of files!
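
For reference, the same count can also be done in Python rather than with xmllint. This is a rough sketch that assumes the file elements are direct children of the jpylyzer root element in the v2 namespace (as in the output shown earlier); the count_file_elements helper and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the root element of the jpylyzer output shown above
JPYLYZER_NS = "http://openpreservation.org/ns/jpylyzer/v2/"


def count_file_elements(xml_text):
    """Count the file elements directly under the root element."""
    root = ET.fromstring(xml_text)
    return len(root.findall("{%s}file" % JPYLYZER_NS))


# Tiny stand-in document (hypothetical; real output has far more content)
sample = (
    '<jpylyzer xmlns="%s">'
    '<toolInfo/><file/><file/>'
    '</jpylyzer>' % JPYLYZER_NS
)

print(count_file_elements(sample))
```

For real output, the text of conf-all.xml would be read from disk and the result compared against the number of files in the input directory.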

bitsgalore commented 2 years ago

Update: the namespace issue was caused by me accidentally running checkFiles without setting:

config.INPUT_WRAPPER_FLAG = True

This is slightly confusing because the default value (as per the config file) is "False", but it is set to "True" later on. Fixed by https://github.com/openpreserve/jpylyzer/commit/0906e7341383e6d96336a282ff96ef44b0f47223.
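
One way to avoid this kind of surprise in tests is to make every flag a test depends on explicit, and to restore it afterwards. The sketch below uses a SimpleNamespace as a stand-in for jpylyzer's config module (an assumption; the real module has more settings), and config_overrides is a hypothetical helper, not part of jpylyzer.

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Stand-in for jpylyzer's config module (hypothetical), using the
# default of False mentioned above.
config = SimpleNamespace(INPUT_WRAPPER_FLAG=False, VALIDATION_FORMAT="jp2")


@contextmanager
def config_overrides(cfg, **overrides):
    """Temporarily set attributes on cfg and restore the old values on exit."""
    saved = {name: getattr(cfg, name) for name in overrides}
    for name, value in overrides.items():
        setattr(cfg, name, value)
    try:
        yield cfg
    finally:
        for name, value in saved.items():
            setattr(cfg, name, value)


with config_overrides(config, INPUT_WRAPPER_FLAG=True):
    # A call such as checkFiles(...) would go here in a real test
    flag_inside = config.INPUT_WRAPPER_FLAG

flag_after = config.INPUT_WRAPPER_FLAG
print(flag_inside, flag_after)
```

Restoring the old value on exit keeps one test's settings from leaking into the next test that runs in the same process.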