openpreserve / jpylyzer

JP2 (JPEG 2000 Part 1) validator and properties extractor. Jpylyzer was specifically created to check that a JP2 file really conforms to the format's specifications. Additionally jpylyzer is able to extract technical characteristics.
http://jpylyzer.openpreservation.org/
Other
69 stars 28 forks source link

Add an optional MIX output if isValidJP2 #121

Closed tledoux closed 4 years ago

tledoux commented 4 years ago

An new --mix [0|1|2] argument is added to ask for a MIX output in version 1.0 or 2.0.

A new propertiesExtension tag is added at the end of the normal output. It is either empty or it embeds the MIX fragment. For backward compatibility, the new tag is not output if no mix asked (default behaviour).

bitsgalore commented 4 years ago

Hi Thomas,

First of all thanks for this contribution. I did a quick test with the jpylyzer-test-files dataset, which brought up a number of issues:

  1. The value of <mix:codec> is based on the contents of the (first?) codestream comment. This is not a reliable mechanism for establishing the codec name and version, because not all codecs store their name in a codestream comment, and codestream comments can be used for any arbitrary information. As an example, for some images in jpylyzer-test-files it results in things like <mix:codec>Jpylyzer demo</mix:codec>.

  2. The MIX code crashes on various files in jpylyzer-test-files that have names that contain non-ASCII characters (Japanese, Chinese, and also the surrogate pair files). Jpylyzer should be able to handle these.

  3. The mix output doesn't include references to the MIX 1.0/2.0 schemas, so there's no easy way to validate it.

  4. I would suggest to drop the 0 choice of the --mix option, as it doesn't appear to have any effect.

  5. Maybe I'm doing something wrong here, but this pull request results in some merge conflicts with your previous pull request (which I already merged into the master branch earlier).

There's also the more general question whether including alternative output formats is something that should be handled by jpylyzer. A few years ago someone already opened this issue with a request for MIX support. At the time I was hesitant to include this for a number of reasons that you see in my comment to that thread (although your solution to the resolution issue looks like a good one to me). Related to this I'm also somewhat concerned about the potential impact of including this on future maintenance efforts. This is partially because some of the mappings to MIX are slightly arbitrary, so potential users of this feature may have different needs/expectations, which makes me wonder if we should not keep this out of jpylyzer, and leave this up to the users (who can then use their own xslt transformations, if they so wish). I'll have to give this some more thought. Perhaps @carlwilson can offer some thoughts on that too?

If we decide to include these changes (based on an updated pull request that fixes the issues mentioned above), I would be in favour of moving the MIX-related code to a separate module, as the jpylyzer.py module is already over 700 lines as it is, and adding another 250 lines makes it a bit on the heavy side.

tledoux commented 4 years ago

Hi Johan, thanks for looking into this. I will try to look at your findings but I'm currently a little remote from my working computer (vacations ;)

There should be any difficulties correcting what you found and I should have tested against the jpylyzer-test-files, my bad !!!

More generally, there are 2 reasons that decide me to propose this PR:

  1. we currently make the transformation of jpylyzer output to mix with a XSLT transformation but this process is complex (we had to add external functions in java) and not very robust (every new feature in an image ends up with an invalid mix). It's my opinion that outputing mix directly from the analysis will correct those shortcomes.
  2. I implement a mix output for the JP2 module of jhove. But I really think that JP2 validation should be made with jpylyzer and I don't want a lack of functionnality forbids the drop of the JP2-HUL module.
tledoux commented 4 years ago

Hi Johan, the last commits should answer your questions. I just leave the --mix 0 which explicitly ask for no mix output, instead of relying to the default behaviour.

The code has been tranferred to a specific class : much cleaner.

I also had to modify the Travis setup because it seems that 3.2 and 3.3 version of python are not available anymore... By the way, the problem with the nonASCII files where not related to the characters but it was a simple bug when retrieving non existing properties from a dictionnary.

Hope to have a better base for discussion

bitsgalore commented 4 years ago

Hi Thomas,

Many thanks for these changes, this is excellent (and yes much cleaner as well). I just managed to merge your modifications into a branch that uses the (so far provisional) v2 output; you can find it here:

https://github.com/openpreserve/jpylyzer/commits/mixOutputv2

I've made some minor changes (some small formatting issues, a missing import + some renaming for clarity). I also created an updated XSD schema. I'll probably need to fiddle a bit more with this, but overall I think this now looks pretty good. Also your remark about the JHOVE JP2 module makes sense, so yes having the possibility to do this directly in jpylyzer would probably be very useful to a lot of (potential) jpylyzer users.

tledoux commented 4 years ago

Hi Johan, thanks for taking care of all those "details". I see you're putting the documentation in sync as well as the schema. Everything seems great !!!

bitsgalore commented 4 years ago

Pull request was merged manually in branch jpylyzer2-dev, see https://github.com/openpreserve/jpylyzer/commit/f27ffe6f36cfb5bd4a28cdf7615c56f87bbe0dae. Closing.