openpreserve / jpylyzer

JP2 (JPEG 2000 Part 1) validator and properties extractor. Jpylyzer was specifically created to check that a JP2 file really conforms to the format's specifications. Additionally jpylyzer is able to extract technical characteristics.
http://jpylyzer.openpreservation.org/

Feature request - auto-detect format #166

Closed boxerab closed 1 year ago

boxerab commented 4 years ago

Since there are magic bytes at the beginning of jp2 files, and marker bytes at the beginning of j2c codestreams, it would be useful not to have to specify the format and to have jpylyzer look for those bytes instead.
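As a rough illustration of the byte checks meant here (not jpylyzer code): a JP2 file opens with the 12-byte JPEG 2000 signature box, and a raw codestream opens with the SOC marker followed by the SIZ marker. The function names below are made up for this sketch.

```python
from typing import Optional

# A JP2 file opens with the 12-byte JPEG 2000 signature box.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")
# A raw codestream opens with the SOC marker (FF4F) followed by the SIZ marker (FF51).
J2C_START = bytes.fromhex("ff4fff51")


def sniff_format(header: bytes) -> Optional[str]:
    """Guess 'jp2' or 'j2c' from the leading bytes; return None otherwise.

    Hypothetical helper: it only looks at the first bytes and says nothing
    about whether the rest of the file is valid.
    """
    if header.startswith(JP2_SIGNATURE):
        return "jp2"
    if header.startswith(J2C_START):
        return "j2c"
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read just enough of the file to apply sniff_format."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(len(JP2_SIGNATURE)))
```

Anything that matches neither pattern comes back as `None`, so the caller still has to decide what to do with such files.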

I am using jpylyzer to analyze a large directory of jp2 and j2c images, looking for the type of wavelet used, and currently a single run can only parse either jp2 or j2c, not both.

Thanks!

boxerab commented 4 years ago

btw, this is the command I am using to parse the files - might be useful to have in the docs:

jpylyzer --format j2c -r $DIRECTORY --verbose 2> /dev/null | perl -ne 'print if /filePath/ xor /\btransformation\b/'

bitsgalore commented 4 years ago

Auto-detection is something I did consider at some point, but there are several situations where this feature would lead to unexpected results, mainly:

  1. It's perfectly possible that input files are neither jp2 nor j2c at all! In that case the validation format is undefined (there would be ways to handle that, but this would require changes to the output format).

  2. It would lead to unexpected results for files where the magic bytes themselves are wrong. The classic example is the JPX (JPEG 2000 Part 2) files from Adobe Photoshop that have a 'brand' value that falsely identifies them as JP2 (JPEG 2000 Part 1).

  3. If we'd ever decide to add JPX validation functionality to jpylyzer (unlikely at this point, but who knows), auto-detect would further increase the impact of 2. above.
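To make point 2 concrete, here is a minimal sketch of how a brand-based check would read the value in question. It assumes the usual layout in which the File Type box immediately follows the signature box; `read_brand` is a hypothetical helper, not part of jpylyzer.

```python
from typing import Optional

# 12-byte JPEG 2000 signature box, shared by JP2 and JPX files.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")


def read_brand(header: bytes) -> Optional[bytes]:
    """Return the 4-byte brand from the File Type box, or None if it is not there.

    Assumed layout after the signature box:
    4-byte box length, 4-byte box type ('ftyp'), 4-byte brand.
    """
    if len(header) < 24 or not header.startswith(JP2_SIGNATURE):
        return None
    if header[16:20] != b"ftyp":
        return None
    return header[20:24]
```

The brand only records what the writing software claimed. For the Photoshop files mentioned above, a check like this would return `b"jp2 "` even though the content is JPX, which is exactly the kind of result auto-detection would then act on.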

In addition to that, previous experiences with the JHOVE tool have shown that combining format identification and validation into one single tool can lead to various weird and unexpected problems, so I'm a bit hesitant to add this feature.

For your specific use case you could also write a simple bash script that recursively goes through all files in your directory, and then for each file:

  1. Use a dedicated format identification tool (Unix File, Fido, Siegfried, etc.) to establish the file's format.

  2. Use the outcome of step 1 to set the corresponding --format value, and then run jpylyzer on the file.
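The two steps above could be sketched roughly as follows, here in Python rather than bash. The `identify` callable is a placeholder for whichever identification tool is used (its invocation and output parsing are deliberately left out, since they differ per tool); `plan_runs` and `run_all` are hypothetical names. The only jpylyzer option used is `--format`, as in the command earlier in this thread.

```python
import os
import subprocess
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# identify(path) is supplied by the caller and should return "jp2", "j2c",
# or None for anything else. It stands in for step 1 (the identification tool).
Identify = Callable[[str], Optional[str]]


def plan_runs(root: str, identify: Identify) -> Iterator[Tuple[str, List[str]]]:
    """Walk root recursively and yield (path, jpylyzer command) per recognised file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            fmt = identify(path)
            if fmt in ("jp2", "j2c"):
                # Step 2: pass the identified format on to jpylyzer.
                yield path, ["jpylyzer", "--format", fmt, path]
            # Files identified as something else are skipped here.


def run_all(root: str, identify: Identify) -> Dict[str, str]:
    """Run jpylyzer once per recognised file and collect its standard output."""
    outputs = {}
    for path, command in plan_runs(root, identify):
        result = subprocess.run(command, capture_output=True, text=True)
        outputs[path] = result.stdout
    return outputs
```

Keeping identification behind a separate callable mirrors the split between identification and validation described above: the identification tool can be swapped without touching the part that calls jpylyzer.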

carlwilson commented 4 years ago

I will add a couple of examples to the documentation that show how to put these tools together in a suitable workflow.

boxerab commented 4 years ago

Hi Johan, thanks for the detailed reply.

> Auto-detection is something I did consider at some point, but there are several situations where this feature would lead to unexpected results, mainly:
>
> 1. It's perfectly possible that input files are neither jp2 nor j2c at all! In that case the validation format is undefined (there would be ways to handle that, but this would require changes to the output format).

How is this currently handled? You could handle it the same way for auto-detection.

> 2. It would lead to unexpected results for files where the magic bytes themselves are wrong. The classic example is the [JPX (JPEG 2000 Part 2) files from Adobe Photoshop that have a 'brand' value that falsely identifies them as JP2 (JPEG 2000 Part 1)](http://wiki.opf-labs.org/display/TR/Files+identified+as+JP2+are+really+JPX).

Interesting, didn't know about that issue. What happens currently with those files?

> 3. If we'd ever decide to add JPX validation functionality to jpylyzer (unlikely at this point, but who knows), auto-detect would further increase the impact of 2. above.

If JPX was added, then auto-detect would just look for JPX magic.

> In addition to that, previous experiences with the JHOVE tool have shown that combining format identification and validation into one single tool can lead to various weird and unexpected problems, so I'm a bit hesitant to add this feature.
>
> For your specific use case you could also write a simple bash script that recursively goes through all files in your directory, and then for each file:
>
> 1. Use a dedicated format identification tool (Unix File, [Fido](https://github.com/openpreserve/fido), [Siegfried](https://www.itforarchivists.com/siegfried), etc.) to establish the file's format.
>
> 2. Use the outcome of step 1 to set the corresponding `--format` value, and then run jpylyzer on the file.

Thanks, I will try that.

boxerab commented 4 years ago

> I will add a couple of examples to the documentation that show how to put these tools together in a suitable workflow.

Thanks, Carl.