openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

Handle filter params as indirect objects #871

Closed prettybits closed 7 months ago

prettybits commented 1 year ago

Similar to #672, where filters given as indirect objects weren't handled, the same issue still exists for filter parameters.

I came across this when trying to validate this document (disregard the non-standard file extension for now), which was originally created by a Xerox AltaLink C8030 scanner. The current JHOVE release reports it as not well-formed with a "Malformed filter" error (PDF-HUL-45).

I tracked this down to the decode parameter for a DCTDecode filter being an indirect object, which the relevant code path didn't handle so far. The indirect object resolves to an empty dictionary, which seems quirky, but as I read the spec there doesn't seem to be an explicit provision against that.

While writing the fix I also added an explicit check for null, since a null object is defined as either the literal "null" keyword or an indirect object pointing to nothing. I get the feeling there should probably be more generic handling of indirect and null objects throughout the PDF module?
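To illustrate the idea, here is a minimal sketch of resolving a filter's decode parameters before inspecting them. It is not the actual patch and does not use JHOVE's classes: `FilterParamsSketch`, `Ref`, `PDF_NULL`, `resolve`, and `decodeParams` are all hypothetical names, and the cross-reference table is assumed to be a plain map from object number to object.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical, simplified object model; these are not JHOVE's real classes. */
final class FilterParamsSketch {

    /** Stand-in for an indirect reference such as "12 0 R" (generation omitted). */
    static final class Ref {
        final int objNumber;
        Ref(int objNumber) { this.objNumber = objNumber; }
    }

    /** Marker for the PDF null object (the literal "null" keyword). */
    static final Object PDF_NULL = new Object();

    /**
     * Follows indirect references until a direct object is reached.
     * A reference to an object missing from the table resolves to PDF_NULL.
     */
    static Object resolve(Object obj, Map<Integer, Object> xref) {
        int hops = 0;
        while (obj instanceof Ref) {
            if (++hops > 32) {
                // Assumed limit, only here to stop on reference cycles.
                throw new IllegalStateException("Indirect reference chain too long");
            }
            Object target = xref.get(((Ref) obj).objNumber);
            obj = (target == null) ? PDF_NULL : target;
        }
        return obj;
    }

    /**
     * Returns the decode parameter dictionary for a filter. Null or absent
     * parameters are treated as an empty dictionary; anything else that is
     * not a dictionary is rejected as malformed.
     */
    static Map<String, Object> decodeParams(Object raw, Map<Integer, Object> xref) {
        Object resolved = resolve(raw, xref);
        if (resolved == null || resolved == PDF_NULL) {
            return Collections.emptyMap();
        }
        if (resolved instanceof Map) {
            @SuppressWarnings("unchecked")
            Map<String, Object> dict = (Map<String, Object>) resolved;
            return dict;
        }
        throw new IllegalArgumentException("Malformed filter: parameters are not a dictionary");
    }
}
```

With this shape, an indirect reference to an empty dictionary (as in the scanner's file) and a reference to a nonexistent object both yield an empty parameter map rather than an error.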

prettybits commented 1 year ago

I don't know why the two test files that fail the integration test are named after PDF-HUL-26. When tested with 1.28, they fail with PDF-HUL-45 because of the same issue with indirect filter parameter objects, so it makes sense that they are reported differently now.

carlwilson commented 7 months ago

> I don't know why the two test files that fail the integration test are named after PDF-HUL-26, when testing with 1.28 these fail with PDF-HUL-45

I believe that JHOVE was falsely reporting PDF-HUL-26 for these files when they were added to the corpus. Thanks for the fix.