openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org
Other
171 stars 79 forks source link

Only inform about unknown PDF Name prefixes #807

Closed prettybits closed 1 year ago

prettybits commented 1 year ago

Also related to #668, unknown prefixes are currently treated as validation errors, but I'm not sure why. The example files in the linked issue report the following error: ErrorMessage: Unknown Developer Prefix:: ESIC, but I haven't been able to find anything specific about the prefix. It is not currently listed in the official prefix name registry (which I have also updated and sorted alphabetically on this occasion).

I don't currently have access to the formal specification (the link in the comments is being redirected by now), so I'm not sure what specifically it says about prefixes and how to treat them re validity. Can you help me out with some quotes, @carlwilson?

While searching around the codebase for the validation message I also found that it is currently wrongly used for version parsing errors in the PDF header, which I changed to the intended one. If you prefer I can split these commits up to separate PRs, I just combined them for now since they seemed thematically related enough.

samalloing commented 1 year ago

Hi @prettybits

I used error because in the PDF specification it is stated in section 7.12 Extensions Dictionary: "The keys in the extensions dictionary shall be names consisting only of the registered prefixes, described in Annex E, of the developers whose extensions are being used". Annex E is about the Name registry. That's why I used an ErrorMessage instead of an InfoMessage. So the Prefix needs to be registered.

But thanks for updating the registry!

Sam

prettybits commented 1 year ago

Hi @samalloing, thanks for reviewing! I didn't see the preexisting PR for the PDF-HUL155 issue, I will rebase the branch without the commit.

I did in the meantime also find the specification document, so I can follow the official text as well. The rules about developer extensions key names do seem somewhat unclear to me I have to say.

Chapter E.1 "General" says

Developer prefixes shall be used to identify extensions to PDF that use First Class names (see below) and that are intended for public use. (See 7.12.2, “Developer Extensions Dictionary.)

and Chapter I.3 "Feature Compatibility" mentions

See 7.12.2, “Developer Extensions Dictionary” for a discussion of how to designate the use of public extensions in PDF file.

which makes it seem like this is more about "public" extensions, although it also sounds like per E.2 private extensions should then still use a XX prefix. I'm not sure that treating unregistered prefixes as a validity error is really helping in practice (I assume it is quite likely that there exists proprietary? reading software that knows how to treat the improperly named private extension), but I suppose strict conformance supports the current handling.

Shall I rebase the PR with just the registry update commit intact?

prettybits commented 1 year ago

I just noticed as well that 7.12.4 uses "GLGR" as an extension key in its examples, which is not even in the registered name list, don't know what to make of that.

I also wonder if @tballison may have thoughts about this?

carlwilson commented 1 year ago

OK, I've been working my way through the earlier PDF module PRs but these are on my radar now. I'm going to raise one or two of these questions with our resident PDF expert and will add any answers here. I'd like to get this merged and working for the upcoming 1.28 release.

prettybits commented 1 year ago

Thanks, @carlwilson!

In the meantime I have rebased my branch, removed the now redundant fix and re-added the "ADBE" prefix which I noticed I mistakenly left out previously.

tballison commented 1 year ago

I also wonder if @tballison may have thoughts about this?

I defer to digipres on use cases for digipres and actual pdf experts on the spec.

Personal opinion only: I'd want to be alerted to public extensions, private extensions and non-standard extensions. For some use cases, I imagine that you'd not want to allow any extensions because of the risk of data leakage and or vendor-specific behavior that can't be replicated.

carlwilson commented 1 year ago

I've read the specification document and I'm in agreement with your reading @prettybits in that unregistered prefixes should begin xx but if they do then the document isn't invalid. I think I know the issue with GLGR as well. I think that it's this company https://whattheythink.com/news/18038-global-graphics-launches-jaws-pdf-library-com-edition/, but that link is 2003. Reading on the Adobe names list GitHub site here: https://github.com/adobe/pdf-names-list#additional-details it says

NOTE: Due to legal reasons, Adobe maintains the list of all registrations that occured prior to that 2008.

I suspect this is one of those, which raises the possibility of registered names that aren't on the list. I'm starting to favour a warning for unknown prefixes and xx ones but not marking the document as invalid.

I've asked someone who knows more about PDF than me to give some advice. I also noticed we're one registration light for "DLLB" added on Dec 12 2022. By coincidence, it's the person I've asked to help.

codecov[bot] commented 1 year ago

Codecov Report

Base: 29.18% // Head: 46.22% // Increases project coverage by +17.04% :tada:

Coverage data is based on head (291f7e6) compared to base (0dba54e). Patch has no changes to coverable lines.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## integration #807 +/- ## ================================================== + Coverage 29.18% 46.22% +17.04% - Complexity 597 1056 +459 ================================================== Files 57 57 Lines 9057 9057 Branches 1607 1607 ================================================== + Hits 2643 4187 +1544 + Misses 6033 4330 -1703 - Partials 381 540 +159 ``` | [Impacted Files](https://codecov.io/gh/openpreserve/jhove/pull/807?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openpreserve) | Coverage Δ | | |---|---|---| | [.../java/edu/harvard/hul/ois/jhove/ConfigHandler.java](https://codecov.io/gh/openpreserve/jhove/pull/807?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openpreserve#diff-amhvdmUtY29yZS9zcmMvbWFpbi9qYXZhL2VkdS9oYXJ2YXJkL2h1bC9vaXMvamhvdmUvQ29uZmlnSGFuZGxlci5qYXZh) | `61.41% <0.00%> (+0.78%)` | :arrow_up: | | [...ava/org/openpreservation/jhove/ReleaseDetails.java](https://codecov.io/gh/openpreserve/jhove/pull/807?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openpreserve#diff-amhvdmUtY29yZS9zcmMvbWFpbi9qYXZhL29yZy9vcGVucHJlc2VydmF0aW9uL2pob3ZlL1JlbGVhc2VEZXRhaWxzLmphdmE=) | `82.75% <0.00%> (+1.72%)` | :arrow_up: | | [...n/java/edu/harvard/hul/ois/jhove/PropertyType.java](https://codecov.io/gh/openpreserve/jhove/pull/807?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openpreserve#diff-amhvdmUtY29yZS9zcmMvbWFpbi9qYXZhL2VkdS9oYXJ2YXJkL2h1bC9vaXMvamhvdmUvUHJvcGVydHlUeXBlLmphdmE=) | `100.00% <0.00%> (+4.76%)` | :arrow_up: | | [...rvard/hul/ois/jhove/messages/JhoveMessageImpl.java](https://codecov.io/gh/openpreserve/jhove/pull/807?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openpreserve#diff-amhvdmUtY29yZS9zcmMvbWFpbi9qYXZhL2VkdS9oYXJ2YXJkL2h1bC9vaXMvamhvdmUvbWVzc2FnZXMvSmhvdmVNZXNzYWdlSW1wbC5qYXZh) | `21.42% <0.00%> (+4.76%)` | :arrow_up: | | [...a/edu/harvard/hul/ois/jhove/NisoImageMetadata.java](https://codecov.io/gh/openpreserve/jhove/pull/807?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openpreserve#diff-amhvdmUtY29yZS9zcmMvbWFpbi9qYXZhL2VkdS9oYXJ2YXJkL2h1bC9vaXMvamhvdmUvTmlzb0ltYWdlTWV0YWRhdGEuamF2YQ==) | `76.19% <0.00%> (+9.32%)` | :arrow_up: | | [.../java/edu/harvard/hul/ois/jhove/PropertyArity.java](https://codecov.io/gh/openpreserve/jhove/pull/807?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openpreserve#diff-amhvdmUtY29yZS9zcmMvbWFpbi9qYXZhL2VkdS9oYXJ2YXJkL2h1bC9vaXMvamhvdmUvUHJvcGVydHlBcml0eS5qYXZh) | `100.00% <0.00%> (+10.00%)` | :arrow_up: | | [...in/java/edu/harvard/hul/ois/jhove/HandlerBase.java](https://codecov.io/gh/openpreserve/jhove/pull/807?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openpreserve#diff-amhvdmUtY29yZS9zcmMvbWFpbi9qYXZhL2VkdS9oYXJ2YXJkL2h1bC9vaXMvamhvdmUvSGFuZGxlckJhc2UuamF2YQ==) | `71.42% <0.00%> (+11.56%)` | :arrow_up: | | [...src/main/java/edu/harvard/hul/ois/jhove/Agent.java](https://codecov.io/gh/openpreserve/jhove/pull/807?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openpreserve#diff-amhvdmUtY29yZS9zcmMvbWFpbi9qYXZhL2VkdS9oYXJ2YXJkL2h1bC9vaXMvamhvdmUvQWdlbnQuamF2YQ==) | `46.82% <0.00%> (+13.49%)` | :arrow_up: | | [...main/java/edu/harvard/hul/ois/jhove/JhoveBase.java](https://codecov.io/gh/openpreserve/jhove/pull/807?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openpreserve#diff-amhvdmUtY29yZS9zcmMvbWFpbi9qYXZhL2VkdS9oYXJ2YXJkL2h1bC9vaXMvamhvdmUvSmhvdmVCYXNlLmphdmE=) | `41.29% <0.00%> (+18.02%)` | :arrow_up: | | [...c/main/java/edu/harvard/hul/ois/jhove/RepInfo.java](https://codecov.io/gh/openpreserve/jhove/pull/807?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openpreserve#diff-amhvdmUtY29yZS9zcmMvbWFpbi9qYXZhL2VkdS9oYXJ2YXJkL2h1bC9vaXMvamhvdmUvUmVwSW5mby5qYXZh) | `54.61% <0.00%> (+21.53%)` | :arrow_up: | | ... and [24 more](https://codecov.io/gh/openpreserve/jhove/pull/807?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openpreserve) | | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openpreserve). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openpreserve)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

prettybits commented 1 year ago

Good hunting, @carlwilson, Global Graphics also looks like the right candidate to me for the GLGR prefix. In the names list there is one entry that has a listed creation date before 2008: Plib for PDFlib GmbH, and I'm also not quite sure how to parse that note by Adobe. It sounds like (almost all) those names are known to Adobe only and cannot be publically listed in the repository? It's quite certain to not be a complete list in any case it would seem.

I also added the DLLB prefix now, that was commited a few hours after I opened this PR. ;)

As to validity, I think that even if the prefix names list were complete, unless we knew how to validate what the extension actually means to extend we wouldn't really win anything by being strict. I agree with @tballison that being notified about any extension no matter how official can be very valuable, but that's not a question of validity but more of policy. But that is personal opinion territory, let's see what Peter says. :)

carlwilson commented 1 year ago

I've now had confirmation that not all of the names on the pre-2008 list are public, so there are legitimate prefixes that aren't on the list. A better explanation of the implications of the message on the Wiki and an informational message is the way to go here. I will merge this and add some better background to the Wiki. Thanks for the PR @prettybits