veraPDF / veraPDF-library

Industry supported, open source PDF/A validation library
http://verapdf.org/software
GNU General Public License v3.0

Make logging system more useful for end users #1445

Open MaximPlusov opened 3 months ago

MaximPlusov commented 3 months ago

Originally posted by @ozross in https://github.com/duallab/ngPDF/issues/2#issuecomment-2067394063 and https://github.com/duallab/ngPDF/issues/2#issuecomment-2071152016

The picture I sent shows warnings about duplicated dictionary keys. The object ID given is that of the /StructTreeRoot dictionary, which seems rather strange. I've traced these to the same name being used as a key in both the /RoleMap and /ClassMap dictionaries, which surely is valid, though maybe not best practice.

Screenshot 2024-04-18 at 11 30 13 am

After changing the /RoleMap entry to an indirect reference to a separate dictionary object, and similarly for the /ClassMap entry, the warnings no longer occur. Previously these dictionaries were given as direct entries of /StructTreeRoot.

Is there a lesson to be learned here, that could/should be shared in some documentation?

FallMT2022-Jul28.pdf

Here are the warnings:

Apr 23, 2024 7:04:58 AM org.verapdf.parser.COSParser getDictionary
WARNING: Dictionary/Stream contains duplicated key /CRDclause(object key = 470 0 obj, offset = 281158)
Apr 23, 2024 7:04:58 AM org.verapdf.parser.COSParser getDictionary
WARNING: Dictionary/Stream contains duplicated key /onPages(object key = 470 0 obj, offset = 281217)
Apr 23, 2024 7:04:58 AM org.verapdf.parser.COSParser getDictionary
WARNING: Dictionary/Stream contains duplicated key /NOAAtype(object key = 470 0 obj, offset = 281676)
Apr 23, 2024 7:04:58 AM org.verapdf.parser.COSParser getDictionary
WARNING: Dictionary/Stream contains duplicated key /CRDcitation(object key = 470 0 obj, offset = 281776)
Apr 23, 2024 7:04:58 AM org.verapdf.parser.COSParser getDictionary
WARNING: Dictionary/Stream contains duplicated key /CRDfishimages(object key = 470 0 obj, offset = 281807)
Apr 23, 2024 7:04:58 AM org.verapdf.parser.COSParser getDictionary
WARNING: Dictionary/Stream contains duplicated key /PRPcomment(object key = 470 0 obj, offset = 281844)
Apr 23, 2024 7:05:01 AM org.verapdf.gf.model.factory.operators.OperatorParser parseOperator
WARNING: Content stream contains duplicate MCID - 1
Apr 23, 2024 7:05:01 AM org.verapdf.gf.model.factory.operators.OperatorParser parseOperator
WARNING: Content stream contains duplicate MCID - 2
Apr 23, 2024 7:05:01 AM org.verapdf.gf.model.factory.operators.OperatorParser parseOperator
WARNING: Content stream contains duplicate MCID - 3
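
To see at a glance which object the duplicated-key warnings point at, the log text can be grouped by object key. The following is only a minimal sketch: the regular expression is written against the warning wording shown above, and the function name `group_duplicate_keys` is hypothetical, not part of veraPDF.

```python
import re
from collections import defaultdict

# Pattern for the "duplicated key" warning text as it appears in the log above.
DUP_KEY = re.compile(
    r"WARNING: Dictionary/Stream contains duplicated key (/\S+?)"
    r"\(object key = (\d+) (\d+) obj, offset = (\d+)\)"
)

def group_duplicate_keys(log_text):
    """Map (object number, generation) to a list of (key, offset) pairs."""
    grouped = defaultdict(list)
    for match in DUP_KEY.finditer(log_text):
        key, num, gen, offset = match.groups()
        grouped[(int(num), int(gen))].append((key, int(offset)))
    return dict(grouped)

# Two of the warnings from the log above, copied verbatim.
sample = """\
Apr 23, 2024 7:04:58 AM org.verapdf.parser.COSParser getDictionary
WARNING: Dictionary/Stream contains duplicated key /CRDclause(object key = 470 0 obj, offset = 281158)
Apr 23, 2024 7:04:58 AM org.verapdf.parser.COSParser getDictionary
WARNING: Dictionary/Stream contains duplicated key /onPages(object key = 470 0 obj, offset = 281217)
"""

for obj, entries in group_duplicate_keys(sample).items():
    print(obj, entries)
```

Every duplicated-key warning in this log reports the same object (470 0 obj), and the offsets give the byte positions to inspect in the file.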

Here is a picture displaying some of what I think is happening — but it doesn't indicate or explain all of it.

FallMT2022-duplicated-keys

There are:

 3 keys that are used in both the RoleMap and ClassMap: /CRDclause, /CRDfishimages, /PRPcomment;
 one key that differs in the case of a single letter: /CRDcitation as a role, as opposed to /CRDCitation as a class;
 2 others with no duplication: /onPages and /NOAAtype.

Indeed the latter, /NOAAtype, is not used at all within the structure tree, except as the title (NOAAtype) of 2 different objects: top.01 and top.05 (objects 551 and ??? respectively).
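
The comparison described above can be sketched in a few lines. This is illustrative only: the two key lists are transcribed from the description in this comment, not read from the PDF, and the assumption that /onPages and /NOAAtype occur only in the RoleMap is mine. The function name `compare_keys` is hypothetical. Note that PDF names are case-sensitive, so /CRDcitation and /CRDCitation are different keys.

```python
def compare_keys(role_keys, class_keys):
    """Return (keys present in both maps, pairs differing only by letter case)."""
    role, cls = set(role_keys), set(class_keys)
    shared = sorted(role & cls)
    # Index the class keys by their lower-cased form to spot case-only mismatches.
    lower_cls = {k.lower(): k for k in cls}
    case_only = sorted(
        (r, lower_cls[r.lower()])
        for r in role - cls
        if r.lower() in lower_cls
    )
    return shared, case_only

# Key names as listed in the comment above (assumed distribution between maps).
role_keys = ["/CRDclause", "/CRDfishimages", "/PRPcomment",
             "/CRDcitation", "/onPages", "/NOAAtype"]
class_keys = ["/CRDclause", "/CRDfishimages", "/PRPcomment", "/CRDCitation"]

shared, case_only = compare_keys(role_keys, class_keys)
print("shared:", shared)
print("case-only differences:", case_only)
```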

Hope this helps.

MaximPlusov commented 3 months ago

In this document the RoleMap contains duplicated entries:

 /CRDclause /Span
 /CRDclause /P
 /onPages /Div
 /onPages /Reference
 /NOAAtype /P
 /NOAAtype /P
 /CRDcitation /P
 /CRDcitation /P
 /CRDfishimages /Div 
 /CRDfishimages /Div
 /PRPcomment /Para
 /PRPcomment /Div

veraPDF and Acrobat both use the value that appears later.
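
A small sketch of this "later value wins" behaviour, using the RoleMap entries listed above. This models the described outcome only; it is not veraPDF's or Acrobat's code, and the function name `find_duplicates` is hypothetical.

```python
from collections import defaultdict

# RoleMap entries in the order listed above.
entries = [
    ("/CRDclause", "/Span"), ("/CRDclause", "/P"),
    ("/onPages", "/Div"), ("/onPages", "/Reference"),
    ("/NOAAtype", "/P"), ("/NOAAtype", "/P"),
    ("/CRDcitation", "/P"), ("/CRDcitation", "/P"),
    ("/CRDfishimages", "/Div"), ("/CRDfishimages", "/Div"),
    ("/PRPcomment", "/Para"), ("/PRPcomment", "/Div"),
]

def find_duplicates(pairs):
    """Return {key: [values in order]} for every key that occurs more than once."""
    seen = defaultdict(list)
    for key, value in pairs:
        seen[key].append(value)
    return {k: v for k, v in seen.items() if len(v) > 1}

# dict() keeps the last value for a repeated key, matching the behaviour described.
effective = dict(entries)

for key, values in find_duplicates(entries).items():
    print(key, values, "->", effective[key])
```

Six keys are duplicated here, which matches the six duplicated-key warnings in the log. For /CRDclause, /onPages and /PRPcomment the two values differ, so the later one silently changes the role mapping.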

MaximPlusov commented 3 months ago

Originally posted by @ozross in https://github.com/duallab/ngPDF/issues/2#issuecomment-2073556947

OK. That is a simple explanation ...

... and I now know the way to prevent it from happening within my LaTeX processing.

But it raises the question of how I could have found this for myself. When listing the RoleMap in Acrobat, or its Preflight utility, it shows only 1 entry for each key, in a list sorted alphabetically, which is not the order in which the entries appear within the PDF itself.

What software do you use to see the RoleMap, ClassMap and other internal code in a compressed PDF? Is it free, for Unix/Linux/MacOS? Or relatively inexpensive?

MaximPlusov commented 3 months ago

I don't know of any such programs. I used veraPDF debugging to explore this case.

MaximPlusov commented 3 months ago

Originally posted by @ozross in https://github.com/duallab/ngPDF/issues/2#issuecomment-2095087287 and https://github.com/duallab/ngPDF/issues/2#issuecomment-2112138406

Great; that's an option of which I was not aware. How does one turn it on? I'm using the GreenfieldGuiWrapper. I can see how to adjust Settings, such as the Logging Level, and the Features Config check-box marks (not sure what these give). But I don't see anything more detailed for debugging.

With Logging set to ALL I'm getting messages such as:

 FINE: Can't get PSObject for COSType COS_UNDEFINED (from getPSObject)
 FINE: Unknown ColorSpace name (from getColorSpaceFromName)
 FINE: java.lang.NumberFormatException: For input string "" (from readNumber)

Yet the Compliance is Passed (for PDF/UA-1). If Logging is set to anything else, these messages do not occur; so I'm guessing that they aren't really important. Nevertheless it would be nice to see just what they refer to, and where they occur. Maybe I'm setting a null string () instead of 0?

I'd like to learn how to diagnose these completely, even if it isn't crucial.

FallMT2023 2.pdf

The image below shows that the PDF validates for both PDF/UA-1 and PDF/A-3a, but a significant number of messages are written to the shell window from which the GUI was launched.

Screen Shot 2024-05-15 at 8 00 41 pm

After unchecking everything in that "Features Config" there are still many messages; so I cannot tell whether any of those settings were relevant. Probably not. Also, I realise now that I can run veraPDF from a command-line shell, and that there's a --debug option. But so far I've not been able to use it to explore the cause of these messages.

Cheers.

  Ross

MaximPlusov commented 3 months ago

@ozross All these messages are incorrect and we will try to remove them:

 FINE: Can't get PSObject for COSType COS_UNDEFINED from getPSObject (already disabled)
 FINE: Unknown ColorSpace name from getColorSpaceFromName (connected with using an Indexed color space)
 FINE: java.lang.NumberFormatException: For input string "" from readNumber (connected with '-|' inside the Type1 font Private part)

The --debug option is used only for showing all processed file names.