openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

PDF-HUL-45 : What logic is being checked for malformed filters? #971

Open asciim0 opened 1 week ago

asciim0 commented 1 week ago

I'm curious what logic is actually checked for PDF-HUL-45 error messages. I very much appreciate that filter arrays are now supported and no longer throw an error; however, it seems that most of the invalid manipulations I make to filter dictionaries pass validation as well.

Please see the attached file (malformednew.pdf) to try out various dictionary manipulations. For example, the object should be (and currently is):

22 0 obj << /BitsPerComponent 8 /ColorSpace 23 0 R /Filter /DCTDecode /Height 1042 /Name /X /Subtype /Image /Type /XObject /Width 736 /Length 114577 >>

Changing the filter from, for example, /DCTDecode to /DXTDecode or some other fictitious name still results in a well-formed and valid file.

Could you tell me what exactly JHOVE is checking in a filter dictionary?

malformednew.pdf

samalloing commented 1 week ago

Hi @asciim0 ,

I took a quick look at how filters are handled in the PdfStream.java file. The structure is checked, but not the name of the filter itself (the decode parameters are also stored). If there is a complete list of possible filters, then I think that would be easy to add.
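A minimal sketch of what such a name check might look like. This is not JHOVE code: the class `StandardFilters` and its methods are hypothetical, and the list is the set of standard filter names from ISO 32000, written without the leading slash. It assumes the filter names have already been extracted from the stream dictionary as plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of JHOVE.
final class StandardFilters {

    // Standard stream filter names defined in ISO 32000 (no leading slash).
    private static final Set<String> NAMES = Set.of(
            "ASCIIHexDecode", "ASCII85Decode", "LZWDecode", "FlateDecode",
            "RunLengthDecode", "CCITTFaxDecode", "JBIG2Decode", "DCTDecode",
            "JPXDecode", "Crypt");

    private StandardFilters() {
    }

    // True if the given name is one of the standard filters.
    static boolean isStandard(String name) {
        return NAMES.contains(name);
    }

    // Returns the names that are not standard filters, in their original order.
    static List<String> unknownNames(List<String> names) {
        List<String> unknown = new ArrayList<>();
        for (String name : names) {
            if (!isStandard(name)) {
                unknown.add(name);
            }
        }
        return unknown;
    }
}
```

With a check like this, `StandardFilters.isStandard("DXTDecode")` would return false, so a fictitious filter name could be reported instead of passing silently. Whether that should be an error or only a warning is a separate design question.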

Sam

asciim0 commented 1 week ago

Could you please elaborate on what you mean by "the structure is checked"? What does that include? Mandatory keys for all filters? For some?

samalloing commented 1 week ago

Just a simple check on the type of the PdfObject, for example whether it is a PdfArray or a PdfSimpleObject.

asciim0 commented 1 week ago

I'm sorry, I still don't understand what that simple check means. Do you mean it just checks whether an array is a correct array? Could you perhaps give a plain-language description of the logic the code uses to check filters?

samalloing commented 6 days ago

Hi Micky,

Sure. The Java code translates the PDF entities into Java objects, so a PDF array, for example, becomes an array object. The code then defines which types of PDF entity are allowed in a given place. In this specific case, a Filter can be a PDF object or a PDF array. A Filter can also be an indirect reference. That was not implemented at first, so it gave an error (PDF-HUL-45) until it was added. What the current code does is test whether the filter is a PDF object, a PDF array or an indirect reference. If the PDF contains something else there, say a dictionary, there will be an error. It also checks that the array itself is correct.
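To make the description above concrete, here is a rough sketch of a purely structural check. It is not the code in PdfStream.java: the types `PdfValue`, `Name`, `Array`, `Ref` and `Other` are hypothetical stand-ins, and the object table is assumed to be a simple map from object number to value. The sketch accepts a name, an array of names, or an indirect reference that resolves to one of those, and it never looks at which filter name is used.

```java
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration only; these are not JHOVE's classes.
final class FilterStructureCheck {

    interface PdfValue {
    }

    static final class Name implements PdfValue {
        final String value;
        Name(String value) { this.value = value; }
    }

    static final class Array implements PdfValue {
        final List<PdfValue> items;
        Array(PdfValue... items) { this.items = List.of(items); }
    }

    static final class Ref implements PdfValue {
        final int objNumber;
        Ref(int objNumber) { this.objNumber = objNumber; }
    }

    // Stands in for any other type, such as a dictionary or an integer.
    static final class Other implements PdfValue {
    }

    // Follows indirect references, with a hop limit so reference cycles stop.
    static PdfValue resolve(PdfValue value, Map<Integer, PdfValue> objects) {
        int hops = 0;
        while (value instanceof Ref && hops++ < 32) {
            value = objects.get(((Ref) value).objNumber);
        }
        return value;
    }

    // Structural check only: a name, or an array whose items are all names.
    static boolean isWellFormedFilter(PdfValue value, Map<Integer, PdfValue> objects) {
        PdfValue resolved = resolve(value, objects);
        if (resolved instanceof Name) {
            return true;
        }
        if (resolved instanceof Array) {
            for (PdfValue item : ((Array) resolved).items) {
                if (!(resolve(item, objects) instanceof Name)) {
                    return false;
                }
            }
            return true;
        }
        return false; // dictionary, number, unresolved reference, and so on
    }
}
```

Under these assumptions a fictitious name such as /DXTDecode still passes, because only the type of the value is inspected, while a dictionary in place of the filter value fails.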

Hope this makes it clear

Sam

asciim0 commented 5 days ago

Just to make sure we're talking about the same thing here: what is being checked is the value of the key Filter, right? As per the spec (ISO 32000-2:2020, sect. 7.4, Table 5), that can be:

Does that align with what is being checked? Or isn't it the value of the /Filter at all that is being checked? What I'm trying to do is trigger the PDF-HUL-45 rule by manipulating a file containing a filter ... but I can change the value to pretty much whatever I want to (nothing, integer, indirect reference) and the file is still validated as well-formed and valid.

samalloing commented 4 days ago

Sure! No, the value of the filter is not checked. What is checked is whether it is "an array of zero, one or several names" (of filters), or whether it is a PDF object or an indirect reference. But this is just the structure of the PDF. What I mean is that in your example a filter is at "17 0 obj". That is the only thing that is checked. If you want to trigger PDF-HUL-45, I'll send you an example.

asciim0 commented 2 days ago

I took a look at the file that Sam shared with me. It seems that the error is triggered by the indirect reference. The error was thrown at the end of obj 183: 183 0 obj [/ASCII85Decode /LZWDecode]

obj 183 is referenced by obj 184: 184 0 obj <</Filter 183 0 R /Length 185 0 R>> stream

As far as I understand the spec, arrays (like all objects) can be represented by indirect objects and filter values can be names or arrays ... and therefore also indirect objects. The syntax of the array looks fine. I therefore believe that there is still a possible case of a false positive for this error, as shown here.

I also still don't understand what the malformed filter then checks :-P