veraPDF / veraPDF-library

Industry supported, open source PDF/A validation library
http://verapdf.org/software
GNU General Public License v3.0

Inconsistent widths #842

a20god closed 7 years ago

a20god commented 7 years ago

veraPDF 1.7.63 (and older) claims that this document violates 6.2.11.5 of ISO 19005-2:2011:

tmp40.pdf

Three other PDF/A validators believe that the document is compliant.

a20god commented 7 years ago

This might be related to https://github.com/veraPDF/veraPDF-library/issues/834.

a20god commented 7 years ago

This one might be simpler to analyze as it uses a subsetted font:

tmp42.pdf

Again, the document is compliant according to three other PDF/A validators.

a20god commented 7 years ago

I think COSStream.concatenateStreams() is broken: for short content streams it puts lots of NUL characters into the temporary file. writeStreamToFile() ignores the number of bytes returned by stream.read(tmp) (variable "read") and always writes the complete array, even if it has been filled only partially. This becomes a real problem for content streams longer than 2048 bytes, because on the last iteration the unfilled part of the byte array no longer contains NULs but stale bytes left over from the previous read.
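A minimal sketch of the copy loop I would expect, assuming a 2048-byte buffer as described above. The class and method names (StreamCopy, copy) are made up for illustration and are not veraPDF's actual code; the point is only that the count returned by read() bounds what gets written.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not the real writeStreamToFile().
final class StreamCopy {
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] tmp = new byte[2048];
        long total = 0;
        int read;
        // read() returns how many bytes were actually placed in tmp, or -1 at EOF.
        while ((read = in.read(tmp)) != -1) {
            // Write only the filled part of the buffer, never the whole array.
            out.write(tmp, 0, read);
            total += read;
        }
        return total;
    }
}
```

With this loop a short stream produces no NUL padding, and the final partial chunk of a long stream does not carry bytes from the previous iteration.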

a20god commented 7 years ago

Example: concat1.pdf

a20god commented 7 years ago

Also test with this one: concat2.pdf

shem-sergey commented 7 years ago

Thank you, that is a severe error indeed.

a20god commented 7 years ago

Note that there is an implicit "token separator" between the streams of the array. concat2.pdf demonstrates that. Inserting a space between streams probably won't work in certain pathological cases.

a20god commented 7 years ago

My content stream parser treats the end of a stream in a Contents array as EOF as far as tokenization is concerned and then moves on to the next stream. That is, it does not really concatenate the streams.
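Roughly like this, as a simplified sketch rather than my real parser: the names are made up, the input is assumed to be the already decoded stream data, and the tokenizer only splits on PDF white-space characters (it ignores delimiters, strings, comments and inline images). What it shows is that each stream is tokenized on its own, so a token can never span two streams.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; a real tokenizer must follow the full PDF lexical rules.
final class PerStreamTokenizer {
    static List<String> tokenize(List<byte[]> streams) {
        List<String> tokens = new ArrayList<>();
        for (byte[] stream : streams) {
            // ISO-8859-1 maps every byte to exactly one char, so nothing is lost.
            String text = new String(stream, StandardCharsets.ISO_8859_1);
            // Split on PDF white space: NUL, HT, LF, FF, CR, SP.
            for (String token : text.split("[\\x00\\t\\n\\f\\r ]+")) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
            // The end of this stream acts as EOF for tokenization;
            // the next stream starts with a fresh token.
        }
        return tokens;
    }
}
```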

shem-sergey commented 7 years ago

I think that simple concatenation of streams is an appropriate solution, as this is exactly what the specification says.

a20god commented 7 years ago

Well, the specification says:

The division between streams may occur only at the boundaries between lexical tokens

Note that PDF 2.0 clarifies how to concatenate:

If the value is an array, the effect shall be as if all of the streams in the array were concatenated with at least one white-space character added between the streams’ data, in order, to form a single stream.
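A minimal sketch of concatenation in the PDF 2.0 sense, assuming the streams of the Contents array have already been decoded into byte arrays. ContentsJoiner and join are hypothetical names, not veraPDF API, and the choice of a line feed as the separator is mine; any white-space character would satisfy the wording quoted above.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

// Hypothetical helper illustrating the PDF 2.0 wording.
final class ContentsJoiner {
    static byte[] join(List<byte[]> decodedStreams) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < decodedStreams.size(); i++) {
            if (i > 0) {
                // One white-space byte between streams keeps the last token of
                // one stream from fusing with the first token of the next.
                out.write('\n');
            }
            byte[] data = decodedStreams.get(i);
            out.write(data, 0, data.length);
        }
        return out.toByteArray();
    }
}
```

For example, joining "q 1 0 0 1 0 0 cm" and "Q" this way yields "cm" and "Q" as separate tokens, whereas plain concatenation would produce the single token "cmQ".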