veraPDF / veraPDF-library

Industry supported, open source PDF/A validation library
http://verapdf.org/software
GNU General Public License v3.0

Duplicate MCID in tagged content not identified #1405

Status: Closed (jozefbaranec closed 3 months ago)

jozefbaranec commented 7 months ago

The attached PDF (duplicate-mcid.pdf) has a content stream in which two marked content sequences share MCID 23, separated by an Artifact. PDF/UA validation does not report this issue.

The content:

/P <</MCID 23>>BDC
q
BT
/F0 10.5 Tf
1 0 0 1 51.02298 340.273041 Tm
[(S)-2(alo)3(n)]TJ
ET
Q
EMC
/Artifact <</BBox[ 51.023 340.126 84.8225 347.728]/Type/Layout>>BDC
q
BT
/F0 10.5 Tf
1 0 0 1 74.395966 340.273041 Tm
(...)Tj
ET
Q
EMC
/P <</MCID 23>>BDC
q
BT
/F0 10.5 Tf
1 0 0 1 80.80098 340.273041 Tm
<94>Tj
ET
Q
EMC

bdoubrov commented 7 months ago

@jozefbaranec thanks for reporting this issue. We've already encountered a number of test files with this issue, but always treated it silently as a minor issue, processing both marked content sequences (with identical MCIDs) as if they belong to the same parent in the structure tree.

I agree this issue seems more severe than that, and some other tools process this case differently. So we'll report this deviation from ISO 32000-2 as a WARNING log message.
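
For readers who want to check a content stream themselves, here is a minimal sketch of the duplicate-MCID check in Python. It is not veraPDF code: `find_duplicate_mcids` and the regex are hypothetical helpers, and the sketch assumes a decoded, single-page content stream with inline property dictionaries only (no `/Properties` resource lookups, no handling of operator names that happen to appear inside string literals).

```python
import re
from collections import Counter

# Hypothetical pattern, not part of veraPDF: matches "/Tag <<... /MCID n ...>> BDC"
# where the property list is an inline dictionary without nested ">" characters.
MCID_BDC = re.compile(r"/(\w+)\s*<<[^>]*?/MCID\s+(\d+)[^>]*?>>\s*BDC")


def find_duplicate_mcids(content: str) -> list[int]:
    """Return the MCIDs that open more than one marked content sequence."""
    counts = Counter(int(m.group(2)) for m in MCID_BDC.finditer(content))
    return sorted(mcid for mcid, n in counts.items() if n > 1)


# Shortened version of the stream from this issue (graphics state operators omitted).
STREAM = """
/P <</MCID 23>>BDC
BT (Salon) Tj ET
EMC
/Artifact <</BBox[ 51.023 340.126 84.8225 347.728]/Type/Layout>>BDC
BT (...) Tj ET
EMC
/P <</MCID 23>>BDC
BT <94> Tj ET
EMC
"""

print(find_duplicate_mcids(STREAM))
```

A real checker would work on parsed operators rather than on text, since a regex cannot tell an operator from the same characters inside a string.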

bdoubrov commented 6 months ago

Another related issue with broken structure of marked content sequences is when one sequence is contained in another (the two possibly having different MCIDs). We'll add this check as well, as we see different implementations handling this violation of the spec in inconsistent ways.
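
The nesting case can be sketched in the same spirit with a stack of open sequences. Again, this is an illustration under assumptions, not the veraPDF implementation: `find_nested_mcids`, `TOKEN`, and `MCID` are hypothetical names, the input is assumed to be a decoded content stream with inline property dictionaries, and operator names inside string literals are not handled.

```python
import re

# Hypothetical tokenizer: an opening operator (BDC/BMC, optional inline dict) or EMC.
TOKEN = re.compile(
    r"/(?P<tag>\w+)\s*(?P<props><<[^>]*>>)?\s*(?P<open>BDC|BMC)\b"
    r"|\b(?P<close>EMC)\b"
)
MCID = re.compile(r"/MCID\s+(\d+)")


def find_nested_mcids(content: str) -> list[tuple[int, int]]:
    """Return (outer, inner) MCID pairs where one MCID sequence opens inside another."""
    stack = []   # MCID, or None, for each currently open sequence
    nested = []
    for m in TOKEN.finditer(content):
        if m.group("close"):
            if stack:
                stack.pop()  # EMC closes the innermost open sequence
            continue
        found = MCID.search(m.group("props") or "")
        mcid = int(found.group(1)) if found else None
        if mcid is not None:
            # Innermost enclosing sequence that carries an MCID, if any.
            outer = next((x for x in reversed(stack) if x is not None), None)
            if outer is not None:
                nested.append((outer, mcid))
        stack.append(mcid)
    return nested


# Made-up example stream: MCID 2 opens before MCID 1 is closed.
NESTED = """
/P <</MCID 1>>BDC
BT (outer) Tj ET
/Span <</MCID 2>>BDC
BT (inner) Tj ET
EMC
EMC
/P <</MCID 3>>BDC
BT (flat) Tj ET
EMC
"""

print(find_nested_mcids(NESTED))
```

Sequences without an MCID (for example `/Artifact`) are pushed as `None`, so only MCID-inside-MCID nesting is reported in this sketch.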

bdoubrov commented 3 months ago

The latest dev version adds log warnings for duplicated MCIDs and for cases where one marked content sequence is embedded in another.

bdoubrov commented 3 months ago

Added to the latest veraPDF release, 1.26.