veraPDF / veraPDF-library

Industry supported, open source PDF/A validation library
http://verapdf.org/software
GNU General Public License v3.0

veraPDF fails when an embedded file Subtype contains media type parameters #1460

Open petervwyatt opened 1 month ago

petervwyatt commented 1 month ago

In PDF Errata #155 it was decided that PDF does not prohibit Media Type parameters in the Subtype entry of an embedded file stream dictionary. Listen to the recording if you want the full gist of the discussion. Since no PDF/A standard adds any further restriction (such as prohibiting them), PDF/A files may therefore contain Media Type parameters as per RFC 2046.

For example, for an email with "Content-Type: text/xml; charset=UTF-8", the matching Subtype entry would be /Subtype /text#2fxml;#20charset=UTF-8, but this fails validation by veraPDF.

It appears that veraPDF is using the regex /^[-\w+\.]+\/[-\w+\.]+$/ (such as here). AFAICT this doesn't account for = or # (in Java, \w = [a-zA-Z_0-9]).
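
To make the failure easy to reproduce outside veraPDF, here is a minimal standalone Java sketch. It only exercises the regex quoted above; it does not call any veraPDF code, and I have not checked whether veraPDF matches against the raw name or the decoded one, so both are tried. The class name SubtypeCheck, the decodeName helper and the RELAXED pattern are hypothetical names of mine, and RELAXED is a simplified illustration of accepting "; name=value" parameters, not the full RFC 2045 token grammar and not a proposed patch.

```java
import java.util.regex.Pattern;

final class SubtypeCheck {
    // The pattern quoted in this issue, written as a Java string literal.
    static final Pattern STRICT = Pattern.compile("^[-\\w+\\.]+/[-\\w+\\.]+$");

    // Hypothetical relaxed pattern (assumption, simplified): type/subtype
    // followed by zero or more "; name=value" parameters, where value is
    // either a simple token or a double-quoted string.
    static final Pattern RELAXED = Pattern.compile(
        "^[-\\w+.]+/[-\\w+.]+(\\s*;\\s*[-\\w+.]+=(\"[^\"]*\"|[-\\w+.]+))*$");

    // Decode the #xx hex escapes of a PDF name (given without the leading '/').
    static String decodeName(String raw) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '#' && i + 2 < raw.length()
                    && Character.digit(raw.charAt(i + 1), 16) >= 0
                    && Character.digit(raw.charAt(i + 2), 16) >= 0) {
                sb.append((char) Integer.parseInt(raw.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "text#2fxml;#20charset=UTF-8";
        String decoded = decodeName(raw); // "text/xml; charset=UTF-8"

        System.out.println("decoded            : " + decoded);
        System.out.println("strict, no params  : " + STRICT.matcher("text/xml").matches());
        System.out.println("strict, raw name   : " + STRICT.matcher(raw).matches());
        System.out.println("strict, decoded    : " + STRICT.matcher(decoded).matches());
        System.out.println("relaxed, decoded   : " + RELAXED.matcher(decoded).matches());
    }
}
```

With the quoted regex, the plain "text/xml" matches, while the Subtype above is rejected both in its raw form (no literal "/" and the "#", ";" and "=" characters are outside the character class) and in its decoded form (";", space and "=" are outside the character class).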

PS. Found as part of EA-PDF