pdf-association / pdf-issues

Industry-based resolutions for issues and errata reported against any PDF-related specification
https://pdf-issues.pdfa.org/

Encoding of URL in developer extensions dictionary (ISO 32000-2:2020) #238

Open a20dev opened 1 year ago

a20dev commented 1 year ago

Table 49 does not specify the encoding of the URL entry. In other places, e.g., Table 238 ("ASCII string") and 7.11.5 ("RFC 3986"), the encoding is specified, but none of these apply to developer extensions dictionaries.

petervwyatt commented 1 year ago

It is currently defined as a string, which means that any kind of string from subclause 7.9.2 is permitted. But since this represents a URL, any Unicode should be %-encoded, so reducing it to an ASCII string seems appropriate.

MatthiasValvekens commented 1 year ago

> But since this represents a URL any Unicode should be %-encoded so reducing to ASCII string seems appropriate.

Do we allow IRIs? If so, this is too restrictive. If not, we should probably clarify that (e.g. with a note saying that IRIs can be represented using percent encoding or punycode).

petervwyatt commented 1 year ago

I think URLs are sufficient for this purpose so adding a note to use percent-encoding would be fine. This then is aligned with 7.11.5, 12.6.4.8, and 14.10.3.2 uses of URLs.

ISO 32000-2 does not mention Punycode or reference the RFC now, so I'd avoid extending.
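For illustration only, the percent-encoding approach discussed above could look roughly like the following Python sketch. The function name `iri_to_ascii_url` is hypothetical, the choice of "safe" characters is an assumption, and userinfo in the authority is deliberately ignored. It also applies Punycode (via Python's built-in `idna` codec) to the host name, which goes beyond what ISO 32000-2 mentions and is shown only to make the IRI-to-URL mapping concrete.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_ascii_url(iri: str) -> str:
    """Map an IRI to a 7-bit ASCII URL (illustrative sketch, not normative)."""
    parts = urlsplit(iri)

    # Host: encode non-ASCII labels with Punycode via the stdlib "idna" codec.
    # Assumption: any userinfo ("user:pass@") is dropped for simplicity.
    host = parts.hostname or ""
    if host:
        host = host.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    # Path, query, fragment: percent-encode the UTF-8 bytes of non-ASCII
    # characters. "%" is kept safe so existing escapes are not double-encoded.
    path = quote(parts.path, safe="/%:@!$&'()*+,;=")
    query = quote(parts.query, safe="/?%:@!$&'()*+,;=")
    fragment = quote(parts.fragment, safe="/?%:@!$&'()*+,;=")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(iri_to_ascii_url("https://bücher.example/spéc?q=é"))
```

The result contains only ASCII characters, so it could be stored in an entry typed as "ASCII string" without relying on a reader to understand text string encodings.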

petervwyatt commented 1 year ago

Leonard wrote a doc for ISO back in 2009/2010 about IRIs, URLs, Punycode, etc. We will locate this doc and review it within the PDF Association to see if a worthwhile TechNote can be made.

lrosenthol commented 1 year ago

Here is the information from the document in question:

We’ve evaluated this issue in the past, as we are aware that it’s a problem with PDF existing in the “modern web”. There are two reasons that we haven’t solved it – 1) language/file format issues and 2) Acrobat/Reader changes.

As far as the file format (PDF language) is concerned, the problem is really around compatibility. Today, a URI can be included in a PDF in the following places:

• Base URI entry in the Catalog – URI Dictionary
• Navigator UUID – text string
• Link Annotation /PA (from Web Capture) – URI Dictionary
• URI action – URI Dictionary
• URI entry in the URI Dictionary – ASCII string
• RichText link – XHTML string/stream
• OutputIntent RegistryName – text string

In addition, there are a few places in the PDF spec that refer to URLs instead of URIs and specifically reference RFC 1738 and 7-bit ASCII (except where noted below):

• URL specifications in a FileSpec dictionary – ASCII string (or PDFDocEncoding is allowed)
• URL entry in Extension dictionary – text string
• URL entry in TimeStamp seed value – ASCII string
• URL entry in Certificate seed value – ASCII string
• Caption entry in PaperMetaData dictionary – text string
• Submit action – URL-based file specification
• BU entry for MediaClips MH/BE dicts – ASCII string
• U key in Software Identifier (for Media) – ASCII string
• URL Strings in WebCapture content sets – ASCII string
• AU entry in source information dictionary – ASCII string
• U and C entries in URL Alias dictionary – ASCII string
• URL entry in Web Capture command dictionary – ASCII string
• URLs entry in OutputIntent – URL-based file specification

As you can see from this list, almost every place that uses a URL or URI defines it as a 7-bit ASCII string, although there is a limited set of places that happen to allow “text strings”, which are defined in either PDFDocEncoding (ISO Latin 1) or UTF-16BE.

Although it would be possible to simply change the definition of some/all of the ASCII strings to “text strings” in UTF-16BE and maintain file format compatibility (since a string is a string syntactically), the fact is that you’d break compatibility with existing readers (from Adobe and elsewhere) that are only expecting those values to be ASCII. This would mean that for some/all of the keys where you wanted to support IRIs, you would need to create NEW keys where the IRI data would go (plus you’d probably also put in a recommendation to have the producer put the URI information in as well).
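To make the compatibility concern concrete, here is a minimal Python sketch of how a reader might tell the string flavours apart from the raw bytes. The function name `classify_pdf_string` and its return labels are hypothetical. The byte order marks are the ones PDF uses for text strings (FE FF for UTF-16BE, and EF BB BF for UTF-8 in PDF 2.0). A reader that skips this step and decodes the bytes as ASCII fails on a UTF-16BE value.

```python
def classify_pdf_string(raw: bytes) -> str:
    """Guess the encoding of a decoded PDF string object (illustrative only)."""
    if raw.startswith(b"\xfe\xff"):
        return "utf-16be"          # text string with UTF-16BE byte order mark
    if raw.startswith(b"\xef\xbb\xbf"):
        return "utf-8"             # text string with UTF-8 BOM (PDF 2.0)
    if all(byte < 0x80 for byte in raw):
        return "ascii"             # safe for readers that expect 7-bit ASCII
    return "pdfdocencoding-or-bytes"  # 8-bit data without a BOM


if __name__ == "__main__":
    ascii_url = b"https://www.pdfa.org/"
    utf16_url = b"\xfe\xff" + "https://bücher.example/".encode("utf-16-be")

    print(classify_pdf_string(ascii_url))
    print(classify_pdf_string(utf16_url))

    # A legacy reader that assumes ASCII cannot decode the UTF-16BE value.
    try:
        utf16_url.decode("ascii")
    except UnicodeDecodeError:
        print("ASCII-only reader fails on the UTF-16BE text string")
```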

petervwyatt commented 1 year ago

ISO 32000-2 did not adopt PaperMetaData or the Navigator UUID.

We also added the NS entry in the Logical structure Namespace dictionary (14.7.4.2 and Table 356) as a text string. And there are a few more new PDF 2.0 features that utilize File Spec dictionaries and thus "inherit" URI/URLs via that mechanism: Associated Files and PronunciationLexicon to name just two.

JS (ECMAScript) can also include URI/URLs. And, of course, there is the entire PDF Fragment Identifier feature in Annex O (but that is not a file format thing).

petervwyatt commented 1 year ago

See also #256