Embedded File Stream dictionary (table 44) Subtype definition is bad.

datalogics-pgallot commented 2 years ago

In the Subtype entry of Table 44 (section 7.4.11.1), we have: "The value of this entry shall conform to the MIME media type names defined in Internet RFC 2046, with the provision that characters not permitted in names shall use the 2-character hexadecimal code format described in 7.3.5, "Name objects"."

The problem with this is that RFC 2046 defined (in 1996) an initial hierarchy of media type names with a note that: "It should be noted that the list of media type values given here may be augmented in time, via the mechanisms described above, and that the set of subtypes is expected to grow substantially."

The normative grammar of media types was actually given in RFC 2045; however, it has since been restricted somewhat in section 4.2 of RFC 6838, "Media Type Specifications and Registration Procedures".

Current best(?) practice with PDF is to use both IANA-registered media types (https://www.iana.org/assignments/media-types/media-types.xhtml) and not-currently-registered media types such as "model/u3d" and "model/prc". Such unregistered (or perhaps pre-registered) media types should probably conform to the RFC 6838 registration requirements to the extent possible, rather than to RFC 2046.

Also note that revision of this entry should take RFC6657 into consideration, without which non-UTF8 plain-text attachments are just bit-rot.

petervwyatt commented 2 years ago

We are in the process of registering model/prc and model/u3d with IANA. model/step and friends are already registered by ISO TC 184 SC 4.

petervwyatt commented 2 years ago

I think this issue now reduces down to whether PDF requires, prohibits, recommends, stays silent, or has any informative notes on the use/need or default value for charset parameters with textual MIME media types.

Since RFC2046 specifically describes the charset parameter, I would assume it is therefore permitted in PDF under the current wording of Table 44. RFC2046 also clearly states the default value as US-ASCII (https://datatracker.ietf.org/doc/html/rfc2046#section-4.1.2), but RFC6657 has since changed this recommendation (https://datatracker.ietf.org/doc/html/rfc6657#section-3). I therefore don't believe we can change PDF to switch to RFC6657, since the charset of existing PDFs without an explicit charset parameter would then also change.

So maybe the best solution is an informative note acknowledging RFC6657, but stating that for backwards compatibility reasons the default charset used in PDF remains "US-ASCII" as stated in RFC2046. And if new files want a UTF-8 default like RFC6657 recommends, then they will have to add it explicitly.
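To make the backwards-compatibility argument concrete, here is a minimal sketch of how a consumer might resolve the charset under that rule. The function name `effective_charset` is hypothetical, and the sketch assumes the Subtype value has already been decoded from its PDF name form into a plain media type string; nothing in ISO 32000 requires a processor to do this.

```python
from email.message import Message


def effective_charset(media_type: str, default: str = "us-ascii") -> str:
    """Return the explicit charset parameter, or the fallback default.

    `default` is "us-ascii" to mirror the RFC 2046 default discussed above;
    passing "utf-8" would model the RFC 6657 recommendation instead.
    """
    msg = Message()
    msg["Content-Type"] = media_type
    # get_param returns `failobj` when the parameter is absent.
    return str(msg.get_param("charset", failobj=default)).lower()


print(effective_charset("text/plain"))
print(effective_charset("text/plain; charset=UTF-8"))
```

Under this reading, only files that spell out `charset=utf-8` get UTF-8; everything else keeps the RFC 2046 default.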

MatthiasValvekens commented 2 years ago

> I don't believe therefore we can change PDF to switch to RFC6657 since then the charset of existing PDF without an explicit charset parameter would also be changed.

Just to play devil's advocate: is it really that problematic to adopt the UTF-8 default from RFC 6657 going forward? Given that all valid US-ASCII data decodes to the same thing under UTF-8, I don't think existing (properly encoded!) files would be affected. If anything, we'd be making a bunch of currently invalid files valid.
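A quick way to check the claim that US-ASCII is a strict subset of UTF-8 is sketched below. The helper `decodes_as` is a hypothetical name used only for illustration.

```python
# Every 7-bit byte value decodes to the same text under ASCII and UTF-8.
ascii_bytes = bytes(range(128))
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Non-ASCII text encoded as UTF-8 is valid UTF-8 but not valid US-ASCII,
# which is the "currently invalid files become valid" case.
utf8_bytes = "na\u00efve".encode("utf-8")
print(same_text, decodes_as(utf8_bytes, "utf-8"), decodes_as(utf8_bytes, "ascii"))
```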

I believe the potential concern would rather be with older processors assuming that something is ASCII when it's really UTF-8, but that (to me) feels like much less of a problem than invalidating existing data out there.

Anyway, I don't want to bikeshed this too much, and I'm OK with just adding a note to acknowledge RFC 6657, but I wouldn't object to just changing the default either. :)

lrosenthol commented 2 years ago

@lrosenthol will investigate this

petervwyatt commented 2 years ago

PDF TWG discussed: there is no situation we can identify where the encoding of an embedded text file needs to be known by the PDF processor and thus the charset is not required by PDF.

petervwyatt commented 3 months ago

Re-opening this Errata since the previous resolution did not fully address the PDF syntactic requirements of what Subtype can or cannot contain. The previous resolution agreed that:

1. the RFC reference should remain unchanged (to RFC 2046), and
2. any parameters (such as charset=) are not required by PDF processors. (And noting that there are NO processor requirements anywhere in 32K related to Subtype)

No disagreements there.

Point (2) above seems to presume that parameters therefore might be present in Subtype (suitably encoded into a PDF name object) but several validators reject such PDFs (primarily due to the presence of a SEMI-COLON AFAICT). The ISO 32K phrasing "The value of this entry shall conform to the MIME media type names defined in Internet RFC 2046..." uses the phrase "media type name" which does not occur in RFC 2046...

RFC 2046 Section 2 clearly states that:

The definition of a top-level media type consists of:
    (1)   a name and a description of the type, including
          criteria for whether a particular type would qualify
          under that type,
    (2)   the names and definitions of parameters, if any, which
          are defined for all subtypes of that type (including
          whether such parameters are required or optional),
     ...

where "name" as used in RFC 2046 is a primitive lexical construct occurring in the type, the subtype ("description" here) and the parameters, as mentioned elsewhere in RFC 2046 (and clearly has nothing to do with PDF names). Thus a more accurate wording for ISO 32000-2 would be to simply use "media type" and delete "name", since "name" is an incorrect and confusing use of a term of art from RFC 2046.

Thus “text”, “text/plain”, “text/foo”, “text/plain; charset=utf-8” and “text/plain; charset=utf-8; foo=bar; bar=foo” when encoded as PDF names would all pass the "The value of this entry shall conform to the MIME media type ~~names~~ defined in Internet RFC 2046..." requirement.
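For illustration, here is a rough sketch of how those strings would look once written as PDF names. It assumes the 7.3.5 rules as I understand them: whitespace, delimiters and NUMBER SIGN are written as `#` plus two hex digits, as are bytes outside 21h to 7Eh. The function name `to_pdf_name` is hypothetical.

```python
# PDF delimiter characters, which cannot appear literally inside a name.
PDF_DELIMITERS = b"()<>[]{}/%"


def to_pdf_name(value: str) -> str:
    """Serialise a string as a PDF name object, escaping where needed."""
    out = ["/"]  # leading SOLIDUS introduces the name
    for byte in value.encode("utf-8"):
        needs_escape = (
            byte < 0x21          # whitespace and control characters
            or byte > 0x7E       # outside the printable ASCII range
            or byte == 0x23      # NUMBER SIGN itself
            or byte in PDF_DELIMITERS
        )
        out.append(f"#{byte:02X}" if needs_escape else chr(byte))
    return "".join(out)


print(to_pdf_name("text/plain"))
print(to_pdf_name("text/plain; charset=utf-8"))
```

Note that under these rules only the SOLIDUS and the SPACE need escaping; SEMI-COLON and EQUALS are regular characters in a PDF name and are written as themselves.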

So proposed solution is to delete "name".

petervwyatt commented 3 months ago

PDF TWG would like more input from others - review again next time.

lrosenthol commented 3 months ago

In this case, PDF does not support the parameters. So any change would be to make it clear that parameters are not permitted for this value - it is strictly the "media type".

petervwyatt commented 1 month ago

After discussions, PDF TWG appreciates that the media type without parameters will cause field issues for some. Need to define wording to clarify the current limitations. Possibly consider a new key for media type parameters.

petervwyatt commented 1 month ago

Proposed wording to make things very clear as to what syntactically seems to work in PDF apps vs not work:

The value of this entry shall conform to a subset of the MIME media type as defined in Internet RFC 2046, section 2. This entry shall only include the top-level media type and its description separated by a SOLIDUS (2Fh) (/), and shall not include SEMI-COLON (3Bh) (;), EQUALS (3Dh) (=), NUMBER SIGN (23h) (#), or any media type parameters or sub-parameters. Additionally, characters not permitted in PDF name objects shall use the 2-character hexadecimal code format described in 7.3.5, "Name objects".
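As a sanity check on the proposed wording, a minimal sketch of the syntactic test it implies is given below. It operates on the media type string after PDF name decoding. The function name `is_bare_media_type` is hypothetical, and rejecting whitespace is my own assumption (whitespace only appears in practice around parameters); this does not check the full RFC 2045 token grammar.

```python
# Characters the proposed wording explicitly excludes.
FORBIDDEN = set(";=#")


def is_bare_media_type(value: str) -> bool:
    """Check for exactly `type/subtype` with no parameters."""
    if any(ch in FORBIDDEN or ch.isspace() for ch in value):
        return False
    parts = value.split("/")
    # Exactly one SOLIDUS, with a non-empty token on each side.
    return len(parts) == 2 and all(parts)


for candidate in ("text/plain", "model/u3d", "text", "text/plain; charset=utf-8"):
    print(candidate, is_bare_media_type(candidate))
```

With this reading, “text” alone and anything carrying parameters would fail, while “text/plain” and “model/u3d” would pass.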

pdf-association / pdf-issues

Embedded File Stream dictionary (table 44) Subtype definition is bad. #155