w3c / i18n-activity

Home pages, charters, style-guides, and similar documents related to the W3C Internationalization Activity.

[WoT Profile] Unclear character set constraints and non-UTF-8 html #1664

Open himorin opened 1 year ago

himorin commented 1 year ago

This is a tracker issue. Only discuss things here if they are i18n WG internal meta-discussions about the issue. Contribute to the actual discussion at the following link:

§ https://github.com/w3c/wot-profile/issues/386

aphillips commented 1 year ago

Note that a number of the Media Types you mention are already constrained to use UTF-8 and do not require (or, in some cases, allow) a charset parameter.

Is your comment:

All non-binary formats shall have constraint of charset as UTF-8.

... meant to be a suggestion to add to the quoted paragraph?

himorin commented 1 year ago

Actually, the table has a Constraint column, and some rows specifically have charset=UTF-8. I understand that the HTML spec (WHATWG) limits the encoding to UTF-8, and that RFC 8259 states no charset parameter is registered for the JSON media type, but reading 6.6.1, I'm really not sure whether the current wording is appropriate or friendly to readers of the specification (e.g. it could just have a line such as 'UTF-8 is mandatory for all payloads')...

himorin commented 1 year ago

@aphillips thank you for your (and the WG's) comments during the call. I'm still wondering how to write the last line, but how about the edited text?

aphillips commented 1 year ago

@himorin Thanks for working on this.

For the table I would change this:

| Relation-Type | Constraint | Remarks |
|---|---|---|
| service-doc | human readable documentation, supported formats are Unicode Text, markdown, HTML and PDF. | |

to use the remarks more clearly:

| Relation-Type | Constraint | Remarks |
|---|---|---|
| service-doc | supported media types are: text/plain, text/html, text/markdown and text/pdf | Human readable documentation |

And I would go on to add a paragraph under the table:

The types text/plain, text/html, and text/markdown MUST include a charset parameter (for example, text/plain;charset=utf-8) and the linked files MUST use the UTF-8 character encoding. The type text/pdf uses Unicode in its encoding.

Note well: RFC2854 defines text/html and is not obsolete. When the charset parameter is missing, the default encoding is Latin-1 (and specifically iso-8859-1). In practice browsers treat Latin-1 as windows-1252 and HTML5 sniffs the encoding in various ways (weighted towards trying to find UTF-8). However, it is still a good idea to use charset=UTF-8.
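To illustrate why the explicit parameter matters, here is a minimal Python sketch of what happens when UTF-8 bytes are decoded with the Latin-1 fallback. The sample string is only an assumption chosen for illustration; it does not come from the WoT Profile.

```python
# Bytes of a UTF-8 encoded document containing one non-ASCII character.
payload = "café".encode("utf-8")

# A consumer that falls back to ISO-8859-1 maps every byte to one character,
# so the two-byte UTF-8 sequence for "é" turns into two unrelated characters.
as_latin1 = payload.decode("iso-8859-1")

# A consumer that is told charset=UTF-8 recovers the original text.
as_utf8 = payload.decode("utf-8")

print(as_latin1)
print(as_utf8)
```

The first line printed is mojibake, the second is the original word.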

Annoyingly, the definition for type text/markdown in RFC7763 is actually unhelpful: it requires a charset parameter but does not make UTF-8 (or any other encoding) the default because (and I quote):

[...] its syntax rules operate on characters (specifically, on punctuation) rather than code points. Many Markdown processors will get along just fine by operating on characters in the US-ASCII repertoire (specifically punctuation), blissfully oblivious to other characters or codes.

Therefore, in 6.6.2 I would include the charset=UTF-8 on all three of the first rows. I would then add a similar paragraph to the one in 6.6.1 saying approximately:

The types text/plain, text/html, and text/markdown MUST include a charset parameter (for example, text/plain;charset=utf-8) and the linked files MUST use the UTF-8 character encoding. The types application/json and application/ld+json are already restricted to UTF-8. The type text/pdf uses Unicode in its encoding. Binary types, such as image/jpeg or application/octet-stream, do not have a character encoding associated with them or define the encoding internally.
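As a rough sketch of how a consumer might check the charset rule in that paragraph, the Python below parses a media type string with the standard library's `email.message.Message`. The names `CHARSET_REQUIRED` and `charset_ok` are hypothetical, and treating every other type as passing is an assumption of this sketch, not something the specification text says.

```python
from email.message import Message

# Types that the proposed paragraph requires to carry an explicit charset.
CHARSET_REQUIRED = {"text/plain", "text/html", "text/markdown"}


def charset_ok(content_type: str) -> bool:
    """Return True if the media type string satisfies the proposed rule."""
    msg = Message()
    msg["Content-Type"] = content_type
    media_type = msg.get_content_type()  # lower-cased "type/subtype"
    if media_type not in CHARSET_REQUIRED:
        # JSON, PDF and binary types are not checked in this sketch.
        return True
    # get_content_charset() returns the lower-cased charset, or None if absent.
    return msg.get_content_charset() == "utf-8"


print(charset_ok("text/plain;charset=utf-8"))
print(charset_ok("text/html"))
```

This only inspects the declared parameter; verifying that the linked file really is UTF-8 would need a separate decode of its bytes.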

himorin commented 1 year ago

@aphillips Thank you for the deep consideration. I've thought about that style of table a bit, but haven't gone in that direction since it overlaps with the next table... If we are to propose adding media types to the table of link relations, I'd rather propose merging the two, something like:

| Relation-Type | Supported Media Types | Constraints | Remarks |
|---|---|---|---|
| icon | image/png, image/jpeg | | |
| service-doc | text/plain, text/html, text/markdown, text/pdf | Linked files MUST use the UTF-8 character encoding. | Human readable documentation |

Keeping two separate tables that contain similar information (media types) could be confusing for readers, and also makes it difficult to compile the information. With the last paragraph of @aphillips's comment attached below the integrated table, it seems easier to convey everything at once.
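One way to picture the merged table is as a single lookup keyed by relation type. The Python below is only a sketch: `RelationRule`, `RULES`, and `is_supported` are hypothetical names, and the entries simply mirror the two rows proposed in this comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationRule:
    media_types: frozenset   # "Supported Media Types" column
    utf8_required: bool      # "Constraints" column, reduced to one flag
    remarks: str = ""        # "Remarks" column


# One entry per row of the proposed merged table.
RULES = {
    "icon": RelationRule(frozenset({"image/png", "image/jpeg"}), False),
    "service-doc": RelationRule(
        frozenset({"text/plain", "text/html", "text/markdown", "text/pdf"}),
        True,
        "Human readable documentation",
    ),
}


def is_supported(rel: str, media_type: str) -> bool:
    """Check a link's media type against the row for its relation type."""
    rule = RULES.get(rel)
    return rule is not None and media_type.lower() in rule.media_types


print(is_supported("service-doc", "text/markdown"))
print(is_supported("icon", "text/html"))
```

With one structure per relation type, the media types and their constraints are read from the same place, which is the point of merging the tables.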

himorin commented 1 year ago

Ahhh, in addition to making UTF-8 mandatory, do we need to change hreflang from optional to required for text/plain and text/markdown with service-doc, and leave it blank for anything else?

himorin commented 1 year ago

@aphillips how about this??


Section 6 (Links) is unclear and disorganized on several points:

  1. Link relation types are strongly tied to media types as constraints, but these media types carry additional constraints of their own, which leaves the descriptions scattered across the specification.
  2. The constraint for the service-doc link relation type is written as

human readable documentation, supported formats are Unicode Text, markdown, HTML and PDF.

but the wording "Unicode" is not clear. Considering the restrictions placed on the media types, it should be clearly stated that UTF-8 is mandatory for all applicable types.

  3. hreflang is marked as optional, but should be mandatory for text/plain, text/markdown, and possibly text/html.

We propose rewriting this section as a single table, for clarity and so that all of the constraints are easy to spot, and reorganizing all of the accompanying descriptive text, something like:

| Relation-Type | Supported Media Types | Constraints | Remarks |
|---|---|---|---|
| icon | image/png, image/jpeg | | |
| service-doc | text/plain, text/html, text/markdown, text/pdf | Linked files MUST use the UTF-8 character encoding. hreflang is mandatory for text/plain and text/markdown. | Human readable documentation. |

himorin commented 1 year ago

Hi @aphillips, could you kindly take the time to have a look at this?