Mechanism for encoding *direction* metadata may need more work

aphillips commented 3 years ago

6.4.2. Language and Direction Encoding https://www.w3.org/TR/webauthn-2/#sctn-strings-langdir

The second consists of a single code point which is either U+200E (“LEFT-TO-RIGHT MARK”), U+200F (“RIGHT-TO-LEFT MARK”), or U+E007F (“CANCEL TAG”). The first two can be used to indicate directionality but SHOULD only be used when necessary to produce the correct result. (E.g. an RTL string that starts with LTR-strong characters.) The value U+E007F is a direction-agnostic indication of the end of the language tag.

The mechanism for indicating base direction makes the I18N working group concerned for multiple reasons:

- This is a unique and thus unproven encoding mechanism. It requires string introspection that would likely produce errors, especially since authenticators are expected to consume these strings naively.
- Separate bidi metadata fields are preferred to inline metadata (see #1643).
- Bidi metadata values are preferred to using bidi control characters as the actual value. We recommend using strings such as ltr or rtl (appropriately decorated as metadata or set off from the content). Using strings makes the value visible when editing the content and easier to debug than invisible controls.
- RLM/LRM are strongly directional characters and should precede the string, since in many cases that would produce the correct rendering. This is independent of whether the language identification should appear at the start or end.
  - If the RLM/LRM appears at the end of the string, replacing the CANCEL TAG character, it might affect the display of any string immediately concatenated onto a naive display of the value.
  - Because RLM/LRM are normal bidi controls, if added to the start of the string, it is impossible to determine whether they are part of the data or were added by the implementation. There is a potential danger that implementations would add extra characters as a result.

Regardless of the direction metadata mechanism, this section should include a health warning to consumers to present language data in a bidi-isolating context.
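As a rough illustration of what such a health warning asks of consumers, here is a minimal Python sketch. It assumes the language tag is stored as Unicode tag characters immediately before the final code point; the function names `split_metadata` and `isolate` are hypothetical and come from neither the spec nor any library.

```python
from typing import Optional, Tuple

LRM, RLM, CANCEL_TAG = "\u200e", "\u200f", "\U000e007f"
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"


def split_metadata(value: str) -> Tuple[str, Optional[str]]:
    """Return (display text, direction) for a stored value (hypothetical helper)."""
    if not value or value[-1] not in (LRM, RLM, CANCEL_TAG):
        # No trailing marker: the value is unannotated or was truncated.
        return value, None
    direction = {LRM: "ltr", RLM: "rtl"}.get(value[-1])  # CANCEL TAG -> None
    end = len(value) - 1
    # Assumption: the language tag sits just before the marker, encoded as
    # Unicode tag characters (U+E0000..U+E007F). Strip those as well.
    while end > 0 and 0xE0000 <= ord(value[end - 1]) <= 0xE007F:
        end -= 1
    return value[:end], direction


def isolate(text: str, direction: Optional[str]) -> str:
    """Wrap text in a bidi isolate so it cannot affect surrounding text."""
    # Use LRI/RLI when the direction is known, FSI (first-strong) otherwise.
    opener = {"ltr": LRI, "rtl": RLI}.get(direction, FSI)
    return opener + text + PDI


# Usage: a Hebrew name, a tag-encoded "he", and a trailing RLM.
stored = "\u05e9\u05dc\u05d5\u05dd" + "\U000e0001\U000e0068\U000e0065" + RLM
text, direction = split_metadata(stored)
display = isolate(text, direction)
```

In an HTML context the equivalent would be a `dir` attribute or a `<bdi>` element rather than isolate control characters; the point is only that the value is displayed isolated from whatever surrounds it.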

agl commented 3 years ago

A unique encoding should be expected here since the context is quite bespoke: a single binary field, stored on limited external hardware that can truncate the string at an arbitrary byte boundary >= 64 bytes.

The reason for putting the RLM/LRM at the end is to a) clearly mark them as not part of the original string and b) to provide truncation indication as noted in https://github.com/w3c/webauthn/issues/1645
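To make point (b) concrete, here is a small Python sketch of how a trailing marker can double as a truncation indicator. The 64-byte limit, the helper names, and the choice to drop an incomplete trailing UTF-8 sequence on decode are all assumptions for illustration, not behaviour required by the spec.

```python
from typing import Tuple

LRM, RLM, CANCEL_TAG = "\u200e", "\u200f", "\U000e007f"
TERMINATORS = (LRM, RLM, CANCEL_TAG)


def authenticator_store(value: str, limit: int = 64) -> bytes:
    """Simulate an authenticator that cuts at an arbitrary byte boundary."""
    return value.encode("utf-8")[:limit]


def decode_stored(raw: bytes) -> Tuple[str, bool]:
    """Return (text, intact). intact is False if the end marker is missing."""
    # errors="ignore" drops a multi-byte sequence that was cut in half.
    text = raw.decode("utf-8", errors="ignore")
    return text, text.endswith(TERMINATORS)


short = decode_stored(authenticator_store("alice" + LRM))
too_long = decode_stored(authenticator_store("a" * 70 + LRM))
cut_mid_marker = decode_stored(authenticator_store("a" * 62 + LRM))
```

Here `short` keeps its marker, so the trailing metadata can be trusted. In the other two cases the marker is gone (or cut in half and dropped), which tells the consumer to disregard whatever metadata fragment remains. The sketch does not handle the corner case where the original text itself contains one of these code points at the cut position.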

I said on the call of 2021-07-14 that I would revert the changes to this section but, having reviewed the issues filed, I no longer think that's the correct direction (no pun intended). I don't see clear alternatives presented that conform with the limitations of the context. It's completely understood that it would be nice to have separate fields for this metadata, but that's not the reality that we're faced with. There are millions of security keys out in the world that don't work that way.

Some of the filed issues note non-breaking screwups in the description that should be fixed. But I don't yet see a better idea for the overall structure of encoding this information.

aphillips commented 2 years ago

In an email reply to @wseltzer I remarked:

When we last spoke, I took an action item to produce a PR with suggested text. I prepared that PR and you can find the text of it in my fork of webauthn. The only proposed requirement that my design doesn't address is a terminating character or sequence to indicate if truncation has occurred. The serialization suggested was taken from JSON-LD (in an attempt to avoid a proliferation of serialization schemes in the world).

I brought this proposal to the I18N WG and the feeling of the working group was that we shouldn't be proposing novel methods of encoding language and direction--that instead we should provide guidelines and then let the working group address that with text. The WG feels that separate metadata fields are, of course, preferable, but we understand why that's probably not possible.

In response to @agl's comments above, I could see using the RLM/LRM character instead of the ASCII encoding I propose as it would serve as both a direction indicator and truncation marker. These two characters are just strongly-directional invisible characters and so don't bring any display tampering risks. On the other hand, they have to be mapped, rather than just applying them to fields (that often expect the ASCII sequence).

The reason I suggest using ASCII sequences for separators and language tags is that, due to UTF-8's encoding characteristics, they are the most compact representation. We've previously discussed why postfixing the values is better for constrained storage devices. Any implementing system would have to understand the format and remove the additions (the fallback for older systems of course is to display the sequences as "garbage"). Alternative separators to my proposal of ^^^ would be acceptable. Some 3-byte Unicode code points might be good for this, notably U+FFFC (object replacement character) or perhaps the BOM (U+FEFF).
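The size argument can be checked directly. The following Python snippet measures the UTF-8 length of the separators and markers mentioned in this thread; the dictionary and its labels are just for illustration.

```python
# UTF-8 byte cost of the candidates discussed in this thread.
candidates = {
    "^^^ separator": "^^^",
    "U+FFFC": "\ufffc",
    "U+FEFF (BOM)": "\ufeff",
    "U+200E (LRM)": "\u200e",
    "U+E007F (CANCEL TAG)": "\U000e007f",
    "ASCII 'en'": "en",
    "tag-character 'en'": "\U000e0065\U000e006e",
}

sizes = {label: len(text.encode("utf-8")) for label, text in candidates.items()}

for label, size in sizes.items():
    print(f"{label}: {size} bytes")
```

ASCII costs one byte per character, BMP code points such as U+FFFC, U+FEFF and LRM cost three bytes each, and tag characters cost four bytes each. A two-letter language tag is therefore 2 bytes in ASCII versus 8 bytes in tag characters, and a single 3-byte separator costs the same as `^^^`.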

Ultimately, encoding metadata inside your strings is less satisfying than encoding it in metadata fields or in a data structure meant for the purpose. However, retrofitting such to an existing spec is hard.

agl commented 2 years ago

Thanks for that and sorry for the delay—I've been on vacation.

I think we want a truncation indicator so that we know when to disregard the trailing metadata. If the RLM/LRM/CANCEL TAG is acceptable then that certainly works. Else any terminator, like a period, at the end would do. Or else a non-opinionated alternative to "rtl"/"ltr" that could always be included when no direction override was needed would serve. (E.g. "?".)

The "^^^" separator would suffer from truncation issues too and could lead to stray "^" characters appearing in the UI. Switching "^^^" for the BOM, as suggested, would address that.
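A quick Python sketch of the difference, under stated assumptions: the layout `text + separator + metadata` is a simplified stand-in for the proposed serialization, the 64-byte limit is arbitrary, and the consumer is assumed to drop an incomplete trailing UTF-8 sequence when decoding.

```python
def stored_then_decoded(text: str, separator: str, meta: str, limit: int = 64) -> str:
    """Hypothetical round trip: serialize, truncate at a byte boundary, decode."""
    raw = (text + separator + meta).encode("utf-8")[:limit]
    # Assumption: the consumer silently drops a cut-off multi-byte sequence.
    return raw.decode("utf-8", errors="ignore")


name = "a" * 62  # leaves room for only 2 more bytes before the cut

with_carets = stored_then_decoded(name, "^^^", "rtl")
with_bom = stored_then_decoded(name, "\ufeff", "rtl")
```

With the ASCII separator, two of the three carets survive the cut and would be shown to the user. With the BOM, the cut lands inside a single 3-byte code point, so the fragment is not valid UTF-8 and disappears on decode. If the cut falls exactly after a complete U+FEFF, the code point survives but is zero-width, so a consumer would still want to strip it.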

aphillips commented 1 month ago

See comment

w3c / webauthn

Mechanism for encoding direction metadata may need more work #1644