w3c / i18n-activity

Home pages, charters, style-guides, and similar documents related to the W3C Internationalization Activity.

input type=email change proposals #607

Closed r12a closed 2 years ago

r12a commented 6 years ago

4.10.5.1.5 E-mail state (type=email) https://html.spec.whatwg.org/#e-mail-state-(type=email)

The W3C HTML 5.3 spec has already made changes to the scope of type=email forms. For history on those discussions see

This also relates to a bug at https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489

I'm raising this issue to gather thoughts about what we should propose that the WhatWG change in their version of the spec, which doesn't include the W3C changes.

ICANN is asking for the restriction on Unicode in email to be removed, because there are users out there who have functioning email addresses which use Unicode on either or both sides of the @ sign. See a white paper by them. I believe they are also concerned that content developers may use type=email forms for id entry, where ids are EAI addresses.

Currently the WhatWG spec limits the internal representation of emails input using type=email to ASCII only. Note that the clue to this being the internal representation only is the link on the word 'value', which points to a part of the spec that says:

"A control's value is its internal state. As such, it might not match the user's current input." (https://html.spec.whatwg.org/#concept-fe-value)

As i understand the WhatWG spec, the user can type anything into such a field and the browser should use punycode to convert non-ASCII characters, on both sides of the @ sign, to ascii for storage and transmission. The spec also says "Constraint validation: While the user interface is representing input that the user agent cannot convert to punycode, the control is suffering from bad input."

The problems i see with the current spec text are as follows:

  1. punycode is only a relevant transformation for the domain name, not for Unicode text on the left side of the @. I think this needs to be clarified in the spec.

  2. Furthermore, conversion of the left side is not mentioned, although some transformation is apparently required in order to convert Unicode characters to ascii internally. (Given that the spec specifically mentions a punycode transformation for the IDN (which is useful because it is a standardised approach), it seems to me that it would be equally useful to specify the transformation to be applied to the left side for conversion to ascii (eg. percent-encoded utf-8), if ascii is actually needed.)

  3. I still have a question in my mind about whether it is actually necessary, or indeed appropriate, to transform the left-hand side to ascii. I don't know enough about email addresses to answer that question.

  4. It seems that in general browsers are not following the spec, since they are not behaving as expected if the user types email addresses containing Unicode into the form field. During TPAC i created some small tests[1] that show browsers preventing users actually using Unicode in email addresses for type=email fields. Presumably, one of two things should be done in that case: (a) change the spec to match browser behaviour, or (b) raise bugs against the browsers to get them to conform to the spec. The former approach would take us in the opposite direction from what ICANN wants.

  5. In the bugzilla bug linked to above, people such as John Klensin are arguing that the browser shouldn't concern itself with converting the form entry anyway, since email systems do that.

  6. Others in the bugzilla thread suggest that there should be different types of form, ie. type=email that accepts EAI, and a type=ascii-only that people can use if they have a particular reason for limiting to ascii.

  7. I assume that if a user types an internationalized email address in a field that is looking for an id, rather than sending email, then conversion to punycode or any other escaped form is not appropriate either. Perhaps you will say that the developer shouldn't have used input type=email in this case. If so, ...

  8. ... i would argue that the scope of use for this form field type really needs to be made much clearer in the spec, so that developers are clearer about when and when not to use it.

I'd like to see the spec updated to take into account the relevant points above, but regardless of any of those changes, i'd also like the spec to carry a (probably informative) description of when type=email should and should not be used (and if the expectation is that content developers should use vanilla input forms for certain things, advice to that effect).

It would probably also be useful for ICANN to express their use cases as part of the discussion.


[1] Tests:

An address like ascii@ascii.com behaves as expected on the 4 major browsers.

The address abc@सम्यूर्ण.com is blocked on Chrome, Safari, and Edge, but does work on Firefox.

The address सम्यूर्ण@सम्यूर्ण.com is blocked on Firefox, Chrome, Safari, and Edge.

Note that Chrome produces an error message that specifically points to non-ASCII characters being unacceptable in email addresses.
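
The results above are consistent with a purely lexical, ASCII-only check on the raw input. As a minimal sketch (Python, not any browser's actual code), the pattern below is transcribed from the "valid email address" definition in the WhatWG spec; the name `matches_whatwg_pattern` is made up for this illustration.

```python
import re

# Transcribed from the WhatWG "valid email address" pattern. Every character
# class is ASCII-only, so a non-ASCII character on either side of "@" fails.
WHATWG_EMAIL = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)


def matches_whatwg_pattern(address: str) -> bool:
    """Hypothetical helper: True if the raw string matches the ASCII-only pattern."""
    return WHATWG_EMAIL.fullmatch(address) is not None


# The three addresses from the tests above, checked without any conversion.
results = {
    address: matches_whatwg_pattern(address)
    for address in ("ascii@ascii.com", "abc@सम्यूर्ण.com", "सम्यूर्ण@सम्यूर्ण.com")
}
```

Only the all-ASCII address matches. This checks the raw input only; whether a browser converts the domain before validating is a separate step that this sketch does not model.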


WHEN CREATING A NEW ISSUE DO SO ABOVE THIS PARAGRAPH, REPLACING THE PROMPTS, BUT LEAVE THIS PARAGRAPH INTACT AS WELL AS THE TEXT BELOW IT When this issue is raised in the github/bugzilla/mail of the WG that owns the spec, use the text above this para as the basis for that comment. Then edit this issue to remove this paragraph and ALL THE TEXT ABOVE IT. Replace the text 'link_to_issue_raised' below with a link to the place you raised the issue, but leave the remaining text below this para unaltered.

This is a tracker issue. Only discuss things here if they are i18n WG internal meta-discussions about the issue. Contribute to the actual discussion at the following link:

§ link_to_issue_raised

himorin commented 5 years ago

It seems the ICANN document mainly discusses the importance of IDNA rather than full EAI, except for one reference to research on how far existing systems (MTA domains) support these things (from IDNA only to full EAI). I suppose this reference is rather weak grounds for us to push this change as a proposal to the WhatWG.

As i understand the WhatWG spec, the user can type anything into such a field and the browser should use punycode to convert non-ASCII characters, on both sides of the @ sign, to ascii for storage and transmission.

I don't think this is correct, since the WhatWG spec says:

User agents may transform the value for display and editing; in particular, user agents should convert punycode in the domain labels of the value to IDN in the display and vice versa.

and punycode is specifically mentioned only for the domain labels. And yes, as in point 1 of the original comment on this issue, RFC 6530 says this about downgrading of the local part (in sec 8):

Mechanisms by which such addresses may be found or identified are outside the scope of these specifications as are decisions about the design of originating systems such as whether any required transformations are made by the user, the originating MUA, or the submission server.

so no transformation of the local part is allowed (as noted even in RFC 2821 sec 2.3.10):

Consequently, and due to a long history of problems when intermediate hosts have attempted to optimize transport by modifying them, the local-part MUST be interpreted and assigned semantics only by the host specified in the domain part of the address.

So, point 2 about the left side (local-part) is not correct. There SHALL be no conversion of the local-part by the UA; only the host specified in the domain part of the email address may interpret it.
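
To illustrate the distinction, here is a rough Python sketch that converts only the domain part to its punycode (ACE) form and passes the local-part through untouched. The function name `domain_to_ace` is hypothetical, and Python's built-in `idna` codec implements the older IDNA 2003 rules, so treat the output as indicative only.

```python
def domain_to_ace(address: str) -> str:
    """Hypothetical helper: punycode the domain labels, never the local-part."""
    # Split on the last "@" so that everything before it stays in the local-part.
    local_part, sep, domain = address.rpartition("@")
    if not sep or not local_part or not domain:
        raise ValueError(f"not of the form local-part@domain: {address!r}")
    # The "idna" codec encodes each dot-separated label (IDNA 2003 ToASCII).
    ace_domain = domain.encode("idna").decode("ascii")
    # The local-part is copied as-is: no escaping, no case folding.
    return f"{local_part}@{ace_domain}"


converted = domain_to_ace("abc@bücher.example")
# converted == "abc@xn--bcher-kva.example"
```

A non-ASCII local-part comes out of this function unchanged, which is the point: the sketch has no rule for it because none is defined.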

EAI (Email Address i18n) introduced SMTPUTF8 as a new ehlo-keyword for the SMTP EHLO command, to tell the connecting MTA that the target MTA supports extended Unicode EAI (along with some extensions to the email header part without using 8bitmime). There is no backward compatibility with non-supporting MTAs, which shall just return an error rather than try to deal with EAI. Looking at the current situation around SMTPUTF8: most large online email service providers, like Gmail/G-Suite or Office 365 (supported early this year), do support EAI; some MUAs support it, like Outlook (2016), but some do not, like Thunderbird (needs to be checked in bmo, but AFAIK). But SMTPUTF8 is supported by only a small number of MX zones, e.g. ~4% in .ru (Russian) even in Oct/2018. Considering this situation, and thinking from a web service developer's point of view, an option noted in point 6 seems feasible to me, like

For the 3rd point, RFC 6530 sec 7.1 notes:

When the local part of the address includes characters outside the ASCII character repertoire, use of ASCII-compatible encoding (ACE) [RFC3492] [RFC5890] in the domain part is discouraged to promote consistent processing of characters throughout the address.

klensin commented 5 years ago

Richard, I hope what I'm about to say has been clear from what I have said on the calls or previous discussions of related topics, but, just to get a comment on your "problems i see with the current spec text" points 2 and 3 into the file...

The specs produced by the IETF EAI WG (formally called, as those documents specify, "SMTPUTF8" because "EAI" is just the name of a now-closed WG) are extremely explicit that transformation of a local-part (the left side of the "@") to an all-ASCII form is not only not a requirement but prohibited except as part of the final delivery process. There is not only no need to convert to an internal ASCII form, there is no way to do so. Because many other aspects of mail addresses, headers, or handling interact with having non-ASCII local parts, an email origination and delivery path either entirely supports SMTPUTF8 or it doesn't. And, if it doesn't, and with the understanding that this isn't anything that can be tested lexically, the mail won't go through.
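
As a small illustration of the per-hop negotiation, the sketch below (Python) looks for the SMTPUTF8 keyword in an EHLO reply. The sample reply and the function name `ehlo_keywords` are invented for the example, and a positive result for one server says nothing about the rest of the delivery path.

```python
def ehlo_keywords(reply: str) -> set:
    """Hypothetical helper: collect the extension keywords from an EHLO reply."""
    keywords = set()
    lines = reply.splitlines()
    # The first line carries the server's domain and greeting, not a keyword.
    for line in lines[1:]:
        # Each line is "250-" (more follow) or "250 " (last), then keyword and parameters.
        if not line.startswith(("250-", "250 ")):
            continue
        parts = line[4:].split()
        if parts:
            keywords.add(parts[0].upper())
    return keywords


# Made-up reply, for illustration only.
SAMPLE_REPLY = (
    "250-mx.example.net greets client.example.org\r\n"
    "250-8BITMIME\r\n"
    "250-SIZE 35882577\r\n"
    "250 SMTPUTF8\r\n"
)

supports_smtputf8 = "SMTPUTF8" in ehlo_keywords(SAMPLE_REPLY)
```

With a live connection, Python's `smtplib.SMTP.has_extn("smtputf8")`, called after `ehlo()`, gives the same per-server answer.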

The barrier to any sort of ASCII-compatible encoding of the local part is that the mail transport protocol, SMTP, has been extremely flexible about the local part since its first stable version as RFC 821 in 1982. At least one reason for the flexibility is that, since nearly the dawn of the ARPANET, email has been used to transport information other than interpersonal messages for humans, and local-parts, as well as subject lines, have been used to carry instructions or metadata. Local parts that encapsulate the addresses of completely different mail systems on the other side of gateways pose similar problems. So, for example, some email systems are quite sure they know what a "%" means and it has to do with message routing, not hexadecimal encoding of inconvenient characters. Similarly, a hyphen or two, slashes, colons, etc., are as or more likely to be pieces of a command line for some system as an indication of a special encoding. The delivery system can interpret those characters any way it likes but there is no general way for an originating or relaying system to guess accurately at what the delivery system will do.

So the answer to your question about whether it "is actually necessary, or indeed appropriate, to transform the left-hand side to ascii" is "neither necessary nor appropriate".

john

Ponant commented 5 years ago

abc@सम्यूर्ण.com works on Chrome if you copy and paste it.

xfq commented 4 years ago

Related: https://github.com/w3c/i18n-activity/issues/778