Email address complexity - for next charter

owner:resnick@episteme.net type_enhancement | by dougfoster.emailstandards@gmail.com

The From header specification has become so complex that it is extraordinarily difficult to parse reliably.

The first complexity comes from necessary internationalization support. From headers may require character set conversion into Unicode. Even though only UTF-8 encoding is RFC-compliant, a typical implementation should be prepared to convert any properly labelled character set encoding into Unicode. Character set encoding may be applied to all or parts of the From header text, so the encoded and non-encoded sections must be reassembled into a single Unicode string. Once all of the character set encoding is resolved, the address is ready to be parsed.
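As a minimal sketch of the reassembly step described above, Python's standard `email.header` module can decode a header that mixes an encoded section with plain text. The sample header value and the `decode_from_header` helper name are made up for illustration, and a real filter would also need to handle unknown or mislabelled character sets.

```python
from email.header import decode_header, make_header


def decode_from_header(raw: str) -> str:
    """Resolve RFC 2047 encoded-words and return one Unicode string.

    decode_header() splits the value into (chunk, charset) pairs;
    make_header() stitches the chunks back together in order.
    """
    return str(make_header(decode_header(raw)))


# Hypothetical header: the friendly name is partly ISO-8859-1 encoded,
# while the address itself is plain ASCII.
raw = "=?ISO-8859-1?Q?Andr=E9?= Pirard <pirard@example.org>"
decoded = decode_from_header(raw)
print(decoded)
```

Only after this step does it make sense to look for commas, quotes, and angle brackets, because the decoded text can itself contain those characters.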

From headers permit a list of address terms, where each term contains an email address which is optionally preceded by a friendly name. The interaction of various delimiters is inherently difficult. Address terms should be delimited with commas, email addresses should be delimited with angle brackets, and friendly names should be delimited with quotes. However, email addresses can contain quoted local-parts. Quoted strings for both friendly name and local-part can contain angle brackets and commas. I am uncertain whether a quoted string can contain an embedded quote. All of this complexity makes delimiter boundary detection difficult and error-prone.
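To make the delimiter interaction concrete, here is a small hand-written sketch that splits an already-decoded From value into address terms. It is not a full RFC 5322 parser: the function name `split_address_terms` is hypothetical, parenthesized comments are ignored, and it assumes that a backslash inside a quoted string escapes the next character, which is how an embedded quote would be carried.

```python
def split_address_terms(header: str) -> list[str]:
    """Split on commas that are outside quotes and outside angle brackets."""
    terms: list[str] = []
    current: list[str] = []
    in_quotes = False   # inside "..."
    in_angle = False    # inside <...>
    escaped = False     # previous character was a backslash inside quotes

    for ch in header:
        if escaped:
            # Character after a backslash is taken literally.
            current.append(ch)
            escaped = False
            continue
        if in_quotes and ch == "\\":
            current.append(ch)
            escaped = True
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if ch == "<":
                in_angle = True
            elif ch == ">":
                in_angle = False
            elif ch == "," and not in_angle:
                # A real term boundary: flush what we have collected.
                terms.append("".join(current).strip())
                current = []
                continue
        current.append(ch)

    tail = "".join(current).strip()
    if tail:
        terms.append(tail)
    return terms


sample = (
    '"Doe, Jane <VP>" <jane@example.com>, '
    '"a,b"@example.org, '
    '"Say \\"hi\\", ok" <x@example.net>'
)
terms = split_address_terms(sample)
```

Even this toy version needs three pieces of state to find a comma that really is a separator, which is the kind of detail that independently developed products can easily get wrong in different ways.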

A working implementation must also be prepared to work around non-compliant implementations that omit delimiters on either the friendly name or the email address or both. When the delimiters are present on both Friendly Name and Email Address, the white space between them, which also acts as a delimiter, may be missing. When delimiters are missing, the parser must use position-dependent interpretation, but the optional nature of the friendly name introduces complexity even into position-dependent interpretation. If I detect a sequence of two undelimited email addresses, does this represent two From addresses or one Friendly Name and one From address?
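The ambiguity in the last question can be illustrated with a rough classifier for a single address term. This is only a sketch under loose assumptions: `ADDR_LIKE` is a hypothetical "looks like an address" pattern rather than the RFC 5322 grammar, quoted local-parts are not handled, and the category names are invented for this example.

```python
import re

# Hypothetical, deliberately loose pattern: something@something with no
# whitespace or delimiter characters on either side of the "@".
ADDR_LIKE = re.compile(r'[^\s<>",;]+@[^\s<>",;]+')


def classify_term(term: str) -> str:
    """Label one address term by how well delimited it is."""
    term = term.strip()
    # Angle brackets around something containing "@": properly delimited.
    if re.search(r"<[^<>]*@[^<>]*>", term):
        return "delimited"
    hits = ADDR_LIKE.findall(term)
    if not hits:
        return "no-address"
    if len(hits) == 1:
        # Either the whole term is the address, or text precedes/follows it.
        return "bare-address" if hits[0] == term else "name-then-address"
    # Two or more undelimited address-like tokens: two From addresses,
    # or a friendly name that happens to look like an address?
    return "ambiguous"


print(classify_term("ceo@bank.example attacker@evil.example"))  # ambiguous
```

A product could log or quarantine terms that land in the "ambiguous" bucket instead of silently choosing one interpretation.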

The From header requires correct parsing because of its multiple uses:

To determine the default Reply-To address(es).
To apply DMARC policy.
To display Friendly Name and From address in the mail user interface, usually as independent strings that are displayed in different contexts.

Many email environments use multiple, independently developed products. Each of these must implement its own parsing logic, and the likelihood of consistent implementations decreases as the number of products increases.

Upon understanding this problem, I finally realize why some email filtering products do not implement DMARC or do not evaluate the From address at all.

Much of this complexity extends to any context that uses an email address or an identifier that looks like an email address. Some of the most important:

I think an SMTP Mail From address should be internationalized using UTF-8 encoding on the local-part, where punycode is limiting and not required, with punycode on the domain part. If the local-part requires quotes, the quotes could be inside or outside the encoding. I don't recall if this is spelled out clearly in any RFC. And maybe I just don't understand.

Email addresses also appear in the "i=" clause of a DKIM signature, which affects message validation.
Email addresses are captured in the Authentication-Results header and its components, which are used for communicating message validation results within an administrative domain. These are rolled into ARC headers, which are used to communicate results between administrative domains.
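For the SMTP Mail From case above, the split I have in mind (UTF-8 on the local-part, punycode on the domain) can be sketched as follows. This assumes Python's built-in `idna` codec, which implements the older IDNA 2003 rules, and the `to_wire_forms` helper and sample address are hypothetical. It does not settle the question of where quotes belong.

```python
def to_wire_forms(local_part: str, domain: str) -> tuple[bytes, str]:
    """Return (UTF-8 bytes of the local-part, ASCII form of the domain).

    The local-part is left as UTF-8 because punycode is defined for
    domain labels, not for local-parts.
    """
    utf8_local = local_part.encode("utf-8")
    # Built-in codec: converts each non-ASCII label to its "xn--" form.
    ascii_domain = domain.encode("idna").decode("ascii")
    return utf8_local, ascii_domain


# Hypothetical internationalized address: josé@bücher.example
local_bytes, ascii_domain = to_wire_forms("jos\u00e9", "b\u00fccher.example")
print(local_bytes, ascii_domain)
```

Any component that compares domains, for example for DMARC alignment, would need to normalize both sides to the same form before comparing.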

Remedies

I am a little overwhelmed by the present situation, but these remedies come to mind:

Products need to provide system administrators with data about difficult-to-parse addresses, options about which non-compliant formats to allow or block in general, and whitelisting mechanisms to accept formats from highly trusted sources even if the same problems are rejected in the general case.
I suggest that multiple-address From headers are not actually needed and should be deprecated. I note, in support of this suggestion, that Gmail rejects multiple-from addresses, and apparently began doing so when it implemented DMARC.
I suggest that local-part addresses do not require white space, commas, or angle brackets, and these should be prohibited.
I suggest an informational RFC is needed which collects all of the address complexity considerations into a single document.
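A rough sketch of how the first three remedies might look as a filter-side audit, run after a From header has already been parsed. Everything here is hypothetical: the `Mailbox` record, the `audit_from` function, the finding labels, and the idea of keying the allow-list on domain are illustrations, not a description of any existing product.

```python
from dataclasses import dataclass

# Characters that, per the suggestion above, a local-part should not need.
FORBIDDEN_LOCAL_CHARS = set(" \t,<>")


@dataclass
class Mailbox:
    display_name: str
    local_part: str   # unquoted local-part
    domain: str


def audit_from(mailboxes: list[Mailbox],
               trusted_domains: frozenset[str] = frozenset()) -> list[str]:
    """Return findings an administrator could report on, allow, or block."""
    findings: list[str] = []
    if len(mailboxes) > 1:
        findings.append("multiple-from")
    for mb in mailboxes:
        # Allow-list: skip character checks for highly trusted sources.
        if mb.domain.lower() in trusted_domains:
            continue
        if FORBIDDEN_LOCAL_CHARS & set(mb.local_part):
            findings.append(f"local-part-chars:{mb.local_part!r}")
    return findings


report = audit_from([
    Mailbox("Jane Doe", "jane", "example.com"),
    Mailbox("", "a,b", "example.org"),
])
print(report)  # ['multiple-from', "local-part-chars:'a,b'"]
```

Whether a finding leads to rejection, quarantine, or only a log entry would be a per-site policy choice.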

My perspective is skewed by operating in a U.S.-based, English-only environment, on a relatively small mail system. In my lifetime, I have never actually seen a quoted local-part name or a multiple-from address. That very absence leaves me worried that my filtering products are poorly prepared to deal with surprises, and that a nation-state actor will find ways to use my vendor's untested logic to create zero-day attacks.

Issue migrated from trac:57 at 2022-01-31 12:38:41 +0000
