Closed ninapavlich closed 4 years ago
Going to summarize discussion from the PRs here for visibility:
There are basically two ways I think this can go. Either it stays as-is -- rejecting any email address with multiple @
characters in it -- or it needs to use a full RFC-compliant parsing library to get the local-part and domain of the address. There are a couple of those available for Python, with flanker looking the most comprehensive, though also seemingly unmaintained for Python 3.
If it continues to reject these addresses, and I think that's a perfectly acceptable choice given that both Django itself and HTML5-compliant web browsers already don't allow everything that the email RFCs would allow, then I think the check should be as simple as possible: count the number of @
characters in the address before performing the confusables checks, and if it's not exactly 1 then raise ValidationError
, probably with the default invalid message of django.core.validators.EmailValidator
.
A full patch for this would need tests to verify that an otherwise-valid address with multiple @
in it is rejected with the right message, and an update to the documentation for validate_confusables_email
to indicate that it will reject this type of address.
It might also be worth looking into other types of "technically valid" but problematic address formats and maybe even eventually writing a different validator to catch those.
I think I may take a different approach here, probably for django-registration 3.1. I'm already a bit frustrated by how lax Django's own email validator is, and I'm on record as being much more supportive of the email validation solution provided by HTML5 (for use in input type="email"
).
So I think what I'm going to do is drop in an extra validator that applies the HTML5 validation rule for email addresses which -- as far as I can tell -- does not allow addresses with multiple @
in them.
Writing a slightly long comment here to provide plenty of context so I can link to it and get some feedback from other people.
One thing django-registration tries to do is protect you against homograph attacks; these are attacks where someone tries to register a name (for the most general possible sense of "name", so including usernames, email addresses, domains, etc.) that looks similar or even identical to an already-existing one. Scam sites which use a non-Latin 'a' to visually resemble a well-known brand like PayPal or eBay are examples of this.
During the signup process, django-registration applies a validator to both the submitted username and the submitted email address to check for potential homograph attack, rejecting any that are what Unicode terms "mixed-script confusable" (contain characters from multiple scripts, and contain characters that appear in the Unicode Visually Confusable Characters database).
For usernames, this is easy. For email addresses, it is slightly more difficult, because many people don't get a choice of the domain for their email, so might end up with an address whose local-part is one script and whose domain is another. To handle this, the validator splits the address into local-part and domain, and considers each one separately.
Or, at least, it tries to: doing anything with an email address is difficult because of how ludicrously complex the grammar of email addresses is. The current iteration of the validator will choke on addresses which contain more than one @
(U+0040), even if those addresses are otherwise legal (and yes, it is possible to construct a legal email address containing more than one @
). I would like to fix this for the next release of django-registration.
Unfortunately, I don't know of any maintained open-source library, supported on all versions of Python django-registration needs to support, which is capable of performing an RFC-compliant parse of an addr-spec into local-part and domain. If there is one, I'd like to know about it (I haven't gone spelunking into the internals of the Python standard library's email
package, since so much of it is undocumented, but I admit it's possible there is a useful parser in there).
So one option is simply to ignore -- accept without attempting to validate -- any potential email address containing more than one @
, but this opens up the (admittedly quite small) risk of someone mounting a homograph attack using an address which is both visually confusable for an already-registered one, and somehow accomplishes this while using address-syntax tricks to make the validator bail out without performing any homograph check.
Another option would be to have the validator reject as invalid any address which does not contain exactly one @
. This would reject some valid addresses, but Django's own EmailValidator
-- which is already used during the signup process -- also rejects some valid addresses. This is practical, but inelegant; the homograph-attack validator should have a single responsibility, and should not also be deciding what is or isn't a syntactically-valid address.
A third option I'm currently considering is to add another validator in front of the homograph check, which would apply HTML5's email-address validation rule to the email address, and reject any which do not conform to HTML5's definition of an email address. HTML5 openly admits that its validation rule (which provides both normative ABNF and an informative regex) is a "willful violation" of the governing email RFC. But it does appear to reject the entire category of addresses that would cause trouble for the homograph validator, which means the homograph validator could then adopt the simple option of silently ignoring anything which does not contain exactly one @
, and trust that the HTML5 validator has already rejected problem values.
That is arguably a backwards-incompatible change, since some addresses that are accepted as valid by django-registration today would stop being accepted after the change. But I am and always have been willing to accept that kind of incompatibility in pursuit of improved validation and protection (the homograph validators themselves were backwards-incompatible in this sense when they were added).
I'd like to get django-registration 3.1 out sometime soon with a fix for this, but first I want to hear opinions and ideas on the right fix. I'm open to arguments for any of the options above, suggestions of practical parsing libraries that can handle the root problem, or alternative suggestions.
Only feedback I have that might be relevant is the W3C spec (as opposed to the WHATWG one) allows Unicode local parts etc: https://w3c.github.io/html/sec-forms.html#email-state-typeemail (as it refers to the RFC6531 changes to RFC5322).
The W3C version of the spec simply defers the definition of a syntactically-valid address to the RFC. The whole point of adopting the HTML5 validation rule is to break compatibility with the RFC and reject addresses the RFC considers valid.
Yes, I understood that, but I think there is/could be/should be a difference between multiple @s, quoted strings, comments, etc etc etc, and Unicode local parts. https://github.com/whatwg/html/issues/4562 was recently reopened to look at this again. Was just an idle thought.
The thing is, W3C's version of HTML5 just says "follow the RFC", and "follow the RFC" is the source of this issue -- if you know of a drop-in RFC-compliant parser which can split an addr-spec into local-part and domain, feel free to suggest one. But without such a parser, the WHATWG approach of willfully breaking the RFC, to eliminate many problematic edge-case but RFC-permitted values, is where I'm leaning. And only WHATWG's version does that, and only WHATWG's version provides even an informative machine-readable rule for validating. If and when WHATWG updates to support RFC6531-inspired addresses with a machine-readable rule, or if and when W3C provides a machine-readable rule in their version, I'd consider updating an "HTML5 email validator" in django-registration. Until then, though, the W3C approach simply is not useful here.
In the standard library (but only since 3.6 (3.3 provisionally)), you have: email.headerregistry.Address(addr_spec='email@address.here')
which splits up as you ask, but no Unicode local part support though (which is odd as the rest of the email
library supports it fine! Have reported upstream).
Perhaps of more interest, in the manner of this issue and the WHATWG spec: email_validator
: https://github.com/JoshData/python-email-validator - it doesn't allow quoted strings or IP address domain names, and does split including Unicode support in both local-part and domain. e.g.
from email_validator import validate_email
email = 'caf\u01e9@mydomain\u1234.tld'
print(validate_email(email, check_deliverability=False))
{
'local': 'cafǩ',
'domain': 'xn--mydomain-vk7a.tld',
'domain_i18n': 'mydomainሴ.tld',
'smtputf8': True,
'email': 'cafǩ@mydomainሴ.tld'
}
The WHATWG pattern is a reasonable compromise between the ASCII strings that RFC 5322 allows in an address and what real mail systems allow. I encourage you to use it.
The W3C spec is theoretically correct but bad news in practice. Unicode addresses, known as EAI, are an extension to the mail system and although most mail software has EAI support in the code, most actual mail systems don't have it turned on. A few big ones like Gmail and Outlook/Hotmail do, and it's not hard to turn it on in Postfix or Exim, but you have to do it. As someone else noted, the EAI support in Python libraries is pretty good but if you're going to enable EAI addresses, you need some way to test whether the underlying mail system supports them. Otherwise you're going to have strange bugs where the application and python code accept a valid EAI address (mine is yés@nø.sp.am) but mail to it fails when handed off from python code to the MTA.
Current master (which will become 3.1) is going with the WHATWG validator approach.
In validators.py, function validate_confusables_email(value), the function handles the case where the '@' sign is missing, but the one where there are multiple '@' signs in the string.
This is of course an edge case, but we ran into it on our server so I thought I would log a ticket since it's easy enough to fix.