haacked / feedback

Ask @haacked anything!

I read this when developing a commercial product - do i owe you anything :D #192

Closed adisciple50 closed 6 years ago

adisciple50 commented 6 years ago

Raise your hand if you know how to validate an email address. For those of you with your hand in the air, put it down quickly before someone sees you. It’s an odd sight to see someone sitting alone at the keyboard raising his or her hand. I was speaking metaphorically.

Before yesterday I would have raised my hand (metaphorically) as well. I needed to validate an email address on the server. Something I’ve done a hundred thousand times (seriously, I counted) using a handy dandy regular expression in my personal library.

This time, for some reason, I decided to take a look at my underlying assumptions. I had never actually read (or even skimmed) the RFC for an email address. I simply based my implementation on my preconceived assumptions about what makes a valid email address. You know what they say about assuming.

What I found out was surprising. Nearly 100% of regular expressions on the web purporting to validate an email address are too strict. It turns out that the local part of an email address, the part before the @ sign, allows a lot more characters than you’d expect. According to section 2.3.10 of RFC 2821, which defines SMTP, the part before the @ sign is called the local part (the part after being the host domain) and it is only intended to be interpreted by the receiving host…

Consequently, and due to a long history of problems when intermediate hosts have attempted to optimize transport by modifying them, the local-part MUST be interpreted and assigned semantics only by the host specified in the domain part of the address.

Section 3.4.1 of RFC 2822 goes into more detail about the specification of an email address (emphasis mine).

An addr-spec is a specific Internet identifier that contains a locally interpreted string followed by the at-sign character (“@”, ASCII value 64) followed by an Internet domain. The locally interpreted string is either a quoted-string or a dot-atom.
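
To make the dot-atom versus quoted-string distinction concrete, here is a minimal Python sketch (not the regular expression from the quoted post). It assumes a simplified reading of RFC 2822: comments, folding whitespace, and the obsolete syntax are ignored, the domain is not validated at all, and the name `local_part_kind` is made up for illustration.

```python
import re
from typing import Optional

# atext per RFC 2822 section 3.2.4: letters, digits, and a set of specials.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# dot-atom: runs of atext separated by single dots (no leading, trailing,
# or doubled dots).
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")
# Simplified quoted-string (an assumption of this sketch): printable ASCII
# or space between double quotes, with '"' and '\' allowed only when
# escaped by a backslash.
QUOTED_STRING = re.compile(r'"(?:[\x20\x21\x23-\x5b\x5d-\x7e]|\\[\x20-\x7e])*"')


def local_part_kind(address: str) -> Optional[str]:
    """Classify the local part as 'dot-atom' or 'quoted-string', else None."""
    # Split at the last '@' so that a quoted local part may contain '@'.
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return None
    if DOT_ATOM.fullmatch(local):
        return "dot-atom"
    if QUOTED_STRING.fullmatch(local):
        return "quoted-string"
    return None
```

Under these assumptions, `o'reilly+tag@example.com` is classified as a dot-atom and `"john smith"@example.com` as a quoted-string, while `a..b@example.com` is rejected. Many stricter patterns found online would reject the first two as well, which is the point the post is making.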