golang / go

The Go programming language
https://go.dev

x/net/idna: apply nameprep normalization algorithm #16501

Closed mna closed 7 years ago

mna commented 7 years ago

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?
go version go1.7rc1 darwin/amd64
  1. What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
  1. What did you do? If possible, provide a recipe for reproducing the error. A complete runnable program is good. A link on play.golang.org is best.

https://play.golang.org/p/zS-UR4WhIx

package main

import (
    "fmt"

    "github.com/DanielOaks/go-idn/idna2003"
    "golang.org/x/net/idna"
)

func main() {
    in1 := "www.\u00e9tat.com"  // e-acute in one rune
    in2 := "www.e\u0301tat.com" // e-acute in two runes, "e" + combining acute

    // golang.org/x/net/idna: the two inputs encode differently
    got1, err1 := idna.ToASCII(in1)
    got2, err2 := idna.ToASCII(in2)
    fmt.Println(got1, err1) // www.xn--tat-9la.com <nil>
    fmt.Println(got2, err2) // www.xn--etat-vvc.com <nil>

    // github.com/DanielOaks/go-idn/idna2003: both inputs encode the same
    got1, err1 = idna2003.ToASCII(in1)
    got2, err2 = idna2003.ToASCII(in2)
    fmt.Println(got1, err1) // www.xn--tat-9la.com <nil>
    fmt.Println(got2, err2) // www.xn--tat-9la.com <nil>
}
  1. What did you expect to see?

idna.ToASCII should normalize its Unicode input before encoding it to Punycode (see https://en.wikipedia.org/wiki/Internationalized_domain_name, section "ToASCII and ToUnicode": "ToASCII will apply the Nameprep algorithm, which converts the label to lowercase and performs other normalization, and will then translate the result to ASCII using Punycode").

The golang.org/x/net/idna package does not seem to perform that normalization step, while e.g. the userspace github.com/DanielOaks/go-idn package does.

So running idna.ToASCII on www.état.com and on www.e\u0301tat.com should (if I understand IDNA correctly) return the same Punycode form: www.xn--tat-9la.com.

  1. What did you see instead?

The userspace package correctly returns www.xn--tat-9la.com for both inputs, but x/net/idna returns "www.xn--tat-9la.com" and "www.xn--etat-vvc.com".
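
For readers wondering why the two inputs encode differently, here is a minimal standard-library sketch that lists the code points of each label. The helper name codePoints is made up for illustration and is not part of either package discussed here.

```go
package main

import "fmt"

// codePoints is a hypothetical helper that returns the runes of s
// formatted as U+XXXX, so composed and decomposed forms can be compared.
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("U+%04X", r))
	}
	return out
}

func main() {
	composed := "\u00e9tat"    // e-acute as a single rune
	decomposed := "e\u0301tat" // "e" followed by a combining acute accent

	fmt.Println(codePoints(composed))   // [U+00E9 U+0074 U+0061 U+0074]
	fmt.Println(codePoints(decomposed)) // [U+0065 U+0301 U+0074 U+0061 U+0074]
	fmt.Println(composed == decomposed) // false
}
```

The two labels render the same but are different rune sequences, so an encoder that does not normalize first will see them as different inputs.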

mna commented 7 years ago

My bad, it seems that RFC 5891 ("Internationalized Domain Names in Applications (IDNA): Protocol") obsoletes the Nameprep RFC 3491 ("Nameprep: A Stringprep Profile for IDN") and states in "Appendix A. Summary of Major Changes from IDNA2003":

Remove the mapping and normalization steps from the protocol and have them, instead, done by the applications themselves, possibly in a local fashion, before invoking the protocol.

So I guess x/net/idna does the right thing and it is up to the caller to normalize or not. That means the caller has to know whether a domain in non-normalized form is equivalent to one in normalized form, and I have no idea whether it is (maybe it is inconsistent in the wild: www.\u00e9tat.com and www.e\u0301tat.com may or may not be registered as separate domains?).

If anyone knows about that last part, I'd love to know (it would be very helpful for the purell normalization package that I maintain), but otherwise this is not an issue for the idna package, so I'll close it.

mna commented 7 years ago

Re-nevermind that last part; RFC 5891 states that:

By the time a string enters the IDNA registration process as described in this specification, it MUST be in Unicode and in Normalization Form C (NFC)
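
Since normalization is left to the caller, one way to picture the step is "compose first, then encode". The sketch below is only a toy under stated assumptions: composeEAcute is a hypothetical helper that hard-codes a single composition pair ("e" + U+0301 to U+00E9) and is not an NFC implementation. A real program would more likely call norm.NFC.String from golang.org/x/text/unicode/norm before idna.ToASCII; the standard library alone is used here to keep the example self-contained.

```go
package main

import (
	"fmt"
	"strings"
)

// composeEAcute is a hypothetical, deliberately tiny stand-in for NFC:
// it only rewrites "e" + U+0301 (combining acute) to U+00E9.
// Real NFC handles far more than this one pair.
func composeEAcute(s string) string {
	return strings.ReplaceAll(s, "e\u0301", "\u00e9")
}

func main() {
	in1 := "www.\u00e9tat.com"  // already composed
	in2 := "www.e\u0301tat.com" // decomposed

	fmt.Println(in1 == in2)                // false
	fmt.Println(in1 == composeEAcute(in2)) // true
}
```

Once both inputs are in the same composed form, they are the same string, so any Punycode encoder would produce the same output for both.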