square / okhttp

Square’s meticulous HTTP client for the JVM, Android, and GraalVM.
https://square.github.io/okhttp/
Apache License 2.0
45.87k stars 9.16k forks source link

Add support for IDNA 2008 (RFC 5891) #6910

Closed j-bernard closed 1 year ago

j-bernard commented 3 years ago

I'm aware of issue https://github.com/square/okhttp/issues/1615 but I think there is an easier way to fix this.

There is an trustworthy IDN implementation that actually offers IDNA 2008 compatibility, ICU4J, see IDNA class.

The Android platform already replaced the java.net.IDN methods with ICU4J however they kept using the static methods that are deprecated as only compliant with IDNA 2003.

See my bug report on Android issue tracker for more details and snippet to correctly use ICU4J with the right flags.

yschimke commented 3 years ago

If this is part of the public Android APIs, we could potentially use the better implementation on Android, while leaving JDK as is.

How common are these domains? What are some examples that are currently broken?

image
j-bernard commented 3 years ago

See https://unicode.org/reports/tr46/#IDNAComparison for differences between the new and old standards.

I don't have statistics about these domains but at least one pretty common German character (Eszet: ß) is impacted. I found a domain that seems to have been created especially to illustrate the problems of non-compliance with IDNA 2008, see this post: http://straße.de/. You won't get the same page with Google Chrome (IDNA 2003) and Firefox (IDNA 2008).

yschimke commented 3 years ago

@j-bernard Care to take a look at https://github.com/square/okhttp/pull/6912

What flags should I be passing to getUTS46Instance?

It seems this is safe down to Android 24, is that your reading?

j-bernard commented 2 years ago

@yschimke thanks, I'll have a look at the PR.

You should pass the following flags IDNA.CHECK_BIDI, IDNA.CHECK_CONTEXTJ, IDNA.CHECK_CONTEXTO, IDNA.USE_STD3_RULES and depending on the way you convert IDNA.NONTRANSITIONAL_TO_ASCII or IDNA.NONTRANSITIONAL_TO_UNICODE.

Android doc seems to say API level 24 but I did not conduct any test on old versions.

yschimke commented 2 years ago

Going to close this out as the strictness of IDNA 2008 is likely to cause more visible issues than this solves, particularly as this isn't uniformly supported or implemented by clients and servers.

arnt commented 1 year ago

JFYI: IDNA2008 solves more problems than it creates. When I saw this PR I became curious and checked.

I checked about 1.5m IDN domains in more than a thousand TLDs. When you have an IDNA2003 implementation, IDNA2008 maintains compatibility with >99,999% of IDN domains, gains compatibility with the ones that din't work yet in IDNA2003, and breaks almost nothing. It's difficult to judge precisely due to the existence of test domains, but saying that it breaks one domain in a million isn't unreasonable.

The main domain broken is ≠.net, which is broken because IDNA says that slashes aren't allowed and IDNA2008 extends that to ban composed codepoints that decompose to slashes. I'm not sure whether that domain's a test domain. If not, then moving to IDNA2008 would break use of it.

There's also ·.shop, a dozen or more than combine latin letters and a CJK hyphen, and some that couldn't be entered here because they're not legal unicode. Those can't be shown on-screen, are forbidden by IDNA2008, but are sort of legal according to IDNA2003.

So: don't worry about breaking anything with a move to IDNA2008. Wanting to stay with the base platform's IDN implementation is IMO a very good argument against a move. Worry about breakage is not.

I can rerun the analysis later, if anyone wants up-to-date results.

swankjesse commented 1 year ago

@arnt great timing, thanks for the explanation. I’ve been doing HTTP stuff for a long time but I’m new to IDN so advice here is very welcome.

Any advice on how I could source my own list of 1.5M IDNs for testing? Good tests make it much easier to write good code.

arnt commented 1 year ago

My advice is to get a CZDS account and not bother with anything else. You can get more zone files if you approach each registry separately, but using just the CZDS provides a lot of domains for little effort. ICANN also provides code to download the files in bulk.

FWIW I'll need more zone files for various work chores, and will rerun my analysis as I get them. But I don't expect to uncover many new problems. A problem has to be very rare indeed to not show up in the first million cases.

Your next challenge will be to use IDNA2003 and -2008 in the same program at the same time, with no cross-contamination. I decided on ICU4J.

swankjesse commented 1 year ago

Done! See https://github.com/square/okhttp/issues/7008

arnt commented 1 year ago

I looked at the code recently, great work and thanks. And while it was perhaps not your top priority, you made a frustrating day end well for me ;)

I had to think a little about whether your choice to use nontransitional is right, but overall I think it is.