google / guava

Google core libraries for Java
Apache License 2.0
50.13k stars 10.88k forks source link

InternetDomainName fails to validate URLs with hyphens at the end of a subdomain #1065

Open gissuebot opened 9 years ago

gissuebot commented 9 years ago

Original issue created by dimo414 on 2012-07-12 at 02:13 PM


The InternetDomainClass.from() method raises an IllegalArgumentException when passed a domain name with a hyphen at the end of a subdomain, which does seem to be a valid way to register a subdomain.

Test:     import java.io.IOException;     import java.net.HttpURLConnection;     import java.net.URL;

import com.google.common.net.InternetDomainName;

public class GuavaError
{
    public static void main(String[] args) throws IOException
    {
        URL url = new URL("http://www.vilnius-.hotelreservierung.de/");
        try{
            System.out.println(InternetDomainName.from(url.getHost()).parts());
        }catch(Exception e){
            e.printStackTrace(System.err);
        }

        HttpURLConnection con = (HttpURLConnection)url.openConnection();
        System.out.println(con.getResponseCode());
    }
}

Output:     java.lang.IllegalArgumentException: Not a valid domain name: 'www.vilnius-.hotelreservierung.de'         at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115)         at com.google.common.net.InternetDomainName.<init>(InternetDomainName.java:154)         at com.google.common.net.InternetDomainName.from(InternetDomainName.java:225)         at GuavaError.main(GuavaError.java:14)     200

Expected:     [www, vilnius-, hotelreservierung, de]     200

gissuebot commented 9 years ago

Original comment posted by wasserman.louis on 2012-07-12 at 02:21 PM


Hmmm. Do you have a source arguing that this is indeed valid syntax? RFC 952 appears to forbid hyphens at the end of a domain name, but I'm still trying to find specifications on valid subdomain names.


Labels: Package-Net, Type-Defect

gissuebot commented 9 years ago

Original comment posted by cpovirk@google.com on 2012-07-12 at 02:24 PM


InternetDomainClass.validateSyntax refers to RFC 1035, in which I see this:

""" <domain> ::= <subdomain> | " "

<subdomain> ::=

gissuebot commented 9 years ago

Original comment posted by dimo414 on 2012-07-12 at 02:25 PM


I don't have any reason to believe this is RFC compliant, but the URL is valid, in as much as a CNAME/A record exists for it, and there's a website successfully serving content from that URL.

gissuebot commented 9 years ago

Original comment posted by wasserman.louis on 2012-07-12 at 02:27 PM


Bleah. So I guess we have to decide whether to follow the RFC strictly?

gissuebot commented 9 years ago

Original comment posted by em...@soldal.org on 2012-07-12 at 02:28 PM


InternetDomainName.from(host) & InternetDomainName.strict(host) ?

gissuebot commented 9 years ago

Original comment posted by cpovirk@google.com on 2012-07-12 at 02:30 PM


Interesting. I can't get to the site with Firefox, Chrome, or wget. All fail with DNS problems. (Quite possibly they're sharing the DNS code at some level.) Maybe it depends on your OS?

gissuebot commented 9 years ago

Original comment posted by wasserman.louis on 2012-07-12 at 02:31 PM


I can't get to this site, either.

gissuebot commented 9 years ago

Original comment posted by em...@soldal.org on 2012-07-12 at 02:32 PM


I can reach it

gissuebot commented 9 years ago

Original comment posted by em...@soldal.org on 2012-07-12 at 02:32 PM


but chrome first tries to google it, presumably because its not RFC

gissuebot commented 9 years ago

Original comment posted by dimo414 on 2012-07-12 at 02:38 PM


Google serves up such pages in it's search results, www.google.de/search?q=hotels Rottach-Egern currently has:

Hotel Berlin www.hotel-berlin-.tegernsee.de/ 1 Google-Bewertung

As the 7th result (for me)

I'd definitely say this is poor behavior on the site's part, but I feel if people are able to register subdomains with hyphens, InternetDomainName should handle that. Assuming I'm looking in the right place, there's precedence for from() being lenient in similar ways: https://google.github.io/guava/apidocs/com/google/common/net/InternetDomainName.html#from%28java.lang.String%29

gissuebot commented 9 years ago

Original comment posted by wasserman.louis on 2012-07-12 at 02:39 PM


I think I'm convinced. We already grant leniency in a number of areas; this seems like a place where additional leniency is appropriate.

gissuebot commented 9 years ago

Original comment posted by kurt.kluever on 2012-07-24 at 05:39 PM


@cberry: any thoughts on this?


Status: Research Owner: cberry@google.com

gissuebot commented 9 years ago

Original comment posted by cberry@google.com on 2012-07-24 at 05:42 PM


You're correct that rfc1035 disallows this pattern. Now the question is whether we want to create a more permissive exception to this rule. I would lean on the side of not doing so, since (as discussed above) not all tools will agree than such a domain name is well formed.

gissuebot commented 9 years ago

Original comment posted by kevinb@google.com on 2012-07-25 at 12:26 AM


We can already see that these domain names are extremely inadvisable. If we allow them, a user might find only later that they can't do everything with the domain that they thought they would be able to. It seems safer to reject, by that rationale?

gissuebot commented 9 years ago

Original comment posted by dimo414 on 2012-07-25 at 05:05 AM


@kevinb: I have to disagree. Guava isn't in a position to "allow" a domain pattern or not. The RFC forbids it, but we can see that some registrars allow it anyways. To me, the purpose of Guava is to solve common problems, not enforce RFCs. I think it'd be great if someone yelled at the registrar(s) breaking the rules, but that's neither here nor there. The fact of the matter is, these subdomains exist, and currently InternetDomainName explodes when passed a resolvable domain.

For me, and I suspect most other users, we want a convenient way to intelligently parse domain names. To that end, this class has already gone down the "be more permissive" path (see comment #10) so already users don't expect IDN to be a canonical RFC1035 parser.

Perhaps it would make sense to make the parser more permissive, but add an isRfcCompliant() method, that users can query if they are concerned about the domain they're inspecting. Personally, I suspect such a method would hardly ever be used, but I respect your concern, and it might be nice functionality to have.

gissuebot commented 9 years ago

Original comment posted by wasserman.louis on 2012-07-25 at 07:55 AM


I'm still leaning on the side of "be permissive," since that already seems to be the tendency of this method, but I could get behind an isRfcCompliant version, too...

gissuebot commented 9 years ago

Original comment posted by don...@structuredcommons.com on 2013-12-02 at 08:08 PM


Status update?

gissuebot commented 9 years ago

Original comment posted by cberry@google.com on 2013-12-03 at 12:12 AM


This is always going to involve some set of compromises between strict RFC adherence and actual practice on the web. So far the degree of permissiveness already implemented seems to cover nearly all actual needs, so I'm reluctant to reduce strictness further.

gissuebot commented 9 years ago

Original comment posted by dimo414 on 2014-07-09 at 12:52 AM


Just spotted comment #18; I don't understand the point of being both not RFC compliant and not matching actual practices on the web. I'd be fine with either alternative, but being halfway between the two seems completely unhelpful.

gissuebot commented 9 years ago

Original comment posted by cgdecker@google.com on 2014-07-18 at 04:17 PM


Issue #1808 has been merged into this issue.