medialize / URI.js

Javascript URL mutation library
http://medialize.github.io/URI.js/
MIT License

Constructor assigning IP addresses to path instead of host #126

Closed indolering closed 10 years ago

indolering commented 10 years ago

uri = URI("208.113.212.187")
uri.path -> 208.113.212.187
_string -> ""

It may not be RFC compliant, but no one adds http:// to their IP addresses.

ooxi commented 10 years ago

Well, I don't think it's possible to cover all use cases, especially since 208.113.212.187 is a relative path when interpreted as a URI.

rodneyrehm commented 10 years ago

Unless you prepend your URI with //, the parser has no (definitive) way of distinguishing host from path. We certainly could match a given string against /^(www\.|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/i - but that would break things for "poorly" named directories. I'm not sure this is a route I want to take…
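To illustrate the trade-off, here is a small sketch that runs the pattern above against a few inputs. The helper name `looksLikeSchemelessHost` is made up for this example and is not part of URI.js.

```javascript
// The heuristic pattern quoted in the comment above.
const schemelessHost = /^(www\.|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/i;

// Hypothetical helper (not URI.js API): true when the input starts with
// "www." or with four dot-separated groups of 1-3 digits.
function looksLikeSchemelessHost(input) {
  return schemelessHost.test(input);
}

const samples = [
  "208.113.212.187",         // most likely meant as a host
  "www.example.org/index",   // most likely meant as a host
  "1.2.3.4/changelog.txt",   // could just as well be a directory named "1.2.3.4"
  "999.999.999.999",         // matches the pattern but is not a valid IPv4 address
  "docs/1.2.3.4/index.html", // no match, because the pattern is anchored at the start
];

for (const sample of samples) {
  console.log(sample, looksLikeSchemelessHost(sample));
}
```

The third and fourth samples are the "poorly" named cases: the pattern cannot tell a versioned directory from an address, and it does not check that each octet is at most 255.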

rodneyrehm commented 10 years ago

… but I'm open to discussing how we can add this as a plugin / opt-in behavior!

indolering commented 10 years ago

I was going to extend the URI.js prototype to simply force the behavior you describe by default. However, you could throw in a 'sniff' function that uses heuristics instead of the RFC to determine the URL type. You could also enable an optional parameter for a custom lexer, function(), or (shudder) regular expressions to define such additions on the fly. I'm going to have to add in URIs for I2P, Tor, and Freenet, so it would be pretty handy.

rodneyrehm commented 10 years ago

I2P, Tor, and Freenet

?

The simplest and dumbest monkey patch right now is:

URI.parse = function(parse){ 
  var schemalessHost = /^(www\.|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/i;
  return function(){
    if (arguments[0].match(schemalessHost)) {
      arguments[0] = '//' + arguments[0];
    }
    return parse.apply(this, arguments);
  };
}(URI.parse);

URI("123.123.123.123").host() === "123.123.123.123";

indolering commented 10 years ago

Yes, that's along the lines of what I was planning on doing originally. But in terms of a general mechanism for additional URI types, I was thinking of something along the lines of array.sort(function(a,b){ .... return a}), so:

URI("123.123.123.123", function(uri){
  // proposed callback: rewrite the raw string before it is parsed
  if (uri.match(/^(www\.|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/i)) {
    uri = '//' + uri;
  }

  return uri;
});

While I think a general mechanism like that is useful, it would be much nicer to just pass in type information like: URI("123.123.123.123", 'ip'). Of course, the reason I am using URI.js is so I can throw URI("123.123.123.123/random/stuff.html#!appstate", 'ip') and not worry about parsing all the junk at the end. I'm not sure how this hinting would factor into your code's logic.

Overall, it would be nice if there was a separate heuristic parser that matched the logic used by the browser vendors. I know it is hardly standardized, but it's the baseline that people 'test' against, and a raw IPv4 or IPv6 address being used as a relative directory is an extreme edge case. Whether it is a function such as URI.sniff("123.123.123.123") or a boolean passed along with the standard constructor, URI("123.123.123.123", true), I would think that a heuristic parser would be pretty core to URI.js's mission.
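A minimal sketch of what such a heuristic pre-pass might look like, kept outside the parser so the RFC behaviour stays untouched. The name `sniff` comes from the suggestion above; its signature, its return value, and the `isIPv4` helper are assumptions for illustration and not part of URI.js. It only rewrites the string, which would then be handed to the normal constructor.

```javascript
// Hypothetical helpers, not part of URI.js.

// True when the candidate is four dot-separated decimal octets in 0-255.
function isIPv4(candidate) {
  const parts = candidate.split(".");
  if (parts.length !== 4) return false;
  return parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
}

// Prepend "//" when the input starts with something that looks like a host.
function sniff(input) {
  // Leave anything that already has a scheme or an authority marker alone.
  if (/^[a-z][a-z0-9+.-]*:/i.test(input) || input.startsWith("//")) {
    return input;
  }
  // The first segment ends at the first "/", "?" or "#".
  const firstSegment = input.split(/[\/?#]/, 1)[0];
  // Drop an optional ":port" suffix before testing for an IPv4 address.
  const hostPart = firstSegment.replace(/:\d+$/, "");
  if (isIPv4(hostPart) || /^www\./i.test(firstSegment)) {
    return "//" + input;
  }
  return input;
}

console.log(sniff("123.123.123.123/random/stuff.html#!appstate"));
console.log(sniff("docs/readme.md"));
```

Unlike the bare regular expression, this version rejects octets above 255, but it still misreads a directory that is literally named like an IP address, so it only makes sense as an opt-in step.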

In regards to additional types, I will (eventually) need to parse non-standard URIs such as 3g2upl4pq6kufc4m.onion and exotic hash URIs. I doubt I will be the one coding it, but I spent some time looking into how your .is() function works and I was planning on just adding types there and using the manual construction function. However, it would be nice if there was a general mechanism for adding arbitrary URI types and parsing logic (I suspect that the URI template system might be useful here).
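One way a general mechanism for extra types could look is a small registry of named predicates. This is only a sketch of the idea: `registerHostType` and `classifyHost` are hypothetical names, and nothing here reflects how URI.js's `.is()` is actually implemented.

```javascript
// Hypothetical registry of extra host types; none of this is URI.js API.
const hostTypes = [];

// Register a named type together with a predicate that recognises it.
function registerHostType(name, test) {
  hostTypes.push({ name, test });
}

// Return the name of the first registered type whose predicate accepts
// the host, or null when nothing matches.
function classifyHost(host) {
  const match = hostTypes.find((type) => type.test(host));
  return match ? match.name : null;
}

// Tor onion names: 16 base32 characters (a-z, 2-7) for v2, 56 for v3.
registerHostType("onion", (host) =>
  /^([a-z2-7]{16}|[a-z2-7]{56})\.onion$/i.test(host)
);
// Names under the .bit pseudo-TLD.
registerHostType("bit", (host) => /\.bit$/i.test(host));

console.log(classifyHost("3g2upl4pq6kufc4m.onion"));
console.log(classifyHost("example.org"));
```

Types are checked in registration order, so a more specific predicate should be registered before a more general one.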

And since I am pontificating on URI parsing, I feel that I should express that I am bitterly opposed to regex; lexers are ... less awful! :D

rodneyrehm commented 10 years ago

reopen to look at this again for the next version

ooxi commented 10 years ago

I really don't get your point, why would anybody specify 123.123.123.123/random/stuff.html as URI if he means ://123.123.123.123/random/stuff.html?

indolering commented 10 years ago

I really don't get your point, why would anybody specify 123.123.123.123/random/stuff.html as URI if he means ://123.123.123.123/random/stuff.html?

Well, IPv6 has bajillions of addresses, which is pretty useful if you are trying to avoid censorship via IP blacklists. In my case, I'm trying to create a JavaScript DNS router for .bit URLs, which requires de-mangling URL hacks.

rodneyrehm commented 10 years ago

What's the consensus on this? I'm reluctant to add heuristics to URI.parse()

ooxi commented 10 years ago

IMO we should keep the code as it is. Heuristics should not be part of the core library when they change current behaviour.

rodneyrehm commented 10 years ago

Well then, no heuristics it is.