Implement URL parser - Githubissues

sandysage commented 6 years ago

Based on https://gist.github.com/linuxwolf/5722b60907826b47a97b10cf36676ca2
Use Mozilla public suffix list: https://publicsuffix.org/

sandysage commented 6 years ago

This needs to come before 'fill'.

jimporter commented 6 years ago

Structurally, my plan here is to implement a matchURL() function that takes a query and an entry, and returns a bool indicating whether it matched. Then we can apply that function over the list of items to determine the matched items. This way, a given query can be mapped to multiple alternatives (e.g. wwww.facebook.com and facebook.com) as well as doing "fancy" things with checking port numbers and such.

One possible enhancement would be for matchURL() to return an integer score ranking the match. That way, better matches are higher-ranked and thus listed earlier in the results. That's something I think we could delay for a while, since it should only be noticeable in certain edge cases.

linuxwolf commented 6 years ago

@jimporter some questions/comments:

What is the query? Is this the full URL as retrieved from the active tab (for instance), or is it something shorter?
While a stretch goal, what are the edge cases you see would be mitigated by a scoring system?
We need to adhere to origin as much as possible, which includes scheme and port. Not doing so can (and has) led to vulnerabilities in other systems.

jimporter commented 6 years ago

@linuxwolf:

The query would be everything before the path in the URL (minus HTTP basic auth username/password, of course), so basically URL.protocol + "//" + URL.host; that will give us protocol, hostname, and port (with port being implicitly-defined if it's the protocol's default). That should be sufficient for any searches we want to do, while also looking readable and being easy enough for a user to type on their own.
1. Our matchURL() function would accept a wider variety of queries (i.e. user-provided ones), but would effectively transform them into a query of the above format before doing anything. For example, if a user enters www.facebook.com, it will be as if they entered https://www.facebook.com (which in turn is equivalent to https://www.facebook.com:443).
Scoring could help if you have accounts on subdomains that you want to be higher-priority than an account on the root domain. I can't think of many real-world sites that do this, but it may help for some intranet sites.
As mentioned a bit in (1), the goal is to provide an easy way to specify the origin we're looking for without losing any of the information we want. Then matchURL() can strictly match the origin. In general, matching the URL is different than matching any of the other fields in an item, since we want to perform an exact match on the URL, whereas we'd be ok with something like substring-match for site name, username, etc.

matchURL(), well, matches the interface for the function we already have to do this: filterItem(). In practice, I expect that we'd have some generic functions that let us do a match on a particular field of an item, and then we could just compose those together to get the final filterItem() function.

linuxwolf commented 6 years ago

As mentioned a bit in (1), the goal is to provide an easy way to specify the origin we're looking for without losing any of the information we want. Then matchURL() can strictly match the origin. In general, matching the URL is different than matching any of the other fields in an item, since we want to perform an exact match on the URL, whereas we'd be ok with something like substring-match for site name, username, etc.

Matching URLs is different, but more complex than an exact match. What wasn't specified in this gist is that both sides of the comparison are parsed and reassembled: the entry's origin(s) and the query argument. It does get complex, but I believe it does a better job of meeting user expectations given available information and providing reasonable guardrails (e.g., secure origins only).

mozilla-lockwise / lockbox-extension

Implement URL parser #521