Open RiccardoM opened 9 months ago
Probably the Public name is not the best one there could be (as IP addresses are also public), so if anyone has any suggestion please feel free to chime in.
Is this concept of "public URL" well understood or defined somewhere, or did you come up with it here?
Personally, I think the better approach would be to expose more information with the existing regular expressions, like we did in https://github.com/mvdan/xurls/commit/09d66fb475fb3e22da5d04135c6c168f1038d40b, and then you can use Relaxed and do any filtering that you see fit. You could, for example, discard any matches with a scheme other than http or https, discard any hosts which use an IP address, or discard any TLD not part of an allowlist.
This approach wouldn't be significantly faster or slower I think, but what matters is that it would be more configurable to one's needs. Unless "public URL" is a very well defined and understood concept, I think that would be the way to go.
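To make the filtering idea concrete, here is a minimal sketch using only the standard library. It assumes the candidate strings were already extracted by something like the Relaxed matcher; keepCandidate and the allowlist are hypothetical names for illustration, not part of xurls. Under this particular policy, schemeless matches are dropped as well, which is only one of several reasonable choices.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// keepCandidate is a hypothetical helper (not part of xurls). It applies the
// three example filters: http(s) scheme only, no IP-address hosts, and a TLD
// that appears in the caller's allowlist.
func keepCandidate(match string, allowedTLDs map[string]bool) bool {
	u, err := url.Parse(match)
	if err != nil {
		return false
	}
	// Assumption: schemeless matches have an empty Scheme and are dropped too.
	if u.Scheme != "http" && u.Scheme != "https" {
		return false
	}
	host := u.Hostname()
	if net.ParseIP(host) != nil {
		return false // IPv4 or IPv6 literal
	}
	i := strings.LastIndex(host, ".")
	if i < 0 {
		return false // no dot, so no TLD to check
	}
	return allowedTLDs[strings.ToLower(host[i+1:])]
}

func main() {
	// Stand-ins for matches returned by a relaxed URL matcher.
	candidates := []string{
		"https://example.com/foo",
		"ftp://example.com/file",
		"http://192.168.0.1/admin",
		"https://example.xyz",
		"example.org/bar",
	}
	allow := map[string]bool{"com": true, "org": true}
	for _, c := range candidates {
		if keepCandidate(c, allow) {
			fmt.Println(c)
		}
	}
}
```

With this sample input, only https://example.com/foo survives all three filters. Each check is independent, so a caller who disagrees with one of them can simply leave it out.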
I was considering forking and doing a PR for a very similar thing.
I would agree that there might be a lack of consensus on what a Public URL actually is, but the function could be renamed to WebURL or something like that, filtering out all schemes other than http(s). Then you could have a RelaxedWebURL that would allow either http(s) or schemeless URLs.
Alternatively, it would be nice to have a function that accepts a slice of schemes as its input, along with a relaxed boolean to accept schemeless URLs as well.
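As a rough illustration of that second proposal, the sketch below filters already-extracted matches by a slice of schemes plus a relaxed flag. Both allowMatch and schemeOf are hypothetical names, not xurls functions, and the scheme detection is a simple heuristic assumed for this example rather than a full URL parser.

```go
package main

import (
	"fmt"
	"strings"
)

// schemeOf returns the lowercased scheme of a matched URL, or "" when the
// match is schemeless. Heuristic (an assumption of this sketch): the text
// before the first ':' is a scheme only if it contains no '.' or '/', so
// "example.com:8080/x" counts as schemeless.
func schemeOf(match string) string {
	before, _, found := strings.Cut(match, ":")
	if !found || strings.ContainsAny(before, "./") {
		return ""
	}
	return strings.ToLower(before)
}

// allowMatch is a hypothetical filter: it keeps a match whose scheme is in
// schemes, and keeps schemeless matches only when relaxed is true.
func allowMatch(match string, schemes []string, relaxed bool) bool {
	scheme := schemeOf(match)
	if scheme == "" {
		return relaxed
	}
	for _, s := range schemes {
		if scheme == strings.ToLower(s) {
			return true
		}
	}
	return false
}

func main() {
	matches := []string{
		"https://github.com/foo",
		"github.com/foo",
		"ftp://example.com/x",
		"mailto:me@example.com",
	}
	web := []string{"http", "https"}
	for _, m := range matches {
		fmt.Printf("%-24s strict=%-5t relaxed=%t\n",
			m, allowMatch(m, web, false), allowMatch(m, web, true))
	}
}
```

Filtering after matching, rather than building a new regular expression per scheme list, keeps the sketch independent of how the underlying matcher is implemented.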
I can see in the commit you referenced that schemeless URLs were suggested to be filtered in the same manner. Maybe this is the right way, but it seems a bit user unfriendly.
The public suffix list initiative from Mozilla has defined a public suffix (aka effective TLD or eTLD). A good description here: https://pkg.go.dev/golang.org/x/net/publicsuffix. Often people care about eTLD+1.
A "public suffix" is one under which Internet users can (or historically could) directly register names. Some examples of public suffixes are .com, .co.uk and pvt.k12.ma.us. The Public Suffix List is a list of all known public suffixes.
@cspeidel we already use the publicsuffix list for TLDs: https://github.com/mvdan/xurls/blob/09d66fb475fb3e22da5d04135c6c168f1038d40b/generate/tldsgen/main.go#L80
It also occurs to me that this is almost exactly xurls.StrictMatchingScheme(`https?://`) as shown in the existing example. The only difference from what @RiccardoM lists in the original three requirements is that IP addresses are also matched. If we expose a bit more information about what parts of a regular expression were matched, like in 09d66fb475fb3e22da5d04135c6c168f1038d40b, then one could do something like:
```go
rx, err := xurls.StrictMatchingScheme(`https?://`)
if err != nil {
	panic(err)
}
idxTLD := rx.SubexpIndex("tld")
for _, match := range rx.FindAllStringSubmatch(s, -1) {
	if match[idxTLD] == "" {
		continue // skip matches without a TLD, e.g. ipv4 or ipv6
	}
	fmt.Println(match[0])
}
```
This is slightly more code, but it gives the end user a lot more flexibility in choosing what schemes, TLDs, or hostnames are acceptable. For example, I would argue that github.com/foo is as much of a "web URL" as https://github.com/foo, and both are linkified by many apps the same way, so I'm not sure we can agree on a single definition for "public web URL". Similarly, the example I share above matches punycode TLDs as well - if you wanted to not include those, the addition of the tld subexpression index allows you to filter the results however you want.
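For anyone who wants to try the subexpression-index technique without the library, here is a self-contained sketch using only the standard regexp package. The pattern is a deliberately simplified stand-in, not the xurls expression, and simpleURL, publicMatches, and the skipPunycode flag are names made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// simpleURL is a simplified stand-in for a real URL matcher. It names two
// groups: "ipv4" for dotted-quad hosts and "tld" for the last label of a
// domain name (punycode "xn--" labels included).
var simpleURL = regexp.MustCompile(
	`(?i)https?://(?:(?P<ipv4>\d{1,3}(?:\.\d{1,3}){3})|` +
		`[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?P<tld>xn--[a-z0-9]+|[a-z]{2,}))` +
		`(?:/\S*)?`)

// publicMatches returns matches that have a TLD, optionally dropping
// punycode TLDs as well. Hypothetical helper for illustration only.
func publicMatches(s string, skipPunycode bool) []string {
	idxTLD := simpleURL.SubexpIndex("tld")
	var out []string
	for _, m := range simpleURL.FindAllStringSubmatch(s, -1) {
		tld := strings.ToLower(m[idxTLD])
		if tld == "" {
			continue // the host was an IP address, so the tld group is empty
		}
		if skipPunycode && strings.HasPrefix(tld, "xn--") {
			continue
		}
		out = append(out, m[0])
	}
	return out
}

func main() {
	s := "see https://github.com/foo and http://192.168.0.1/admin " +
		"plus https://example.xn--p1ai/x"
	fmt.Println(publicMatches(s, false))
	fmt.Println(publicMatches(s, true))
}
```

The filtering policy lives entirely in the loop, so changing what counts as acceptable means editing a couple of conditions rather than the regular expression.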
All the above said, I agree that there should be top-level funcs for common patterns, and that's why I added xurls.StrictMatchingScheme with the example in the first place. I'm just not convinced that we all want the same out of a "public web URL" feature, so the only option that I think is reasonable is to offer more flexibility.
This PR adds a new Public function that returns a regex that matches public URLs. Such URLs are defined as:
- http or https as their protocol (all other protocols will not be matched)