mvdan / xurls

Extract urls from text
BSD 3-Clause "New" or "Revised" License
1.19k stars, 116 forks

add a mode to only get relative urls? #70

rew1nter closed 1 year ago

mvdan commented 1 year ago

What do you mean by relative URLs? Do you have an example?

revett commented 1 year ago

They are referring to relative URLs such as /about/team, whereas https://www.example.com/about/team would be an absolute URL.

mvdan commented 1 year ago

Right. But the problem is that there would be lots of false positives with paths which aren't URLs. For example, if you take a look at https://en.wikipedia.org/wiki/Unix, you see sentences like:

Kernel – source code in /usr/sys, composed of several sub-components: lib – object-code libraries (installed in /lib or /usr/lib).

So we can't simply match any string that looks like an absolute path or relative URL.

Perhaps you meant making the tool more HTML-aware, so that it would match the relative URL in <a href="/relative/url">link text</a>, but not match anything in plain text such as foo bar /relative/url baz. I guess we could do that, but xurls is currently for extracting URLs from text, not from HTML. It is not really aware of HTML, and I would rather not make it start parsing or understanding HTML. If you want to extract URLs from HTML, that's always possible with an HTML parser, and in my opinion that will be much more precise than matching via regular expressions.

mvdan commented 1 year ago

Closing for now per the above. Happy to reopen if I missed anything in my analysis.