Closed rew1nter closed 1 year ago
They are referring to /about/team, whereas https://www.example.com/about/team would be an absolute URL.
Right. But the problem is that there would be lots of false positives with paths which aren't URLs. For example, if you take a look at https://en.wikipedia.org/wiki/Unix, you see sentences like:
Kernel – source code in /usr/sys, composed of several sub-components: lib – object-code libraries (installed in /lib or /usr/lib).
So we can't simply match any string that looks like an absolute path or relative URL.
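To make the false-positive problem concrete, here is a minimal sketch in Python. The pattern and the helper name `naive_relative_urls` are hypothetical, written only for illustration; this is not how xurls matches anything. It treats any slash-led path that follows whitespace as a "relative URL" and runs it over the Wikipedia sentence quoted above.

```python
import re
from typing import List

# Hypothetical naive pattern (an assumption for illustration, not xurls' regex):
# a path starting with "/" that sits at the start of the text or after whitespace.
NAIVE_RELATIVE_URL = re.compile(r"(?<!\S)/[\w.\-]+(?:/[\w.\-]+)*")


def naive_relative_urls(text: str) -> List[str]:
    """Return every slash-led path in text, whether or not it is really a URL."""
    return NAIVE_RELATIVE_URL.findall(text)


# The sentence from the Unix article contains filesystem paths, not URLs.
prose = (
    "Kernel – source code in /usr/sys, composed of several sub-components: "
    "lib – object-code libraries (installed in /lib or /usr/lib)."
)
# A sentence where the path really is meant as a relative URL.
link = "See /about/team for details."

print(naive_relative_urls(prose))
print(naive_relative_urls(link))
```

Both calls return matches, and nothing in the text itself tells the filesystem paths apart from the intended relative URL.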
Perhaps you meant making the tool more HTML-aware, so that it would match the relative URL in <a href="/relative/url">link text</a>, but not match anything in foo bar /relative/url baz. I guess we could do that, but xurls is currently for extracting URLs from text, not from HTML. It is not aware of HTML, and I would rather not make it start parsing or understanding HTML. If you want to extract URLs from HTML, you can always do that with an HTML parser. In my opinion, that will be much more precise than matching via regular expressions.
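As a rough sketch of the HTML-parser route, the following uses Python's standard `html.parser` and `urllib.parse.urljoin`. The class `HrefCollector`, the function `extract_hrefs`, and the base URL are assumptions made up for this example and are not part of xurls. Only `href` attributes on `<a>` tags are collected, so the bare path in the surrounding text is ignored.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects href values from <a> tags; plain text content is ignored."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html_text, base_url=None):
    """Return link targets from html_text, resolved against base_url if given."""
    collector = HrefCollector()
    collector.feed(html_text)
    collector.close()
    if base_url is None:
        return collector.hrefs
    return [urljoin(base_url, href) for href in collector.hrefs]


page = '<p>foo bar /relative/url baz <a href="/relative/url">link text</a></p>'

print(extract_hrefs(page))
print(extract_hrefs(page, "https://www.example.com/about/"))
```

Passing a base URL is optional. It is shown here because a relative URL only becomes usable once it is resolved against the page it came from.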
Closing for now per the above. Happy to reopen if I missed anything in my analysis.
What do you mean by relative URLs? Do you have an example?