As it stands, the HyperLinkParser appears to assume that a base tag is only valid if its href is an absolute URL. That is not the case, and it goes against the HTML standard, which also allows a relative href. A relative value should be respected as well, as long as it ends with a trailing slash so that relative links can be resolved against it.
A PR is in the works.
Example no. 1
The crawler crawls the page https://www.mydomain.com/images/
The following base tag is present on the page:
<base href="/pages/">
The following link is present on the page:
<a href="subpage.html">Lorem ipsum dolor sit amet</a>
The HyperLinkParser returns the URI https://www.mydomain.com/images/subpage.html, but the expected URI is https://www.mydomain.com/pages/subpage.html.
Example no. 2
The crawler crawls the page https://www.mydomain.com/pages/
The following base tag is present on the page:
<base href="subpages/">
The following link is present on the page:
<a href="subpage.html">Lorem ipsum dolor sit amet</a>
The HyperLinkParser returns the URI https://www.mydomain.com/pages/subpage.html, but the expected URI is https://www.mydomain.com/pages/subpages/subpage.html.
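To make the expected behaviour in both examples concrete, here is a minimal sketch of the two-step resolution in Python, using the standard library's urljoin. It is only an illustration of the resolution rules, not the HyperLinkParser code or the upcoming PR; the helper name resolve_link is hypothetical.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, base_href: str, link_href: str) -> str:
    """Resolve a link the way a browser would when a base tag is present.

    Hypothetical helper for illustration only.
    """
    # Step 1: a relative base href is itself resolved against the page URL.
    effective_base = urljoin(page_url, base_href)
    # Step 2: the link is resolved against the effective base, not the page URL.
    return urljoin(effective_base, link_href)


# Example no. 1: root-relative base href
print(resolve_link("https://www.mydomain.com/images/", "/pages/", "subpage.html"))
# https://www.mydomain.com/pages/subpage.html

# Example no. 2: path-relative base href
print(resolve_link("https://www.mydomain.com/pages/", "subpages/", "subpage.html"))
# https://www.mydomain.com/pages/subpages/subpage.html

# Without the trailing slash, the last path segment of the base is dropped
print(resolve_link("https://www.mydomain.com/pages/", "subpages", "subpage.html"))
# https://www.mydomain.com/pages/subpage.html
```

The last call shows why the trailing slash matters: without it, "subpages" is treated as a file name rather than a directory, so it is replaced when the link is resolved.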