Base URI logic of the HyperLinkParser doesn't respect terminated relative base tag values

thedeedawg commented 3 years ago

Description

As it stands, the parser seems to assume that a base tag is valid only if its href is an absolute URL. That assumption goes against the HTML standard, which allows a relative base href and resolves it against the URL of the page. A relative value should be respected as long as it is terminated with a trailing slash, so that relative links can be built upon it.

A PR is in the works.
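
To illustrate the resolution order described above, here is a minimal sketch in Python using the standard library's `urljoin`. It is not Abot's code: the function name `resolve_link` and its parameters are made up for this illustration. It resolves the base href against the page URL first, then resolves the link against that result.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_link(page_url: str, base_href: Optional[str], link_href: str) -> str:
    """Resolve link_href on a page that may carry a <base href> value.

    Hypothetical helper for illustration only, not part of Abot.
    """
    # Step 1: the base href may itself be relative, so resolve it
    # against the URL of the page it was found on.
    base_url = urljoin(page_url, base_href) if base_href else page_url
    # Step 2: resolve the link against the effective base URL.
    return urljoin(base_url, link_href)


if __name__ == "__main__":
    # Terminated relative base: the link is appended under it.
    print(resolve_link("https://www.mydomain.com/pages/", "subpages/", "subpage.html"))
    # Unterminated relative base: the last path segment is replaced.
    print(resolve_link("https://www.mydomain.com/pages/", "subpages", "subpage.html"))
```

The second call shows why the trailing slash matters: without it, `subpages` is treated as a file name and is replaced when the link is resolved, so the link ends up directly under `/pages/`.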

Example no. 1

The crawler crawls the page https://www.mydomain.com/images/
- The following base tag is present on the page
```
<base href="/pages/">
```
- The following link is present on the page
```
<a href="subpage.html">Lorem ipsum dolor sit amet</a>
```
The HyperLinkParser returns the URI https://www.mydomain.com/images/subpage.html, but the expected result is https://www.mydomain.com/pages/subpage.html.

Example no. 2

The crawler crawls the page https://www.mydomain.com/pages/
- The following base tag is present on the page
```
<base href="subpages/">
```
- The following link is present on the page
```
<a href="subpage.html">Lorem ipsum dolor sit amet</a>
```
The HyperLinkParser returns the URI https://www.mydomain.com/pages/subpage.html, but the expected result is https://www.mydomain.com/pages/subpages/subpage.html.
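
Both examples can be reproduced outside the crawler with a small Python sketch. The function `resolve_absolute_only` is only an assumed model of the reported behavior (a base href is honored only when it is absolute), not the actual HyperLinkParser code. `resolve_with_relative_base` models the expected behavior. Both names are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def resolve_absolute_only(page_url: str, base_href: str, link_href: str) -> str:
    # Assumed model of the reported behavior: a relative base href is
    # ignored and the page URL is used as the base instead.
    parsed = urlparse(base_href)
    is_absolute = bool(parsed.scheme and parsed.netloc)
    base_url = base_href if is_absolute else page_url
    return urljoin(base_url, link_href)


def resolve_with_relative_base(page_url: str, base_href: str, link_href: str) -> str:
    # Expected behavior: resolve the base href against the page URL,
    # then resolve the link against the resulting base URL.
    return urljoin(urljoin(page_url, base_href), link_href)


# (page URL, base href, link href) taken from the two examples above.
CASES = [
    ("https://www.mydomain.com/images/", "/pages/", "subpage.html"),
    ("https://www.mydomain.com/pages/", "subpages/", "subpage.html"),
]

if __name__ == "__main__":
    for page, base, link in CASES:
        print("reported:", resolve_absolute_only(page, base, link))
        print("expected:", resolve_with_relative_base(page, base, link))
```

Under that assumption, the first function yields the URIs reported in the two examples and the second yields the expected ones.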

sjdirect commented 3 years ago

Is this handled correctly by this PR?

thedeedawg commented 3 years ago

Yes, closing.

sjdirect / abot

Base URI logic of the HyperLinkParser doesn't respect terminated relative base tag values #231
