sjdirect / abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Apache License 2.0

Base URI logic of the HyperLinkParser doesn't respect terminated relative base tag values #231

Closed: thedeedawg closed this issue 3 years ago

thedeedawg commented 3 years ago

Description

As it stands, the HyperLinkParser seems to assume that a base tag is valid only if its href is an absolute URL. That assumption goes against the HTML standard, under which a relative base href is resolved against the document's own URL. A relative value should therefore be respected, as long as it is terminated with a trailing slash so that relative links can be built on top of it.

A PR is in the works.


Example no. 1

  1. The crawler crawls the page https://www.mydomain.com/images/
    • The following base tag is present on the page
      <base href="/pages/">
    • The following link is present on the page
      <a href="subpage.html">Lorem ipsum dolor sit amet</a>
  2. The HyperLinkParser returns the URI https://www.mydomain.com/images/subpage.html, but the expected result is https://www.mydomain.com/pages/subpage.html

Example no. 2

  1. The crawler crawls the page https://www.mydomain.com/pages/
    • The following base tag is present on the page
      <base href="subpages/">
    • The following link is present on the page
      <a href="subpage.html">Lorem ipsum dolor sit amet</a>
  2. The HyperLinkParser returns the URI https://www.mydomain.com/pages/subpage.html, but the expected result is https://www.mydomain.com/pages/subpages/subpage.html
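
The expected results in both examples follow from a two-step resolution: first resolve the base href against the page URL, then resolve the link href against that result. The snippet below is only a rough sketch of that logic, written in Python with the standard library's `urljoin` as a stand-in. It is not Abot's code, and `resolve_link` is a hypothetical helper name used purely for illustration.

```python
from urllib.parse import urljoin


def resolve_link(page_url, base_href, link_href):
    """Hypothetical helper: resolve a link the way the examples expect."""
    # Step 1: the base href may itself be relative, so resolve it against
    # the URL of the crawled page. With no base tag, fall back to the page URL.
    base_url = urljoin(page_url, base_href) if base_href else page_url
    # Step 2: resolve the link against the effective base URL.
    return urljoin(base_url, link_href)


# Example no. 1: root-relative base href
example_1 = resolve_link(
    "https://www.mydomain.com/images/", "/pages/", "subpage.html"
)

# Example no. 2: path-relative base href
example_2 = resolve_link(
    "https://www.mydomain.com/pages/", "subpages/", "subpage.html"
)

print(example_1)
print(example_2)
```

Under these assumptions, the two calls yield the expected URIs from the examples rather than the ones currently returned. The trailing slash matters in step 2: without it, the last path segment of the base would be replaced instead of extended.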

sjdirect commented 3 years ago

Is this handled correctly by this PR?

thedeedawg commented 3 years ago

Yes, closing.