stewartmckee / cobweb

Web crawler with very flexible crawling options. Can be used standalone or with Resque to perform clustered crawls.
MIT License
227 stars 45 forks

Relative URLs are handled properly #45

peric closed 8 years ago

peric commented 8 years ago

The presence of @base_url (which defaults to '') should not cause the crawled @url to be overwritten.

However, if @base_url is set, we first need to call join_no_fragment to join it with the link, and then join that result with @url.

In that case, when we have, for example:

@url = https://www.github.com
@base_url = /assets
link = image/awesome_image.png

The result (after UriHelper.join_no_fragment(@url, UriHelper.join_no_fragment(@base_url, link))) will be:

link = https://www.github.com/assets/awesome_image.png
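
For illustration, the idea can be sketched with Ruby's standard `uri` library. This is a minimal sketch and not cobweb's `UriHelper`: `resolve_link` is a hypothetical helper, and because `URI.join` from the standard library needs an absolute first argument, the sketch resolves the base href against the page URL first and then resolves the link against that result. The trailing slash on `/assets/` is an assumption made here: under RFC 3986 resolution, a base of `/assets` without the slash has its last path segment replaced by the link.

```ruby
require "uri"

# Hypothetical helper (not part of cobweb): resolve a link found on a page,
# honouring an optional base href that may itself be relative.
def resolve_link(page_url, base_url, link)
  # An empty base href must not replace the crawled page URL.
  base = base_url.to_s.empty? ? URI(page_url) : URI.join(page_url, base_url)
  resolved = URI.join(base, link)
  resolved.fragment = nil # drop any "#fragment", as join_no_fragment's name suggests
  resolved
end

# No base href: the link is resolved against the page URL alone.
puts resolve_link("https://www.github.com", "", "image/awesome_image.png")
# → https://www.github.com/image/awesome_image.png

# Relative base href with a trailing slash: the link lands under /assets/.
puts resolve_link("https://www.github.com", "/assets/", "image/awesome_image.png")
# → https://www.github.com/assets/image/awesome_image.png
```

An absolute base href also works with this sketch, since `URI.join` returns the second argument unchanged when it is already absolute.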

stewartmckee commented 8 years ago

Interesting, I had always thought the base href should be absolute, but relative ones are allowed too.

Thanks, Stewart.