postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
MIT License

Following redirects #56

Open ZackMattor opened 7 years ago

ZackMattor commented 7 years ago

Howdy! Just wondering if I'm implementing this right. I need to follow redirects, and there doesn't seem to be an option to toggle this, so I tried implementing it this way. It seems to work, but I'd like some feedback!

Spidr.site(@url, max_depth: 2, limit: 20) do |spider|
  spider.every_redirect_page do |page|
    spider.visit_hosts << URI.parse(page.location).host
    spider.enqueue page.location
  end
end

ZackMattor commented 7 years ago

Seems to throw an error if the location is "index.html" or similar...

postmodern commented 7 years ago

Is the error coming from spidr or your code example? page.location grabs the Location header which may not always be absolute. Maybe try page.to_absolute(page.location)?
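To illustrate why a relative Location value can cause trouble, here is a minimal sketch that uses only Ruby's standard `uri` library. The `absolute_location` helper and the example.com URLs are made up for illustration and are not part of Spidr; inside Spidr, the `page.to_absolute(page.location)` call suggested above is the thing to use.

```ruby
require 'uri'

# Hypothetical helper (not part of Spidr): resolve a Location header value
# against the URL of the page that returned it.
def absolute_location(page_url, location)
  URI.join(page_url.to_s, location.to_s)
end

# A relative Location such as "index.html" has no host of its own,
# so URI.parse(...).host is nil.
relative_host = URI.parse('index.html').host

# Resolved against the page URL, the same value becomes a full URL.
resolved = absolute_location('http://example.com/docs/old', 'index.html')

puts relative_host.inspect
puts resolved
puts resolved.host
```

In the snippet from the original post, a nil host would be appended to `visit_hosts` and the bare "index.html" string would be passed to `enqueue`, which is a plausible source of the reported error.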

chamnap commented 7 years ago

Probably should add to README.

postmodern commented 2 years ago

Spidr should automatically follow redirects, so the above code is redundant. The Page#each_url method converts everything yielded by Page#each_link to an absolute URL. Page#each_link in turn calls Page#each_redirect, which checks for the Location header. If you manually use page.location, it may not always be an absolute URL, so you'll need to call page.to_absolute(page.location).

I might consider adding Page#redirect_urls or Page#location_urls which would return absolute URLs for convenience.
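For anyone who still wants to handle page.location by hand, the advice above could be sketched roughly as follows. The `follow_redirect` helper is hypothetical and not part of Spidr; it assumes that `page.to_absolute` returns a URI object (or nil when the value cannot be parsed) and that `spider.visit_hosts` behaves like an array, as in the original snippet.

```ruby
# Hypothetical helper (not part of Spidr): allow and enqueue a redirect target.
# `spider` is expected to respond to visit_hosts and enqueue, and `page` to
# location and to_absolute, as in the snippets earlier in this thread.
def follow_redirect(spider, page)
  location = page.location
  return if location.nil?

  # Assumption: to_absolute returns an absolute URI even for "index.html".
  url = page.to_absolute(location)
  return if url.nil? || url.host.nil?

  # Only add the host if it is not already allowed.
  spider.visit_hosts << url.host unless spider.visit_hosts.include?(url.host)
  spider.enqueue(url)
end

# Usage inside the original block (requires the spidr gem):
#
#   Spidr.site(@url, max_depth: 2, limit: 20) do |spider|
#     spider.every_redirect_page { |page| follow_redirect(spider, page) }
#   end
```

Since Spidr is said to follow redirects on its own, a helper like this should only matter when redirects need to reach hosts that the spider would otherwise skip.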