helgeho / Web2Warc

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)
MIT License

followRedirects = true does not work on seeds #5

Closed dportabella closed 5 years ago

dportabella commented 6 years ago

Here, the seeds contain only one page, which returns "301 Moved Permanently". Even though followRedirects is set to true, the crawl stops there. It works if I replace the seed URL with one that returns 200 (then it crawls to depth 5, as expected).

  Web2Warc.writer.path = "temp.warc.gz"
  Web2Warc.spec.maxLevel = 5
  Web2Warc.spec.followRedirects = true
  Web2Warc.spec.increaseLevelOnRedirect = false
//  Web2Warc.seeds = Set("https://epfl.ch")   // with this seed, the crawler crawls to depth 5
  Web2Warc.seeds = Set("http://epfl.ch")  // with this seed, the crawler archives only this page
  Web2Warc.run()

helgeho commented 6 years ago

Hi David! That's a good catch. I guess the reason is that, by default, the canonical form of a URL is its SURT form: https://github.com/helgeho/Web2Warc/blob/master/src/main/scala/de/l3s/web2warc/crawling/components/CrawlStrategyDef.scala#L31 . The protocol (http vs. https) is therefore ignored, so the redirect location looks the same as the source to the crawler.
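To illustrate the collision, here is a minimal sketch. Note that `simpleSurt` is a hypothetical, simplified stand-in for a SURT-style form (reversed host labels plus path, scheme dropped), not the function linked above, and the redirect target `https://epfl.ch` is an assumption based on the seeds in the report.

```scala
import java.net.URI

object SurtCollision {
  // Hypothetical, simplified SURT-like form: reversed host labels plus path.
  // The scheme (http vs. https) is deliberately dropped, as in the default case described above.
  def simpleSurt(url: String): String = {
    val uri = new URI(url)
    val reversedHost = uri.getHost.toLowerCase.split('.').reverse.mkString(",")
    val path = Option(uri.getPath).filter(_.nonEmpty).getOrElse("/")
    s"$reversedHost)$path"
  }

  def main(args: Array[String]): Unit = {
    val seed = "http://epfl.ch"            // the seed that answers with a 301
    val redirectTarget = "https://epfl.ch" // assumed redirect location
    println(simpleSurt(seed))              // ch,epfl)/
    println(simpleSurt(redirectTarget))    // ch,epfl)/
    println(simpleSurt(seed) == simpleSurt(redirectTarget)) // true
  }
}
```

Both URLs map to the same canonical string, which would explain why the redirect is not treated as a new page.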

The easiest way to "fix" this is simply to use the original URL as the canonical form, like so: Web2Warc.strategyDef.canonicalUrl = url => url
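As a rough sketch of why this helps, the snippet below models de-duplication with a set of visited canonical forms. That model is an assumption for illustration only, not Web2Warc's actual code, and `schemeInsensitive`, `identityForm` and `redirectIsFollowed` are hypothetical names.

```scala
object CanonicalDedup {
  // Hypothetical stand-in for a canonical form that ignores the scheme (not the library's SURT code).
  val schemeInsensitive: String => String = url => url.replaceFirst("^https?://", "")

  // The workaround: the original URL is used unchanged as its canonical form.
  val identityForm: String => String = url => url

  // Assumed model: a redirect target is fetched only if its canonical form
  // differs from that of the already visited seed.
  def redirectIsFollowed(canonicalUrl: String => String, seed: String, target: String): Boolean = {
    val visited = Set(canonicalUrl(seed))
    !visited.contains(canonicalUrl(target))
  }

  def main(args: Array[String]): Unit = {
    println(redirectIsFollowed(schemeInsensitive, "http://epfl.ch", "https://epfl.ch")) // false
    println(redirectIsFollowed(identityForm, "http://epfl.ch", "https://epfl.ch"))      // true
  }
}
```

The trade-off is that with the identity function, URLs that differ only in protocol are no longer treated as the same page.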

Can you please check if this solves your problem? I am considering making this the default behavior in the next version.

dportabella commented 6 years ago

Yes, this fixed the issue, thx!

dportabella commented 5 years ago

Hi, while this workaround fixed the issue, disabling the canonical URL function does not look like an optimal solution. Is there a better solution?