Closed dportabella closed 5 years ago
Hi David! That's a good catch. I guess the reason is that, by default, the canonical form of a URL is its SURT form: https://github.com/helgeho/Web2Warc/blob/master/src/main/scala/de/l3s/web2warc/crawling/components/CrawlStrategyDef.scala#L31 . The SURT form ignores the protocol (http vs. https), so a redirect from http to https has the same canonical form as the source URL and is presumably treated as already visited.
The easiest way to "fix" this is to simply use the original URL as the canonical form, like so:
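To illustrate the collision, here is a minimal sketch of a SURT-style key. `SurtSketch` and `surtLike` are hypothetical names, and the function is a simplified stand-in (reversed host labels plus path), not Web2Warc's actual implementation:

```scala
import java.net.URI

object SurtSketch {
  // Hypothetical, simplified SURT-style key: reversed host labels plus path.
  // The scheme is deliberately dropped, which is the behavior being illustrated.
  def surtLike(url: String): String = {
    val uri = new URI(url)
    val host = uri.getHost.toLowerCase.split('.').reverse.mkString(",")
    val path = Option(uri.getRawPath).filter(_.nonEmpty).getOrElse("/")
    s"$host)$path"
  }

  def main(args: Array[String]): Unit = {
    val source = "http://example.com/page"
    val redirectTarget = "https://example.com/page"
    // Both URLs map to the same key, so the redirect target looks like the source.
    println(surtLike(source) == surtLike(redirectTarget)) // prints "true"
  }
}
```

With a key like this, an http to https redirect cannot be told apart from the page it came from.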
Web2Warc.strategyDef.canonicalUrl = url => url
Can you please check if this solves your problem? I am considering making this the default behavior for the next version.
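A possible middle ground between SURT and the raw URL would be a canonical form that keeps the protocol but still normalizes case, dot segments, and fragments. The following is only a sketch: `SchemeAwareCanonical` is a hypothetical helper that is not part of Web2Warc, and it assumes `canonicalUrl` accepts any `String => String` function, as the assignment above suggests.

```scala
import java.net.URI

object SchemeAwareCanonical {
  // Hypothetical helper (not part of Web2Warc): keeps the scheme so that
  // http and https stay distinct, but normalizes the rest of the URL.
  def canonical(url: String): String = {
    val uri = new URI(url.trim).normalize() // resolves "." and ".." path segments
    val scheme = Option(uri.getScheme).map(_.toLowerCase).getOrElse("http") // assumed default
    val host = Option(uri.getHost).map(_.toLowerCase).getOrElse("")
    val port = if (uri.getPort == -1) "" else s":${uri.getPort}"
    val path = Option(uri.getRawPath).filter(_.nonEmpty).getOrElse("/")
    val query = Option(uri.getRawQuery).map("?" + _).getOrElse("")
    s"$scheme://$host$port$path$query" // the fragment is dropped on purpose
  }
}
```

If the assumption about the function type holds, it could be plugged in as `Web2Warc.strategyDef.canonicalUrl = SchemeAwareCanonical.canonical`. Unlike the identity function, this still collapses trivial variants such as `HTTP://Example.COM/page#top` and `http://example.com/page`.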
Yes, this fixed the issue, thx!
Hi, while this workaround fixed the issue, disabling the canonical URL function does not look like an optimal solution. Is there a better solution?
Here, the seeds contain only one page, which is a "301 Moved Permanently" page. Even if followRedirects is set to true, the crawl stops here. It works if I replace the seed URL with a 200 page (then it crawls with depth 5, as expected).