This PR adds some important features:

- If a crawl's scrape was redirected, the redirect's target URL is now also locked, preventing it from being scraped again later (a sketch of this follows the list).
- `lockURL` now generates similar URLs that will presumably return the same content and locks all of them, preventing the same content from being scraped twice. This can be disabled with the `deduplicateSimilarURLs` crawl parameter (on by default). See the second sketch below.
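A minimal sketch of the redirect case, assuming visited URLs are tracked in a Redis set; the key name (`crawl:<id>:visited`) and the function name are illustrative, not the actual identifiers in the codebase:

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Hypothetical sketch, not the actual implementation: if the URL a scrape
// ended up on differs from the URL that was requested, the request was
// redirected, so the target URL is locked as well. This prevents the same
// page from being scraped again when the crawler later discovers it under
// its post-redirect address.
async function lockRedirectTarget(
  crawlId: string,
  requestedURL: string,
  finalURL: string,
): Promise<void> {
  if (finalURL !== requestedURL) {
    // SADD is a no-op for members already in the set.
    await redis.sadd(`crawl:${crawlId}:visited`, finalURL);
  }
}
```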
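A minimal sketch of the similar-URL locking, under the same Redis-set assumption as above; the specific permutations generated (http/https, with and without a `www.` prefix) are an illustrative guess, not necessarily the exact set the implementation produces:

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Hypothetical sketch: generate URL variants that presumably serve the
// same content, e.g. http/https and www/non-www permutations.
function generateSimilarURLs(url: string): string[] {
  const base = new URL(url);
  const bareHost = base.hostname.replace(/^www\./, "");
  const variants = new Set<string>();
  for (const protocol of ["http:", "https:"]) {
    for (const hostname of [bareHost, `www.${bareHost}`]) {
      const v = new URL(base.href);
      v.protocol = protocol;
      v.hostname = hostname;
      variants.add(v.href);
    }
  }
  return [...variants];
}

// Lock a URL for a crawl. With deduplication enabled (the default), all
// similar URLs are added to the visited set in one SADD; the lock succeeds
// only if every variant was new, since a pre-existing variant means the
// same content was presumably scraped already.
async function lockURL(
  crawlId: string,
  url: string,
  deduplicateSimilarURLs: boolean = true, // crawl parameter, on by default
): Promise<boolean> {
  const urls = deduplicateSimilarURLs ? generateSimilarURLs(url) : [url];
  // SADD returns how many members were newly added.
  const added = await redis.sadd(`crawl:${crawlId}:visited`, ...urls);
  return added === urls.length;
}
```

A caller would typically skip scraping whenever `lockURL` returns `false`, since that means the URL (or a similar variant of it) was already locked.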