scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

don't canonicalize URLs before passing them to URL_FINGERPRINT_FUNCTION #335

Open kmike opened 6 years ago

kmike commented 6 years ago

Currently frontera passes a URL to URL_FINGERPRINT_FUNCTION that has already been canonicalized by w3lib's canonicalize_url function. If the API passed the raw URL instead, users could use canonicalize_url options like keep_fragments=True (which can be desirable, e.g. for Splash), or swap the canonicalization implementation altogether. This would be backwards incompatible, though if desired it can be made backwards compatible (by using a different setting, etc.).
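To illustrate the idea, here is a minimal sketch of a fingerprint function that receives a raw URL and does its own canonicalization, keeping the fragment. It uses only the standard library: simple_canonicalize is a simplified, hypothetical stand-in for a real canonicalizer such as w3lib's canonicalize_url, and fragment_aware_fingerprint is an illustrative name, not part of frontera.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def simple_canonicalize(url, keep_fragments=False):
    """Simplified stand-in for a real canonicalizer (illustration only).

    Lowercases the host, sorts query arguments, and optionally drops
    the fragment. It ignores many cases a real canonicalizer handles.
    """
    parts = urlsplit(url)  # urlsplit already lowercases the scheme
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    fragment = parts.fragment if keep_fragments else ""
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", query, fragment)
    )


def fragment_aware_fingerprint(raw_url):
    """Hypothetical fingerprint function that expects a raw URL.

    Because it canonicalizes the URL itself, it can keep the fragment,
    so URLs that differ only in the fragment get different fingerprints.
    """
    canonical = simple_canonicalize(raw_url, keep_fragments=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```

This only works as intended if the function is handed the raw URL: if the fragment has already been stripped upstream, both URLs collapse to the same fingerprint regardless of what the function does.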

sibiryakov commented 6 years ago

"Currently frontera passes a URL to URL_FINGERPRINT_FUNCTION which is already canonicalized by w3lib's canonicalize_url function"

Only if the URL comes from Scrapy's link extractor with canonicalisation enabled. There is also a create_request method, which is used when adding new seeds, generating new URLs in the crawling strategy (CS), and discovering URLs from sitemaps. In other words, this is not always true.

There is probably a better way to manage canonicalisation through the whole pipeline: a dedicated middleware. http://frontera.readthedocs.io/en/latest/topics/frontier-canonicalsolvers.html
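As a rough sketch of what a single canonicalisation point could look like, the snippet below routes every request through one step regardless of where it originated. It does not use frontera's actual middleware or canonical solver API: Request, CanonicalizeStep, and strip_fragment are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from urllib.parse import urldefrag


@dataclass
class Request:
    """Hypothetical stand-in for a frontier request (not frontera's class)."""
    url: str
    meta: dict = field(default_factory=dict)


class CanonicalizeStep:
    """Applies one pluggable canonicalization function to every request."""

    def __init__(self, canonicalize):
        self.canonicalize = canonicalize

    def process(self, request):
        # Keep the raw URL so later stages (e.g. fingerprinting) can still use it.
        request.meta["raw_url"] = request.url
        request.url = self.canonicalize(request.url)
        return request


def strip_fragment(url):
    """Example policy: drop the fragment and leave everything else as-is."""
    return urldefrag(url).url


# Seeds and extracted links go through the same step, so they are
# canonicalized consistently.
step = CanonicalizeStep(strip_fragment)
seed = step.process(Request("http://example.com/a#x"))
extracted = step.process(Request("http://example.com/a#y"))
```

Swapping strip_fragment for another function changes the policy in one place instead of in each code path that creates requests.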