Description
URL uniqueness was introduced in #65. It should be refactored into a standalone class that can be used by both Crawler and Scraper.

Proposed solution
The logic for determining the domainless URL (e.g. `getDomainlessUrl`) should be the same for both, and tracking unique URLs should be possible in either crawl or scrape.
`MigrateCrawlQueue.php` should have its `add()` function updated to use the standard mechanism for adding domainless URLs, and to support the new `include_query` and `include_fragment` options.
Tests should be written to ensure that the crawler and scraper determine unique URLs correctly.
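
As a starting point for discussion, here is a minimal sketch of what such a standalone class might look like. It is not the existing implementation: the class name `UniqueUrlTracker`, the constructor signature, the `false` defaults for both options, and the fallback for unparseable URLs are all assumptions made for illustration. Only `getDomainlessUrl`, `add()`, `include_query` and `include_fragment` come from this issue. It assumes PHP 8.0 or later.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical shared tracker for Crawler and Scraper.
 * The class name and constructor are assumptions, not existing code.
 */
final class UniqueUrlTracker
{
    /** @var array<string, true> Domainless URLs seen so far. */
    private array $seen = [];

    public function __construct(
        private bool $includeQuery = false,    // maps to include_query (default assumed)
        private bool $includeFragment = false  // maps to include_fragment (default assumed)
    ) {
    }

    /** Reduce a URL to its path, optionally keeping the query and fragment. */
    public function getDomainlessUrl(string $url): string
    {
        $parts = parse_url($url);
        if ($parts === false) {
            // Assumption: fall back to the raw string for unparseable URLs.
            return $url;
        }

        $result = $parts['path'] ?? '';
        if ($result === '') {
            // Treat a bare domain as the root path.
            $result = '/';
        }
        if ($this->includeQuery && isset($parts['query'])) {
            $result .= '?' . $parts['query'];
        }
        if ($this->includeFragment && isset($parts['fragment'])) {
            $result .= '#' . $parts['fragment'];
        }

        return $result;
    }

    /** Record a URL; returns true only the first time its domainless form is seen. */
    public function add(string $url): bool
    {
        $key = $this->getDomainlessUrl($url);
        if (isset($this->seen[$key])) {
            return false;
        }
        $this->seen[$key] = true;

        return true;
    }
}
```

With a shape like this, `MigrateCrawlQueue::add()` and the scraper could both delegate to the same `add()` call, and the tests could exercise the tracker directly with each combination of the two options.
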

Additional context
None