salsadigitalauorg / merlin-framework

Merlin - migration framework
GNU General Public License v3.0

URL uniqueness cleanup #68

Closed (stooit closed 4 years ago)

stooit commented 5 years ago

Description: URL uniqueness was introduced in #65.

It should be refactored into a standalone class that both Crawler and Scraper can use.

Proposed solution: The logic for determining the domainless URL (e.g. getDomainlessUrl) should be the same for both Crawler and Scraper, and tracking unique URLs should be possible in either crawl or scrape.
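To make the idea concrete, here is a minimal sketch of a shared uniqueness tracker. It is written in Python purely for illustration (the project itself is PHP), and the names `domainless_url`, `UniqueUrlTracker` and `is_new` are hypothetical, not part of merlin-framework. It assumes "domainless" means the URL with scheme and host stripped, leaving only the path.

```python
from urllib.parse import urlsplit


def domainless_url(url):
    """Hypothetical helper: return the URL's path with scheme and host removed."""
    path = urlsplit(url).path
    # Treat a bare domain as the site root.
    return path or "/"


class UniqueUrlTracker:
    """Hypothetical standalone tracker that a crawler and a scraper could share."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        """Record the URL and report whether it was unseen until now."""
        key = domainless_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


tracker = UniqueUrlTracker()
first = tracker.is_new("https://example.com/about")
# Same path on a different scheme and host: treated as a duplicate.
second = tracker.is_new("http://www.example.com/about")
```

Because both consumers would call the same helper, the crawler and the scraper could not drift apart on what counts as "the same URL".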

The add() function in MigrateCrawlQueue.php should be updated to use the standard mechanism for adding domainless URLs, and to support the new include_query and include_fragment options.
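A rough sketch of how such an add() could behave follows, again in Python for illustration only. The option names include_query and include_fragment come from this issue; the `CrawlQueue` class, its method signatures, and the default of `False` for both options are assumptions, not the actual MigrateCrawlQueue.php implementation.

```python
from collections import deque
from urllib.parse import urlsplit


def domainless_url(url, include_query=False, include_fragment=False):
    """Hypothetical helper: path only, optionally keeping query and fragment."""
    parts = urlsplit(url)
    result = parts.path or "/"
    if include_query and parts.query:
        result += "?" + parts.query
    if include_fragment and parts.fragment:
        result += "#" + parts.fragment
    return result


class CrawlQueue:
    """Hypothetical queue that only accepts URLs it has not seen before."""

    def __init__(self, include_query=False, include_fragment=False):
        # Defaults are an assumption; the issue does not state them.
        self.include_query = include_query
        self.include_fragment = include_fragment
        self._seen = set()
        self._pending = deque()

    def add(self, url):
        """Queue the URL unless its domainless form was already added."""
        key = domainless_url(url, self.include_query, self.include_fragment)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._pending.append(url)
        return True

    def __len__(self):
        return len(self._pending)


queue = CrawlQueue(include_query=True)
added_page_1 = queue.add("https://example.com/news?page=1")
added_page_2 = queue.add("https://example.com/news?page=2")
# The fragment is ignored here, so this matches the previous entry.
added_with_fragment = queue.add("https://example.com/news?page=2#top")
```

With include_query enabled, paginated URLs stay distinct, while the fragment-only variant is dropped; enabling include_fragment as well would keep it.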

Tests should be written to ensure crawler and scraper are able to determine unique URLs correctly.

Additional context: None

derklempner commented 5 years ago

Duplicate-content URLs are now detected in the latest feature/issue-74-spider-cache branch.