salsadigitalauorg / merlin-framework

Merlin - migration framework
GNU General Public License v3.0

Allow resume (Crawler) #127

Open stooit opened 4 years ago

stooit commented 4 years ago

Description: If a crawl is interrupted, it has to start over from the beginning. The cache makes the restart very fast because already-fetched pages are served from cache, but it would be better if the crawl simply resumed from where it left off.

Proposed solution: Keep a copy of the MigrateCrawlQueue urls and pendingUrls in a lockfile. If the same config is run again, ask the user whether they would like to resume.
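
A rough sketch of how the lockfile idea might look, written in Python purely for illustration (this is not Merlin code). Only the `urls` and `pendingUrls` names come from the proposal above. The function names, the JSON layout, and the choice to key the lockfile on a hash of the config file's contents are all assumptions.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def lockfile_path(config_path: str, lock_dir: str) -> Path:
    """Derive a lockfile name from the config contents (assumed scheme),
    so that re-running the same config finds the same lockfile."""
    digest = hashlib.sha256(Path(config_path).read_bytes()).hexdigest()[:16]
    return Path(lock_dir) / f"crawl-{digest}.lock.json"


def save_queue_state(lock: Path, urls: list, pending_urls: list) -> None:
    """Write the queue state via a temp file and an atomic rename, so an
    interrupt mid-write cannot leave a truncated lockfile behind."""
    lock.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=lock.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"urls": urls, "pendingUrls": pending_urls}, fh)
    os.replace(tmp, lock)


def load_queue_state(lock: Path):
    """Return (urls, pending_urls) from an earlier run, or None if no
    lockfile exists."""
    if not lock.exists():
        return None
    state = json.loads(lock.read_text(encoding="utf-8"))
    return state["urls"], state["pendingUrls"]


def should_resume(lock: Path, ask=input) -> bool:
    """Prompt only when a lockfile exists; anything but y/yes means a
    fresh crawl. `ask` is injectable so the prompt can be scripted."""
    if not lock.exists():
        return False
    answer = ask("A previous crawl was interrupted. Resume? [y/N] ")
    return answer.strip().lower() in ("y", "yes")


def clear_queue_state(lock: Path) -> None:
    """Remove the lockfile once the crawl finishes cleanly."""
    lock.unlink(missing_ok=True)
```

In this sketch the crawler would call `save_queue_state` periodically (for example every N processed URLs, since writing on every URL could be costly on very large sites), check `should_resume` at startup, and call `clear_queue_state` after a successful run. Hashing the config contents means an edited config starts a fresh crawl rather than resuming a stale queue. Whether that is the desired behaviour is an open question.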

Additional context: Restarting is currently moderately painful when crawling very large sites (e.g. millions of pages).