salsadigitalauorg / merlin-framework

Merlin - migration framework
GNU General Public License v3.0

Support for starting path property for Crawler #83

Closed by nickgeorgiou 4 years ago

nickgeorgiou commented 5 years ago

Description: The crawler currently always begins crawling from the root of the domain specified in the domain configuration property. Sometimes it is useful to begin crawling a site from a sub-page or path instead. The crawler would start with that page, so pages linked from it would appear at the top of the list of URLs.

Proposed solution: Provide a configuration property, e.g. starting_path, that lets someone specify the path from which to begin crawling, rather than always starting from the / root page.
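
To make the proposal concrete, here is a minimal sketch in Python of how a starting path could be combined with the configured domain to produce the first URL to crawl. This is not Merlin code: `build_start_url` is a hypothetical helper, and `starting_path` is only the property name suggested above. The default of `/` keeps the current behaviour when the property is not set.

```python
def build_start_url(domain: str, starting_path: str = "/") -> str:
    """Return the first URL to crawl (hypothetical helper, not part of Merlin).

    `domain` mirrors the existing domain property; `starting_path` is the
    proposed property and defaults to the root page.
    """
    # Normalise slashes so "news", "/news" and a trailing "/" on the domain
    # all produce a single separator between domain and path.
    path = "/" + starting_path.lstrip("/")
    return domain.rstrip("/") + path


if __name__ == "__main__":
    print(build_start_url("https://example.com"))
    print(build_start_url("https://example.com/", "news/2019"))
```

Defaulting to the root page means existing configurations would not need to change.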

stooit commented 4 years ago

I actually have a requirement where there are multiple entry points for crawling.

For example, a site that contains micro-sites or orphaned content is not always fully reachable from a single entry point. Can we also factor that into this effort?
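
A rough sketch of the multiple-entry-point idea, in Python and without any network access: the crawl queue is seeded with every entry point instead of only the root. The link graph `SITE`, the function `crawl_order`, and the paths used are all made up for illustration and are not part of Merlin.

```python
from collections import deque


def crawl_order(links, entry_points):
    """Return pages in breadth-first order, starting from every entry point.

    `links` maps a path to the paths it links to (a stand-in for fetching
    and parsing a page); `entry_points` is the hypothetical list of
    starting paths.
    """
    queue = deque()
    seen = set()
    # Seed the queue with all entry points, skipping duplicates.
    for path in entry_points:
        if path not in seen:
            seen.add(path)
            queue.append(path)

    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical site: the micro-site is not linked from the root.
SITE = {
    "/": ["/about"],
    "/about": ["/"],
    "/micro/": ["/micro/contact"],
}

if __name__ == "__main__":
    print(crawl_order(SITE, ["/"]))             # micro-site is never reached
    print(crawl_order(SITE, ["/", "/micro/"]))  # micro-site pages are included
```

With a single entry point the micro-site pages are never discovered; adding `/micro/` as a second entry point brings them into the crawl. A single starting path is then just the case of a one-item list.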