salsadigitalauorg / merlin-framework

Merlin - migration framework
GNU General Public License v3.0

Allow file paths to provide URI lists #99

Closed steveworley closed 4 years ago

steveworley commented 4 years ago

Description

The crawler addition adds the ability to scan a website for paths; the output of this command is written to a new file. The current process expects us to copy the URIs from that file into the content configuration file, which quickly becomes unwieldy. It would be cleaner to reference the file generated by the crawler and have the configuration parser pull the URI list from it.

Proposed solution

If a file is used, it might be a good approach to lazy-load the URIs so we don't need to load the entire file into memory to process it.
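
To illustrate the lazy-loading idea, here is a minimal sketch in Python (not Merlin's actual implementation, and not the language Merlin is written in). It assumes a hypothetical list file with one URI per line, rather than the JSON output mentioned below, so it can be streamed line by line. The names `iter_uris`, `demo`, and `FILE_PREFIX` are invented for this example.

```python
import os
import tempfile
from typing import Iterable, Iterator, List

# Prefix taken from the example config in this issue.
FILE_PREFIX = "file://"


def iter_uris(entries: Iterable[str]) -> Iterator[str]:
    """Yield URIs one at a time, expanding file:// entries lazily."""
    for entry in entries:
        if entry.startswith(FILE_PREFIX):
            path = entry[len(FILE_PREFIX):]
            # Iterating over the file object reads one line at a time,
            # so the whole list is never held in memory at once.
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    uri = line.strip()
                    if uri:  # skip blank lines
                        yield uri
        else:
            # A plain entry is treated as a one-off path.
            yield entry


def demo() -> List[str]:
    """Write a small hypothetical URI list and read it back lazily."""
    with tempfile.TemporaryDirectory() as tmp:
        list_path = os.path.join(tmp, "uri-list.txt")
        with open(list_path, "w", encoding="utf-8") as handle:
            handle.write("/about.html\n\n/contact.html\n")
        entries = [FILE_PREFIX + list_path, "/one-off-path.html"]
        # list() is used only to show the result; a real consumer
        # would iterate over the generator directly.
        return list(iter_uris(entries))
```

Because `iter_uris` is a generator, a consumer can start processing the first URI before the rest of the file has been read. A JSON list would need a streaming parser to get the same benefit.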

Additional context

urls:
  - file://path/to/uri-list.json
  - /one-off-path.html

stooit commented 4 years ago

This is complete. It is now possible by providing urls_file, which points to one or more output files from the crawler.

Docs: https://salsadigitalauorg.github.io/merlin-framework/docs/url-options#urls-in-separate-files