jakopako / goskyr

A configurable command-line web scraper written in Go with auto-configuration capability
GNU General Public License v3.0

Add another type of paginator based on external list of URLs #244

Closed: alucab closed this issue 11 months ago

alucab commented 11 months ago

Use case: I have a site with a nice sitemap.xml file. I can use goskyr to extract a wonderful JSON of all the pages I want to crawl, but I currently have no way to pass this information back to the tool.

I know that I could create a config.yml with all the URLs, but that is not very user-friendly and I suspect that creating thousands of scrapers would kill the system.
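As an illustration of why that approach gets unwieldy, here is a minimal sketch of such a config.yml with one scraper block per URL. The URLs, selectors, and field names are placeholders invented for the example, not taken from the real site.

```yml
# Hypothetical config.yml enumerating every page as its own scraper.
# All URLs, selectors, and field names below are placeholders.
writer:
  type: file
  filepath: output.json
scrapers:
  - name: page-0001
    url: "https://example.com/page/1"
    item: "body"
    fields:
      - name: title
        location:
          selector: "h1"
  - name: page-0002
    url: "https://example.com/page/2"
    item: "body"
    fields:
      - name: title
        location:
          selector: "h1"
  # ... thousands more near-identical blocks, which is the part that does not scale.
```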

As a separate feature, it might also be worth processing the YAML file with a pool of scraper workers, so that running a config with 5000 scrapers remains scalable.

I will experiment, but I'm no Go expert.

alucab commented 11 months ago

I solved it using on_subpage.

```yml
writer:
  type: file
  filepath: sitemap.json
scrapers:
```
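For context, a minimal sketch of what an on_subpage-based config could look like: a single scraper is pointed at an index page, collects the target links into a field of type url, and scrapes the content fields on each linked page via on_subpage. The index URL, selectors, and field names are placeholders and not the author's actual configuration; only the writer block mirrors the snippet above.

```yml
# Hypothetical sketch of the on_subpage approach; selectors and URLs are placeholders.
writer:
  type: file
  filepath: sitemap.json
scrapers:
  - name: site-pages
    # Placeholder index page that links to every page of interest.
    url: "https://example.com/index"
    item: "li.page-entry"
    fields:
      # The link to the subpage; its name is referenced by on_subpage below.
      - name: url
        type: url
        location:
          selector: "a"
      # This field is scraped on the page behind 'url', not on the index page.
      - name: title
        location:
          selector: "h1"
        on_subpage: url
```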