scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

How can I run multiple spiders at once on a server? #196

Closed YPCrumble closed 3 years ago

YPCrumble commented 7 years ago

Thanks for building this great library! I'm using Scrapyd to deploy a host of scrapers that crawl partner websites each day to email out links to their content.

This question has been asked a couple times on StackOverflow:

http://stackoverflow.com/questions/10801093/run-multiple-scrapy-spiders-at-once-using-scrapyd
http://stackoverflow.com/questions/11390888/running-multiple-spiders-using-scrapyd?noredirect=1&lq=1

I'm running my spiders daily using cron. Right now I have to make a separate call to schedule.json for each spider. I'm deploying dozens of spiders so this means that I need to add a new line to my crontab for every new spider I deploy. Ideally I would be able to run schedule.json with a setting to simply run all spiders.

First, is this currently possible? (I would be happy to add documentation to explain the feature more clearly.) I haven't found this capability in the API documentation or in the code.

Second, is this a feature that the maintainers would be interested in considering? My initial thought is to add a schedule_all.json endpoint to the API that would run all spiders in a project concurrently. Please let me know your reactions to this. I didn't see any conversation in either open or closed issues; I apologize in advance if I missed that this has been discussed before.

Thanks again for building this great library!

Digenis commented 7 years ago

... with a setting to simply run all spiders.

Scrapyd would have to internally get all the projects and all the spiders in each project, then run them. You can also do this with a script that uses the listprojects and listspiders webservices and then schedules each spider with the schedule webservice. Doing this in Scrapyd instead of a script wouldn't be significantly more efficient, but you can write a custom webservice if it's a common use case for you.
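
For example, a minimal sketch of such a script (assuming the requests library and a Scrapyd instance listening on http://localhost:6800 with no authentication) could look like this:

# schedule_all.py - schedule every spider of every project through Scrapyd's webservices.
# A sketch only; the base URL and lack of auth are assumptions for illustration.
import requests

BASE = "http://localhost:6800"

projects = requests.get(f"{BASE}/listprojects.json").json()["projects"]
for project in projects:
    spiders = requests.get(f"{BASE}/listspiders.json", params={"project": project}).json()["spiders"]
    for spider in spiders:
        # schedule.json returns the jobid of the newly queued job
        resp = requests.post(f"{BASE}/schedule.json", data={"project": project, "spider": spider}).json()
        print(project, spider, resp.get("jobid"))

A script like this could then be run from a single cron entry instead of one crontab line per spider.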

callumCarsnip commented 5 years ago

Hi, I have a solution that can be used with cron/bash. It requires jq to parse the JSON. Hopefully this is helpful to someone until there is a built-in solution.

SPIDERS=$(curl -s "http://localhost:6800/listspiders.json?project=your_project" | jq -r '.spiders[]'); for i in $SPIDERS; do curl http://localhost:6800/schedule.json -d project=your_project -d spider="$i"; done

And the one below can be used if you need to cancel all of the jobs you just created...

JOBS=$(curl -s "http://localhost:6800/listjobs.json?project=your_project" | jq -r '.running[].id'); for i in $JOBS; do curl http://localhost:6800/cancel.json -d project=your_project -d job="$i"; done

my8100 commented 5 years ago

A Python script would make this much more readable and maintainable.
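
For example, a Python equivalent of the cancel one-liner above (a sketch assuming the requests library and a local Scrapyd on the default port 6800) could be:

# cancel_all.py - cancel every running job of a project; a Python take on the shell one-liner above.
# PROJECT mirrors the your_project placeholder used in the one-liners above.
import requests

BASE = "http://localhost:6800"
PROJECT = "your_project"

running = requests.get(f"{BASE}/listjobs.json", params={"project": PROJECT}).json()["running"]
for job in running:
    resp = requests.post(f"{BASE}/cancel.json", data={"project": PROJECT, "job": job["id"]})
    print(job["id"], resp.json().get("prevstate"))

cancel.json reports the previous state of each job, so the final print simply confirms that the job was running before it was cancelled.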

jpmckinney commented 3 years ago

Closing as this can be done by a script that interacts with the available APIs, as described above. Given limited maintainer capacity, I think it is best to focus on features that have no good alternative implementation outside Scrapyd.