... with a setting to simply run all spiders.
Scrapyd would have to internally get all the projects and all the spiders from each project, then run them. This can also be done with a script that uses the listprojects and listspiders webservices and then schedules each spider with the schedule webservice. Doing this in Scrapyd instead of a script wouldn't be significantly more efficient, but you can write a custom webservice if it's a common use case for you.
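A minimal sketch of such a script, assuming Scrapyd is reachable at http://localhost:6800 and the requests library is installed (both are assumptions; adjust as needed):

# Schedule every spider in every project via Scrapyd's JSON webservices.
import requests

SCRAPYD = "http://localhost:6800"  # assumed Scrapyd URL

projects = requests.get(f"{SCRAPYD}/listprojects.json").json()["projects"]
for project in projects:
    spiders = requests.get(f"{SCRAPYD}/listspiders.json", params={"project": project}).json()["spiders"]
    for spider in spiders:
        resp = requests.post(f"{SCRAPYD}/schedule.json", data={"project": project, "spider": spider}).json()
        print(f"{project}/{spider}: {resp.get('status')} (job {resp.get('jobid')})")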
Hi, I have a solution that can be used with cron/bash. It requires jq in order to parse the JSON. Hopefully this is helpful for someone until there is a built-in solution.
SPIDERS=$(curl -s "http://localhost:6800/listspiders.json?project=your_project" | jq -r '.spiders[]'); for i in $SPIDERS; do curl http://localhost:6800/schedule.json -d project=your_project -d spider=$i; done
And the command below can be used in case you need to easily cancel all the created jobs:
JOBS=$(curl -s "http://localhost:6800/listjobs.json?project=your_project" | jq -r '.running[].id'); for i in $JOBS; do curl http://localhost:6800/cancel.json -d project=your_project -d job=$i; done
A Python script would make it much more readable and maintainable.
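For example, the cancel-all one-liner above might look like this in Python (a sketch with the same assumptions: Scrapyd on localhost:6800, the requests library installed, and your_project as a placeholder project name):

# Cancel every running job for one project, mirroring the shell one-liner above.
import requests

SCRAPYD = "http://localhost:6800"  # assumed Scrapyd URL
PROJECT = "your_project"           # placeholder project name

jobs = requests.get(f"{SCRAPYD}/listjobs.json", params={"project": PROJECT}).json()
for job in jobs["running"]:
    resp = requests.post(f"{SCRAPYD}/cancel.json", data={"project": PROJECT, "job": job["id"]}).json()
    print(f"cancelled {job['id']}: {resp.get('status')}")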
Closing as this can be done by a script that interacts with the available APIs, as described above. Given limited maintainer capacity, I think it is best to focus on features that have no good alternative implementation outside Scrapyd.
Thanks for building this great library! I'm using Scrapyd to deploy a host of scrapers that crawl partner websites each day to email out links to their content.
This question has been asked a couple of times on Stack Overflow:
http://stackoverflow.com/questions/10801093/run-multiple-scrapy-spiders-at-once-using-scrapyd
http://stackoverflow.com/questions/11390888/running-multiple-spiders-using-scrapyd?noredirect=1&lq=1
I'm running my spiders daily using cron. Right now I have to make a separate call to schedule.json for each spider. I'm deploying dozens of spiders, so this means I need to add a new line to my crontab for every new spider I deploy. Ideally I would be able to run schedule.json with a setting to simply run all spiders.

First, is this currently possible? (I would be happy to add documentation to explain the feature more thoroughly.) I haven't seen this ability while reading the API documentation or looking through the code.
Second, is this a feature that maintainers would be interested in considering? My initial thought is to add schedule_all.json to the API, which would run all spiders in a project concurrently. Please let me know your reactions to this. I didn't see any conversation about it in either open or closed issues; I apologize in advance if I missed that this has been discussed before.

Thanks again for building this great library!