scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Can you add a callback mechanism for when a job finishes? #293

Closed ryanrain2016 closed 5 years ago

ryanrain2016 commented 6 years ago

I need to know when a spider finishes. For now I can only poll the job status in a loop until it finishes. I'm asking whether there is a way to pass a callback URL to scrapyd when scheduling a job, so that when the job finishes scrapyd can call that URL to let me know it is done.

zamzus commented 6 years ago

I think you can do what you want in pipelines.py of your Scrapy project. When the spider finishes, it calls the method close_spider(self, spider), so you can add some code in that method to let you know the spider has finished.
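
A minimal sketch of that approach; the callback URL and payload below are assumptions for illustration, not something scrapyd provides:

```python
# pipelines.py -- notify an external service when the spider closes.
# The endpoint below is hypothetical; replace it with your own.
import requests


class NotifyPipeline:
    CALLBACK_URL = "http://localhost:8000/spider-finished"  # hypothetical

    def process_item(self, item, spider):
        # Pass items through untouched; this pipeline only reports shutdown.
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes crawling.
        requests.post(self.CALLBACK_URL, json={"spider": spider.name})
```

The pipeline still has to be enabled via ITEM_PIPELINES in settings.py.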

ryanrain2016 commented 6 years ago

@zamzus Thanks for your reply! This solves the problem to a certain extent. But when two or more jobs for the same spider are scheduled in parallel, pipelines.py has no way to know which job it is running in. I want to build a service that monitors the status of every spider job, and as the number of jobs grows, polling in a loop becomes too expensive.

Digenis commented 5 years ago

@ryanrain2016, when running under scrapyd, spiders' __init__ receives the _jobid keyword argument, and their processes also have the SCRAPY_JOB environment variable set to the job id.
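
A minimal sketch of reading that environment variable from a Scrapy extension so the notification can say which job finished; the extension and callback URL are hypothetical, not part of scrapyd:

```python
# extensions.py -- report the finished job id using the SCRAPY_JOB
# environment variable that scrapyd sets for each spider process.
import os

import requests
from scrapy import signals


class JobFinishedNotifier:
    CALLBACK_URL = "http://localhost:8000/job-finished"  # hypothetical

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        requests.post(self.CALLBACK_URL, json={
            "spider": spider.name,
            "job": os.environ.get("SCRAPY_JOB"),  # job id assigned by scrapyd
            "reason": reason,
        })
```

Enable it with the EXTENSIONS setting in settings.py.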

Another way is to write a custom launcher.

Closing as duplicate of #55 and/or #199.

manashmandal commented 5 years ago

@Digenis spiders' __init__ receives the _job keyword argument, not _jobid. I added the callback mechanism using MongoDB's change stream feature. @ryanrain2016, you can look at this link
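
For completeness, a minimal sketch of keeping the job id from that _job argument; the spider name below is hypothetical:

```python
# myspider.py -- store the job id that scrapyd passes as the _job argument.
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"  # hypothetical spider name

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # scrapyd schedules the run with _job=<job id>; the argument is
        # absent when the spider runs outside scrapyd.
        self.job_id = kwargs.get("_job")
```

A pipeline's close_spider(self, spider) can then send spider.job_id along with its notification.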

I guess it could also be done with Redis streams, but I didn't look into it since I am using MongoDB to store my crawled items. Also, thanks to @Digenis for your comment; it helped me solve the issue :+1:
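
For reference, a rough sketch of the change-stream idea with pymongo; the connection string, database, and collection names are assumptions and not taken from the example linked above:

```python
# watcher.py -- react to newly inserted items via a MongoDB change stream
# (requires a replica set; all names below are hypothetical).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
items = client["scraping"]["items"]

# Block until new documents are inserted, then handle each one.
with items.watch([{"$match": {"operationType": "insert"}}]) as stream:
    for change in stream:
        doc = change["fullDocument"]
        print("new item from job", doc.get("job_id"))
```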