Closed: ryanrain2016 closed this issue 5 years ago
I thought you can do what you want in the pipelines.py of your scrapy project. When the spider finishes, Scrapy calls the pipeline's close_spider(self, spider) method, so you can add some code under this method to let you know the spider has finished.
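For example, a minimal sketch of such a pipeline (the class name is just illustrative):

```python
# pipelines.py -- a minimal sketch of a pipeline that reacts when the spider finishes.
class SpiderFinishedPipeline:
    def process_item(self, item, spider):
        # Pass items through unchanged; this pipeline only cares about shutdown.
        return item

    def close_spider(self, spider):
        # Called by Scrapy once, after the spider has finished crawling.
        spider.logger.info("Spider %s finished", spider.name)
```

The class still has to be enabled under ITEM_PIPELINES in settings.py.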
@zamzus Thanks for your reply! This answer solves the problem to a certain extent. But when two or more jobs for the same spider are scheduled in parallel, the code in pipelines.py has no way to know which job it belongs to. I want to build a service to monitor the status of every spider job, and as the number of jobs grows, polling in a loop is too expensive.
@ryanrain2016, when running under scrapyd, spiders' __init__ receives the _jobid keyword argument, and their processes also have the SCRAPY_JOB environment variable set to the job id.
Another way is to write a custom launcher.
Closing as duplicate of #55 and/or #199.
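If the environment variable is available, the close_spider approach from the first reply could also report exactly which job finished, along these lines (a sketch only; the monitoring endpoint is a placeholder):

```python
# pipelines.py -- sketch of a job-aware finish notification, assuming scrapyd sets the
# SCRAPY_JOB environment variable in the crawl process as described above.
# The monitoring URL is a placeholder.
import os

import requests


class JobStatusPipeline:
    def process_item(self, item, spider):
        # Pass items through unchanged; this pipeline only reports shutdown.
        return item

    def close_spider(self, spider):
        job_id = os.environ.get("SCRAPY_JOB")  # scrapyd job id; None outside scrapyd
        requests.post(
            "http://example.com/job-finished",  # placeholder monitoring endpoint
            json={"spider": spider.name, "job": job_id},
            timeout=10,
        )
```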
@Digenis spiders' __init__ receives the _job keyword argument, not _jobid.
I added the callback mechanism using MongoDB's changeStream feature; @ryanrain2016, you can look at this link. It could also be done with Redis streams, I guess, but I didn't look into that since I am using MongoDB to store my crawled items. Also, thanks to @Digenis for your comment, it helped me solve the issue :+1:
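The general shape of the change-stream idea looks roughly like this (a sketch only: the database, collection, field names, and callback URL are made up, and change streams require MongoDB to run as a replica set):

```python
# Sketch: watching a MongoDB collection for "job finished" documents and firing a callback.
# All names (database "scrapy", collection "job_events", the callback URL) are hypothetical.
import requests
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # change streams need a replica set
events = client["scrapy"]["job_events"]

# Only react to newly inserted documents.
pipeline = [{"$match": {"operationType": "insert"}}]

with events.watch(pipeline) as stream:
    for change in stream:
        doc = change["fullDocument"]
        if doc.get("status") == "finished":
            requests.post(
                "http://example.com/job-finished",  # placeholder callback URL
                json={"job": doc.get("job_id"), "spider": doc.get("spider")},
                timeout=10,
            )
```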
I need to know when the spider finishes. For now I can only use a loop to query the status until it finishes. I asked if there is some way to pass a callback URL to scrapyd when scheduling a job, so that when the job finishes scrapyd can call that URL to let me know it has finished.
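For reference, the polling workaround described above looks roughly like this, using scrapyd's listjobs.json endpoint (the host, project name, and job id are placeholders); a push-style callback would avoid this loop entirely:

```python
# Sketch: polling scrapyd's listjobs.json endpoint until a given job appears as finished.
# Host, project name, and job id below are placeholders.
import time

import requests

SCRAPYD_URL = "http://localhost:6800"
PROJECT = "myproject"


def wait_until_finished(job_id, poll_interval=10):
    while True:
        resp = requests.get(
            f"{SCRAPYD_URL}/listjobs.json", params={"project": PROJECT}, timeout=10
        )
        jobs = resp.json()
        # listjobs.json returns "pending", "running" and "finished" lists of jobs.
        if any(job["id"] == job_id for job in jobs.get("finished", [])):
            return
        time.sleep(poll_interval)


# job_id would be the id returned by schedule.json when the job was scheduled.
wait_until_finished("placeholder-job-id")
```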