scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

How to pause and resume crawling with scrapyd? #483

Closed: Akise closed this issue 1 year ago

Akise commented 1 year ago

I'd like to have persistence in my spider project while using scrapyd, but I don't know exactly how to do it.

jpmckinney commented 1 year ago

Pausing/resuming is a Scrapy feature, so you can do it as documented here: https://docs.scrapy.org/en/latest/topics/jobs.html#jobs-pausing-and-resuming-crawls

It would involve passing the JOBDIR setting when scheduling the crawl, e.g.

curl http://host/schedule.json -d project=myproject -d spider=somespider -d setting=JOBDIR=/path/to/dir
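
To resume later, you would schedule the spider again with the same JOBDIR so Scrapy can pick up the persisted state, e.g. (same placeholder path as above):

curl http://host/schedule.json -d project=myproject -d spider=somespider -d setting=JOBDIR=/path/to/dir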

The cancel endpoint does the same thing as Ctrl-C from Scrapy's documentation (i.e. sending the TERM signal).
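
For example, a cancel request looks like this (a sketch; JOBID is a placeholder for the job ID returned by schedule.json):

curl http://host/cancel.json -d project=myproject -d job=JOBID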

Akise commented 1 year ago

@jpmckinney When using cancel, the crawler exited immediately, and I did not see anything like the following in the log:

2023-05-20 21:28:07 [scrapy.crawler]:328 INFO: Received SIGINT, shutting down gracefully. Send again to force
2023-05-20 21:28:07 [scrapy.core.engine]:406 INFO: Closing spider (shutdown)

When running again, the JOBDIR is recreated, overwriting the previous state.

I don't know where the problem is.

jpmckinney commented 1 year ago

This is a Scrapy issue, not a Scrapyd issue. Scrapyd has no knowledge of Scrapy's JOBDIR feature; it lets Scrapy manage the JOBDIR. Scrapyd is essentially just calling scrapy crawl with the spider settings you provide via -d setting=MYVAR=MYVALUE, as in my example command.
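
For reference, the schedule.json call above is roughly equivalent to running this from the project directory (using the example's spider name and placeholder path):

scrapy crawl somespider -s JOBDIR=/path/to/dir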