scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Add an option to remove pending jobs at startup #347

Closed: my8100 closed this 1 year ago

my8100 commented 5 years ago

remove_pending_jobs = off
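
For context, a hypothetical scrapyd.conf snippet showing where such an option would sit (the [scrapyd] section and dbs_dir are existing scrapyd settings; remove_pending_jobs is only what this PR proposes, shown here switched on):

```ini
[scrapyd]
# existing setting: directory holding the per-project queue databases
dbs_dir = dbs
# proposed option (default off): drop all pending jobs when the daemon starts
remove_pending_jobs = on
```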

codecov[bot] commented 5 years ago

Codecov Report

Merging #347 into master will decrease coverage by 0.2%. The diff coverage is 42.85%.


@@            Coverage Diff             @@
##           master     #347      +/-   ##
==========================================
- Coverage   68.37%   68.16%   -0.21%     
==========================================
  Files          17       17              
  Lines         860      867       +7     
  Branches      104      106       +2     
==========================================
+ Hits          588      591       +3     
- Misses        242      245       +3     
- Partials       30       31       +1
Impacted Files       Coverage Δ
scrapyd/poller.py    77.77% <42.85%> (-8.43%) ↓

Continue to review full report at Codecov.


my8100 commented 5 years ago

@Digenis How about this PR?

Digenis commented 5 years ago

Are you sure this isn't something that can be accomplished by the init/systemd script? E.g. putting the databases in a /run/ directory which is cleaned up before starting. I think deciding to discard scheduled runs is the responsibility of whatever scheduled them.

However, I do see usefulness in this, e.g. when recovering from a period of service unavailability. Instead of accumulating multiple pending jobs with adjacent crawling scopes¹, users may implement the downtime-compensating extension of the crawling scope in the spider itself. The web is messy; you often can't separate the logic of planning the crawl from the logic of performing it.

Do you have other use cases for motivation?

Footnote ¹ Sorry for the terminology, I had to make up some terms that may be unidiomatic. By crawling scope I mean definable limits in a spider's domain (not domain name). E.g. crawling for price comparison: the scope may be limited by product categories. Crawling blogs: the scope may be limited to material published in the last 24 hours, and the adjacent scope for which the spider will compensate is the one from 48 to 24 hours ago.

my8100 commented 5 years ago
  1. remove_pending_jobs() is executed after update_projects() in the QueuePoller class, so there's no need to worry about the dbs_dir (see the sketch after this list).

  2. Some users may just want to remove all pending jobs by restarting Scrapyd.

  3. I think it would be useful if we could provide a command-line interface, e.g. --remove_pending_jobs, --stop_running_jobs, and --list_config. But it seems we would have to write a twistd plugin to add subcommands to the twistd command.
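
A minimal sketch of point 1 under stated assumptions (this is not the PR's actual diff; get_spider_queues and ISpiderQueue.clear() are existing scrapyd pieces, the rest is illustrative):

```python
from scrapyd.utils import get_spider_queues


class QueuePoller(object):
    """Sketch only: the real poller also implements poll() and next()."""

    def __init__(self, config):
        self.config = config
        self.update_projects()                      # build self.queues first,
        if config.get('remove_pending_jobs', 'off') == 'on':
            self.remove_pending_jobs()              # then clearing them is safe

    def update_projects(self):
        # Maps each known project to its SQLite-backed spider queue in dbs_dir.
        self.queues = get_spider_queues(self.config)

    def remove_pending_jobs(self):
        # Empty every project's pending-job queue via ISpiderQueue.clear().
        for queue in self.queues.values():
            queue.clear()
```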

Digenis commented 5 years ago

Perhaps the twisted plugin is unavoidable because of #70.

Again, are you sure you need this? In the current state of scrapyd it takes only a line or two in an init script.

If you want to set a promise that later or customized versions of the poller (which may use something more complex than sqlite files) must provide this method too, you'll want to add it to the poller's interface.
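
For illustration, a hedged sketch of what promising the method on the interface could look like (the real IPoller in scrapyd/interfaces.py declares more than is shown here; the new method name mirrors this PR's proposal):

```python
from zope.interface import Interface


class IPoller(Interface):
    """Abbreviated: existing attributes and methods are omitted."""

    def poll():
        """Check the queues and fire the next pending job, if any."""

    def remove_pending_jobs():
        """Proposed addition: remove every pending job from all project queues."""
```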

jpmckinney commented 1 year ago

Closing, as this is not related to a feature request issue.

PR is missing an update to the IPoller interface.

Also, as Digenis mentions, for the default queue class, you can instead just delete the files in the dbs_dir.
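
For illustration, a hedged sketch of that workaround, assuming the default SQLite queue class and an example dbs_dir path (not a scrapyd default):

```python
import glob
import os

# Example location of scrapyd's dbs_dir; the real value comes from scrapyd.conf.
DBS_DIR = "/var/lib/scrapyd/dbs"

# The default SQLite queue class keeps one <project>.db file per project,
# so deleting these files before startup discards all pending jobs.
for path in glob.glob(os.path.join(DBS_DIR, "*.db")):
    os.remove(path)
```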