scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

feat: Add spiderqueue configuration option #476

Closed · jpmckinney closed 1 year ago

jpmckinney commented 1 year ago

closes #197

See #475 for updating the default spider queue.
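For context, the change boils down to a new scrapyd.conf setting that names the spider queue class to load. A minimal sketch of the intended usage, assuming the option is called spiderqueue (per the PR title) and that scrapyd.spiderqueue.SqliteSpiderQueue remains the default:

    [scrapyd]
    # Dotted path to the queue class to instantiate per project.
    # The option name and default below are assumptions based on this PR.
    spiderqueue = scrapyd.spiderqueue.SqliteSpiderQueue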

codecov[bot] commented 1 year ago

Codecov Report

Merging #476 (040cc67) into master (68cb43c) will decrease coverage by 0.01%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #476      +/-   ##
==========================================
- Coverage   87.31%   87.30%   -0.01%     
==========================================
  Files          41       41              
  Lines        1876     1875       -1     
==========================================
- Hits         1638     1637       -1     
  Misses        238      238              
Flag        Coverage Δ
unittests   87.30% <100.00%> (-0.01%) ↓

Flags with carried forward coverage won't be shown.

Impacted Files                      Coverage Δ
scrapyd/jobstorage.py               100.00% <100.00%> (ø)
scrapyd/spiderqueue.py              95.45% <100.00%> (+0.21%) ↑
scrapyd/tests/test_spiderqueue.py   100.00% <100.00%> (ø)
scrapyd/utils.py                    89.25% <100.00%> (+0.08%) ↑


pspsdev commented 1 year ago

Probably not the correct solution: different types of queues might need more arguments than just the DB path, e.g. redis_host, redis_port, etc. I'm wondering what the quickest and simplest way would be to improve scrapyd performance for those who run high polling rates. Maybe add support for :memory: as the dbpath, so SQLite runs in memory? Currently the code always uses a file-based database: scrapyd/spiderqueue.py correctly defaults to :memory:, but that default can never take effect because the upstream code always passes a file path.
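To illustrate the point with the standard library: a file-backed SQLite queue hits the disk on every poll, while a :memory: database stays in RAM. A minimal sketch (the dbs/project1.db path is hypothetical):

    import sqlite3

    # What the upstream code effectively does today: every project queue
    # gets a file-backed database, so each poll touches the disk.
    on_disk = sqlite3.connect("dbs/project1.db")  # hypothetical path

    # What scrapyd/spiderqueue.py's default would do if it were ever
    # reached: keep the queue entirely in RAM, avoiding disk I/O at
    # high polling rates.
    in_memory = sqlite3.connect(":memory:")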

jpmckinney commented 1 year ago

Ah, good call, we need to move the dbs_dir parsing to the queue's initialization.

For Redis, PostgreSQL, etc., dbs_dir can still be a single string, e.g. a connection string like redis://user:pass@host:port/database or postgresql://user:pass@host:port/database.

Edit: Or, we can pass the full config object to the queues. As-is, it's not out of the question for alternative queues to just read their configuration from the environment, rather than from the config file.
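As a rough sketch of that last idea, a hypothetical RedisSpiderQueue that accepts the config object but reads its connection settings from the environment instead (the class, variable names, and method set here are illustrative, not part of this PR):

    import json
    import os

    import redis  # third-party client, assumed available

    class RedisSpiderQueue:
        """Hypothetical queue configured from the environment."""

        def __init__(self, config, project):
            # config is accepted for interface compatibility but unused;
            # connection details come from the environment instead.
            url = os.environ.get("SCRAPYD_REDIS_URL", "redis://localhost:6379/0")
            self.server = redis.Redis.from_url(url)
            self.key = f"scrapyd:queue:{project}"

        def add(self, name, priority=0.0, **spider_args):
            # Higher priority pops first: store -priority, pop the minimum.
            message = dict(spider_args, name=name)
            self.server.zadd(self.key, {json.dumps(message): -priority})

        def pop(self):
            popped = self.server.zpopmin(self.key)
            if popped:
                return json.loads(popped[0][0])

        def count(self):
            return self.server.zcard(self.key)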

jpmckinney commented 1 year ago

Currently the code always uses a file-based database: scrapyd/spiderqueue.py correctly defaults to :memory:, but that default can never take effect because the upstream code always passes a file path.

I would say that its use of :memory: is incorrect, as it would cause all projects to use the same DB, which is not the intent. The connection string would need to be something like file:project1?mode=memory (see https://www.sqlite.org/inmemorydb.html).

Edit: Nevermind, misread some other documentation:

Every :memory: database is distinct from every other. So, opening two database connections each with the filename ":memory:" will create two independent in-memory databases.
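That behavior is easy to confirm with Python's sqlite3 module; a quick sketch:

    import sqlite3

    a = sqlite3.connect(":memory:")
    b = sqlite3.connect(":memory:")
    a.execute("CREATE TABLE queue (msg TEXT)")

    # Each ":memory:" connection is its own database, so b can't see a's table.
    try:
        b.execute("SELECT * FROM queue")
    except sqlite3.OperationalError as e:
        print(e)  # no such table: queue

    # A named in-memory database (the file:...?mode=memory form) can be
    # shared between connections in the same process via cache=shared.
    c = sqlite3.connect("file:project1?mode=memory&cache=shared", uri=True)
    d = sqlite3.connect("file:project1?mode=memory&cache=shared", uri=True)
    c.execute("CREATE TABLE queue (msg TEXT)")
    d.execute("SELECT * FROM queue")  # succeeds: same shared database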

jpmckinney commented 1 year ago

Okay, custom queues should be more extensible with the latest commit.

pspsdev commented 1 year ago

@jpmckinney this looks good. What else needs to be done to get it merged?