scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Option to queue/ignore repeated schedule #153

Closed · lucaspottersky closed this 2 years ago

lucaspottersky commented 8 years ago

This is a feature request.

I wish Scrapyd would not run two instances of the same spider at the same time. Maybe this could be a configuration option? I can see three behaviours: run the duplicate anyway (the current behaviour), queue it until the running job finishes, or ignore the repeated schedule request.

I'm asking because I'm afraid of concurrency problems, since my spiders write to a file using Feed Exports.

Digenis commented 8 years ago

In practice, you can check with listjobs.json whether the spider is already scheduled, and schedule it only if it is not. You can theoretically run into a race condition, but if your only concern is a Feed Exports file, I'd guess your project is not the kind of project at such risk.
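
A minimal sketch of that check-then-schedule flow against Scrapyd's listjobs.json and schedule.json endpoints; the base URL, project name, helper name and the `requests` dependency are assumptions:

```python
import requests

SCRAPYD = "http://localhost:6800"  # assumed Scrapyd address
PROJECT = "myproject"              # assumed project name

def schedule_if_idle(spider):
    """Schedule `spider` only if it has no pending or running job."""
    # listjobs.json returns the pending, running and finished jobs of a project.
    jobs = requests.get(
        f"{SCRAPYD}/listjobs.json", params={"project": PROJECT}
    ).json()
    active = {job["spider"] for job in jobs["pending"] + jobs["running"]}
    if spider in active:
        # The race mentioned above: another client can schedule the spider
        # between this check and the schedule.json call below.
        return None
    return requests.post(
        f"{SCRAPYD}/schedule.json", data={"project": PROJECT, "spider": spider}
    ).json().get("jobid")
```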

There may be a cleaner solution than adding more options to the config. Perhaps we could define a behaviour for jobid collisions, such as aborting the scheduling; then you could come up with a jobid scheme that reserves a "slot". E.g. if your spider is supposed to crawl every 6 hours, you could use a jobid scheme like %Y-%m-%d Q, where Q is the quarter of the day (1, 2, 3, 4).
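
For illustration, a sketch of such a slot scheme for a 6-hour cadence; the helper is hypothetical, though Scrapyd's schedule.json does accept a jobid parameter (aborting on a jobid collision is what is being proposed here, not existing behaviour):

```python
from datetime import datetime, timezone

def slot_jobid(now=None):
    """Build a jobid that reserves one "slot" per 6-hour quarter of the day."""
    now = now or datetime.now(timezone.utc)
    quarter = now.hour // 6 + 1  # 1..4, one slot per 6-hour window
    return f"{now:%Y-%m-%d}.{quarter}"

# e.g. 14:30 UTC falls in quarter 3 -> "2016-05-12.3"
```

Two schedule requests in the same window then produce the same jobid, so an abort-on-collision behaviour would reject the second one.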

jpmckinney commented 2 years ago

Closing, as there has been no additional interest in this feature request since 2016.

Noting that I think it's better to put this logic outside Scrapyd (using its API). I see way too many desired customizations to the scheduling logic (run the repeat crawl after a given interval, auto-schedule crawls, etc.).

Scrapyd is just a basic API for running `scrapy crawl`. It's not a full-fledged automation server (like Jenkins or similar).

That said, #197, about using a custom queue class, remains open.