In practice, you can check with `listjobs.json` whether the spider is already scheduled, and schedule it only if it isn't. Theoretically you can still hit a race condition, but if your only concern is a feed exports file, your project is probably not the kind of project exposed to that risk.
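For illustration, here is a minimal sketch of that check against Scrapyd's `listjobs.json` and `schedule.json` endpoints. The URL, project name, and spider name are placeholders:

```python
import requests

SCRAPYD_URL = "http://localhost:6800"  # assumed default Scrapyd address
PROJECT = "myproject"                  # hypothetical project name
SPIDER = "myspider"                    # hypothetical spider name

def spider_is_active(project, spider):
    """Return True if the spider already has a pending or running job."""
    resp = requests.get(f"{SCRAPYD_URL}/listjobs.json", params={"project": project})
    resp.raise_for_status()
    jobs = resp.json()
    active = jobs.get("pending", []) + jobs.get("running", [])
    return any(job.get("spider") == spider for job in active)

if not spider_is_active(PROJECT, SPIDER):
    requests.post(f"{SCRAPYD_URL}/schedule.json",
                  data={"project": PROJECT, "spider": SPIDER})
```

Note the race window this leaves open: two clients can both pass the check before either schedules, which is exactly the caveat above.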
There could be a cleaner solution, instead of adding more options to the config. Perhaps we can define a behaviour for jobid collisions, like aborting scheduling, and then you could come up with a jobid scheme that reserves a "slot". E.g. if your spider is supposed to crawl every 6 hours, you would use a jobid scheme like `%Y-%m-%d Q`, where Q is the quarter of the day (1, 2, 3, 4).
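As a rough sketch, such a jobid could be built like this. The helper name is made up, and it assumes UTC timestamps and the `jobid` parameter that `schedule.json` already accepts:

```python
from datetime import datetime, timezone

def slot_jobid(now=None):
    """Build a jobid that reserves one "slot" per 6-hour quarter of the day.

    Two scheduling attempts within the same quarter yield the same jobid,
    so an abort-on-collision policy would reject the second one.
    """
    now = now or datetime.now(timezone.utc)
    quarter = now.hour // 6 + 1  # hours 0-5 -> 1, 6-11 -> 2, 12-17 -> 3, 18-23 -> 4
    return f"{now:%Y-%m-%d} Q{quarter}"

# e.g. pass it along when scheduling:
# requests.post(f"{SCRAPYD_URL}/schedule.json",
#               data={"project": PROJECT, "spider": SPIDER, "jobid": slot_jobid()})
```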
Closing, as there has been no additional interest in this feature request since 2016.
Noting that I think it's better to put this logic outside Scrapyd (using its API). I see far too many desired customizations of the scheduling logic (re-running a crawl after a given interval, auto-scheduling crawls, etc.) for Scrapyd to support them all.
Scrapyd is just a basic API for running `scrapy crawl`. It's not a full-fledged automation server (like Jenkins or similar).
That said, #197 is open, about using a custom queue class.
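For context, a very rough sketch of what such a custom queue might look like. This assumes the queue API of older Scrapyd versions, where `SqliteSpiderQueue.add(name, **spider_args)` stores a message dict with a `name` key and `list()` returns those dicts; treat it as illustration, not a drop-in class:

```python
from scrapyd.spiderqueue import SqliteSpiderQueue

class DedupSpiderQueue(SqliteSpiderQueue):
    """Rejects a spider that is already pending in the queue.

    Note: this only guards against duplicate *pending* jobs; a job that
    is already running has left the queue and is not detected here.
    """

    def add(self, name, **spider_args):
        # list() returns the pending message dicts, each carrying 'name'.
        if any(msg.get("name") == name for msg in self.list()):
            return  # drop the duplicate silently; raising is another option
        super().add(name, **spider_args)
```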
This is a feature request.
I wish Scrapyd would not run two jobs for the same spider at the same time. Maybe this could be a configuration option? I can see 3 behaviours:
I'm asking this because I'm afraid of concurrency problems, since my Spiders write to a file using Feed Exports.