Assigning job priorty while scheduling

scrapy / scrapyd

A service daemon to run Scrapy spiders

https://scrapyd.readthedocs.io/en/stable/

BSD 3-Clause "New" or "Revised" License

2.96k stars 569 forks source link

Assigning job priorty while scheduling #144

Closed DharmeshPandav closed 8 years ago

DharmeshPandav commented 8 years ago

I am having a use case scenario where we are running new job for each task on same domain I am handling this using scrapyd and schedule.json API like

curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider -d page_url=some_spdier_start_url

but there are some cases where, new jobs request needs to be served immediately ( soon after current jobs finishes) [i am using single process per domain so have only one process running for any spider at any given time]

can we use priority queuing in scrapyd while scheduling job ? https://github.com/scrapy/queuelib/blob/master/queuelib/pqueue.py

#something like
 curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider -d priority=1
#priority : {1- highest, 2 - high, 3 - normal, 4 - low, 5 - lowest } , default to 3 - normal

i noticed that this is implemented on Scrapinhub dashboard when we run a spider

DharmeshPandav commented 8 years ago

After looking into code-base and few trial-n-error , I see that priority option is already implemented but argument was not documented

it is implemented in SqlitePriorityQueue class in sqlite.py package and priority argument value was set to 0 by default in add method of SqliteSpiderQueue class in spiderqueue.py package

def add(self, name, **spider_args):
        d = spider_args.copy()
        d['name'] = name
        priority = float(d.pop('priority', 0))
        self.q.put(d, priority)

i think this argument should be included in documentation

should i close this issue and open other for documentation fix ?

Digenis commented 8 years ago

Indeed, it is missing documentation and tests. Also, I'd rather move it to the schedule.json webservice and have it next to the other statements that modify the dictionary by popping "reserved" arguments.

Digenis commented 8 years ago

For 1.1 branch: https://github.com/Digenis/scrapyd/commit/5d068b36924a9df7ec48935407c89b62c4d7394b I'll commit this one so please comment.

For the next release I plan something like https://github.com/Digenis/scrapyd/commit/448284c20bc1b4326d51a7b7489d13e8886ad952 I'll go through a PR for it because it's backwards incompatible and I actually want to change it in an even more backwards incompatible way because passing the spider arguments with **spider_args is an abuse of the keyword arguments feature which leads to colliding argument names (eg try (lambda a, **k: 0)(1, **{'a':2}))

DharmeshPandav commented 8 years ago

yes... for current branch -- mention of priority argument in documentation will suffice for next release - PR will work as you have demonstrated

Digenis commented 8 years ago

~~Merged the doc update to the 1.1 branch.~~ Prepared it in #147. I'll see if there's anything else that should be fixed for 1.1 and make a minor version bump.

Digenis commented 8 years ago

Closing in favour of #147