scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

Prioritize command line option for SPIDER_PARTITION_ID #107

Open lljrsr opened 8 years ago

lljrsr commented 8 years ago

Right now frontera recommends setting the PARTITION_ID in a separate Python settings file for each spider / worker. However, when shipping the project it would be nice to have a command line option to pass either a config file or the PARTITION_ID of the worker/spider. The separate settings file would then no longer be needed, which would make frontera more flexible and easier to deploy and use. Since supporting config files might need big changes in the project, I recommend adding a command line option to choose the PARTITION_ID. Do you think that would be a good addition? Or is there already something available that makes this feature unnecessary?

sibiryakov commented 8 years ago

Hey @lljrsr long time no see ;) Try

$ scrapy crawl [your spider] -s SPIDER_PARTITION_ID=[number]

it should work, because it's possible to configure Frontera using Scrapy settings (see docs for more details.)

lljrsr commented 8 years ago

Hi. Yes I had lots of other stuff to do :) .

scrapy crawl [my spider] -s FRONTERA_SETTINGS=[my project].frontier.spider_settings -s SPIDER_PARTITION_ID=0

..does not work. It throws:

exceptions.TypeError: int() argument must be a string or a number, not 'NoneType'

..when trying to use the partition_id
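The error is what you would expect if the setting never reaches Frontera: looking up a missing key yields None, and int(None) raises a TypeError. A minimal reproduction, assuming a plain dict stands in for Frontera's settings object (the dict, the ZMQ_HOSTNAME value and the read_partition_id helper are hypothetical, not Frontera's actual code):

```python
# Plain dict standing in for Frontera's settings object (an assumption for illustration).
# SPIDER_PARTITION_ID is deliberately missing, as if the -s option never arrived.
frontera_settings = {"ZMQ_HOSTNAME": "127.0.0.1"}


def read_partition_id(settings):
    # Hypothetical helper: dict.get returns None for a missing key,
    # and int(None) raises TypeError.
    return int(settings.get("SPIDER_PARTITION_ID"))


try:
    read_partition_id(frontera_settings)
    failed = False
except TypeError:
    failed = True
```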

sibiryakov commented 8 years ago

The option values aren't being passed. Can you investigate that? Does the same thing happen without -s FRONTERA_SETTINGS?

lljrsr commented 8 years ago

Yes, it throws the same error when I use:

scrapy crawl [my spider] -s SPIDER_PARTITION_ID=0

My guess is that there is a difference between scrapy settings (e.g. SEEDS_SOURCE, FRONTERA_SETTINGS) and frontera settings (e.g. ZMQ_HOSTNAME, SPIDER_PARTITION_ID) and it is not possible to pass frontera settings.

EDIT: I found out that you use the Scrapy settings class, e.g. in this file, whereas in this file, for example, you use a Frontera settings class. (I just added print settings after those lines to compare them.)

sibiryakov commented 8 years ago

It's connected with this https://github.com/scrapinghub/frontera/pull/105

lljrsr commented 8 years ago

With the newest update it now uses the correct SPIDER_PARTITION_ID in messagebus.py. However it still throws an error (but a different one):

...
  File "/home/jrisr/Crawl/debug/frontera/frontera/core/manager.py", line 24, in __init__
    self._backend = self._load_backend(backend, db_worker, strategy_worker)
  File "/home/jrisr/Crawl/debug/frontera/frontera/core/manager.py", line 62, in _load_backend
    return cls.from_manager(self)
  File "/home/jrisr/Crawl/debug/frontera/frontera/contrib/backends/remote/messagebus.py", line 28, in from_manager
    return clas(manager)
  File "/home/jrisr/Crawl/debug/frontera/frontera/contrib/backends/remote/messagebus.py", line 21, in __init__
    self.consumer = spider_feed.consumer(partition_id=self.partition_id)
  File "/home/jrisr/Crawl/debug/frontera/frontera/contrib/messagebus/zeromq/__init__.py", line 179, in consumer
    return Consumer(self.context, self.out_location, partition_id, 'sf', seq_warnings=True, hwm=self.consumer_hwm)
  File "/home/jrisr/Crawl/debug/frontera/frontera/contrib/messagebus/zeromq/__init__.py", line 21, in __init__
    filter = identity + pack('>B', partition_id) if partition_id is not None else identity
struct.error: cannot convert argument to integer

This is probably because self.partition_id of MessageBusBackend is a string when passing it via the command line option.
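A small sketch of that suspicion, assuming the value arrives as the string '0' from the command line. The build_filter helper is hypothetical and only mirrors the failing line in the traceback; it is not the actual fix. It uses bytes for identity so that it also runs on Python 3:

```python
from struct import pack, error as StructError


def build_filter(identity, partition_id):
    # Hypothetical helper mirroring the line from the traceback,
    # with an explicit int() cast so string values are accepted.
    if partition_id is None:
        return identity
    return identity + pack('>B', int(partition_id))


# Without the cast, packing a string fails with struct.error.
try:
    pack('>B', '0')
    raised = False
except StructError:
    raised = True
```

With the cast, build_filter(b'sf', '0') and build_filter(b'sf', 0) give the same result.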

sibiryakov commented 8 years ago

https://github.com/scrapinghub/frontera/pull/110

sibiryakov commented 8 years ago

Should be fine now. Please reopen in case of problems.

lljrsr commented 8 years ago

Passing the settings via command line works now, but the settings.py takes precedence over the command line options, which should not be the case according to scrapy docs. I would like to reopen this issue but either I do not know how or I am not able to :P .

sibiryakov commented 8 years ago

The FRONTERA_SETTINGS module isn't connected with Scrapy in any way, so Frontera's settings have precedence. http://frontera.readthedocs.org/en/latest/topics/scrapy-integration.html#frontier-scrapy-settings

lljrsr commented 8 years ago

Okay. So the FRONTERA_SETTINGS have precedence over all the Scrapy settings (including the command line settings). In my opinion it would be a good idea to mention that in the docs. However, I think this is a strange design. Command line settings usually have the highest priority, since they provide an easy way for a user to try out some values before storing them in a config file.

sibiryakov commented 8 years ago

It is in the docs: http://frontera.readthedocs.org/en/latest/topics/scrapy-integration.html#defining-frontier-settings-via-scrapy-settings

Frontera is designed to be usable independently of Scrapy, so for historical reasons Frontera has its own settings. Scrapy's settings have since evolved, and it's now possible to tell which of them were set on the command line. Therefore prioritizing the command line over FRONTERA_SETTINGS can be done, and I think it makes sense.
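A rough sketch of what that precedence could look like, using plain dicts rather than the real Scrapy or Frontera settings classes. The resolve_settings function and all values here are hypothetical and only illustrate the intended order (defaults, then the FRONTERA_SETTINGS module, then the command line):

```python
def resolve_settings(defaults, frontera_module, cmdline):
    """Merge three setting sources; later sources override earlier ones."""
    merged = dict(defaults)
    merged.update(frontera_module)  # values from the FRONTERA_SETTINGS module
    merged.update(cmdline)          # values passed with -s win over everything
    return merged


# Illustrative values only, not Frontera's real defaults.
defaults = {"SPIDER_PARTITION_ID": 0, "ZMQ_HOSTNAME": "127.0.0.1"}
frontera_module = {"SPIDER_PARTITION_ID": 1}
cmdline = {"SPIDER_PARTITION_ID": "2"}  # -s values arrive as strings

resolved = resolve_settings(defaults, frontera_module, cmdline)
# Cast explicitly, since command line values are strings.
partition_id = int(resolved["SPIDER_PARTITION_ID"])
```

Settings that are not given on the command line fall back to the module and then to the defaults.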