Open lljrsr opened 8 years ago
Hey @lljrsr long time no see ;) Try
$ scrapy crawl [your spider] -s SPIDER_PARTITION_ID=[number]
It should work, because Frontera can be configured through Scrapy settings (see the docs for more details).
Hi. Yes I had lots of other stuff to do :) .
scrapy crawl [my spider] -s FRONTERA_SETTINGS=[my project].frontier.spider_settings -s SPIDER_PARTITION_ID=0
..does not work. It throws:
exceptions.TypeError: int() argument must be a string or a number, not 'NoneType'
..when trying to use the partition_id
The option value isn't being passed through. Can you investigate that? Does the same thing happen without -s FRONTERA_SETTINGS?
Yes, it throws the same error when I use:
scrapy crawl [my spider] -s SPIDER_PARTITION_ID=0
My guess is that there is a difference between Scrapy settings (e.g. SEEDS_SOURCE, FRONTERA_SETTINGS) and Frontera settings (e.g. ZMQ_HOSTNAME, SPIDER_PARTITION_ID), and that it is not possible to pass Frontera settings.
EDIT: I found out that you use the Scrapy settings class, e.g. in this file. In this file, for example, you use a Frontera settings class. (I just added print settings after those lines to compare them.)
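To illustrate the guess above, here is a minimal sketch of how a value that only reaches one settings object can surface as the TypeError elsewhere. The two dicts and the read_partition_id helper are hypothetical stand-ins, not Frontera's or Scrapy's actual classes; written for Python 3.

```python
# Plain dicts stand in for the two settings objects (hypothetical stand-ins).
scrapy_settings = {"FRONTERA_SETTINGS": "myproject.frontier.spider_settings"}
frontera_settings = {"ZMQ_HOSTNAME": "127.0.0.1"}

# Assumption: `-s SPIDER_PARTITION_ID=0` only lands in the Scrapy-side object,
# and command-line values arrive as strings.
scrapy_settings["SPIDER_PARTITION_ID"] = "0"


def read_partition_id(settings):
    """Read the partition id the way a component might: get() then int()."""
    # dict.get() returns None for a missing key, and int(None) raises TypeError.
    return int(settings.get("SPIDER_PARTITION_ID"))


print(read_partition_id(scrapy_settings))  # the override is visible here

try:
    read_partition_id(frontera_settings)
except TypeError as exc:
    # The override never reached this object, so the lookup returned None.
    print("TypeError:", exc)
```

If the component reads from the second object, the command-line value is simply never seen, which would match the NoneType error reported earlier.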
It's connected with this https://github.com/scrapinghub/frontera/pull/105
With the newest update it now uses the correct SPIDER_PARTITION_ID in messagebus.py. However, it still throws an error (but a different one):
...
  File "/home/jrisr/Crawl/debug/frontera/frontera/core/manager.py", line 24, in __init__
    self._backend = self._load_backend(backend, db_worker, strategy_worker)
  File "/home/jrisr/Crawl/debug/frontera/frontera/core/manager.py", line 62, in _load_backend
    return cls.from_manager(self)
  File "/home/jrisr/Crawl/debug/frontera/frontera/contrib/backends/remote/messagebus.py", line 28, in from_manager
    return clas(manager)
  File "/home/jrisr/Crawl/debug/frontera/frontera/contrib/backends/remote/messagebus.py", line 21, in __init__
    self.consumer = spider_feed.consumer(partition_id=self.partition_id)
  File "/home/jrisr/Crawl/debug/frontera/frontera/contrib/messagebus/zeromq/__init__.py", line 179, in consumer
    return Consumer(self.context, self.out_location, partition_id, 'sf', seq_warnings=True, hwm=self.consumer_hwm)
  File "/home/jrisr/Crawl/debug/frontera/frontera/contrib/messagebus/zeromq/__init__.py", line 21, in __init__
    filter = identity + pack('>B', partition_id) if partition_id is not None else identity
struct.error: cannot convert argument to integer
This is probably because self.partition_id of MessageBusBackend is a string when it is passed via the command-line option.
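A small sketch of that failure mode and one possible fix, assuming the value arrives as the string "0". The helper names build_filter and raw_pack_fails are made up for illustration and are not part of Frontera; the code is written for Python 3, so the identity is bytes.

```python
from struct import pack, error as StructError


def raw_pack_fails(value):
    """Return True if pack('>B', value) rejects the value."""
    try:
        pack('>B', value)
    except StructError:
        return True
    return False


def build_filter(identity, partition_id):
    """Same shape as the failing expression, but coerces the id to int first."""
    if partition_id is None:
        return identity
    # int() accepts both 0 and "0", so a command-line string no longer breaks pack().
    return identity + pack('>B', int(partition_id))


print(raw_pack_fails("0"))        # a string is rejected by the 'B' format
print(raw_pack_fails(0))          # an int is accepted
print(build_filter(b"sf", "0"))   # coerced value packs into a single byte
```

Coercing where the setting is read, rather than at the pack() call, would likely be cleaner, since every consumer of partition_id would then see an int.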
Should be fine now. Please reopen in case of problems.
Passing the settings via the command line works now, but settings.py takes precedence over the command-line options, which should not be the case according to the Scrapy docs.
I would like to reopen this issue but either I do not know how or I am not able to :P .
The FRONTERA_SETTINGS module isn't connected with Scrapy in any way, so Frontera's settings take precedence.
http://frontera.readthedocs.org/en/latest/topics/scrapy-integration.html#frontier-scrapy-settings
Okay. So the FRONTERA_SETTINGS have precedence over all the Scrapy settings (including the command-line settings). In my opinion it would be a good idea to mention that in the docs.
However, I think this is a strange design. Command-line settings usually have the highest priority, since they provide an easy way for a user to try out some values before storing them in a config file.
It is in the docs: http://frontera.readthedocs.org/en/latest/topics/scrapy-integration.html#defining-frontier-settings-via-scrapy-settings
Frontera is designed to be usable independently of Scrapy, so for historical reasons Frontera has its own settings.
Scrapy's settings have evolved since then, and it is now possible to tell which of them were set on the command line. Prioritizing the command line over FRONTERA_SETTINGS can therefore be done, and I think it makes sense.
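One way such a precedence rule could look, as a rough sketch only. The PrioritizedSettings class and the priority names and numbers below are hypothetical and are not Frontera's or Scrapy's API; the idea is simply that each value remembers where it came from, and a lower-priority source cannot overwrite a higher one.

```python
# Hypothetical priority levels: a higher number wins.
PRIORITIES = {"default": 0, "frontera_module": 20, "cmdline": 40}


class PrioritizedSettings:
    """Stores each setting together with the priority of its source."""

    def __init__(self):
        self._values = {}  # name -> (value, priority level)

    def set(self, name, value, priority):
        level = PRIORITIES[priority]
        current = self._values.get(name)
        # Only overwrite when the new source ranks at least as high.
        if current is None or level >= current[1]:
            self._values[name] = (value, level)

    def get(self, name, default=None):
        entry = self._values.get(name)
        return default if entry is None else entry[0]


settings = PrioritizedSettings()
settings.set("SPIDER_PARTITION_ID", 0, "default")
settings.set("SPIDER_PARTITION_ID", "1", "cmdline")
# Loading the FRONTERA_SETTINGS module afterwards does not clobber the
# command-line value, because its priority is lower.
settings.set("SPIDER_PARTITION_ID", 2, "frontera_module")

print(settings.get("SPIDER_PARTITION_ID"))
```

With this ordering the load order of the sources stops mattering, which is the property the command-line override needs.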
Right now Frontera recommends setting the PARTITION_ID in a separate Python settings file for each spider / worker. However, when shipping the project it would be nice to have a command-line option to pass either a config file or the PARTITION_ID of the worker / spider. The separate settings files would then no longer be needed, which would make Frontera more flexible and easier to deploy and use. Since supporting config files might require big changes in the project, I recommend adding a command-line option to choose the PARTITION_ID. Do you think that would be a good addition? Or is there already something available that makes this feature unnecessary?