scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

Linking custom strategy #402

Closed b0tf4th3r closed 4 years ago

b0tf4th3r commented 4 years ago

Hi to everyone.

BUG DESCRIPTION

I'm trying to use a custom strategy in a cluster setup, but Frontera still starts with the default 'frontera.strategy.basic.BasicCrawlingStrategy' strategy. I think there is a problem with linking a custom strategy in v0.8.1.

SETTINGS & CONF

docker (kafka, zookeeper, hbase):

    version: "2"
    services:
      zookeeper:
        image: wurstmeister/zookeeper
        tmpfs: "/datalog"
        ports:

crawlFrontier/sworker.py:

    from __future__ import absolute_import
    from .worker import *

    CRAWLING_STRATEGY = 'frontera.strategy.test.TestStrategy'

Yes, the strategy is in the frontera package. Placing the strategy locally (in the project folder: projectFolder/crawlFrontier/strategies/test.py) doesn't work either.

crawlFrontier/worker.py:

    from __future__ import absolute_import
    from .common import *

    BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'
    HBASE_DROP_ALL_TABLES = True
    HBASE_THRIFT_PORT = 9090
    HBASE_THRIFT_HOST = 'localhost'
    HBASE_METADATA_TABLE = 'metadata'
    HBASE_QUEUE_TABLE = 'queue'
    MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'

crawlFrontier/common.py:

    from __future__ import absolute_import
    from .common import *

    BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'
    HBASE_DROP_ALL_TABLES = True
    HBASE_THRIFT_PORT = 9090
    HBASE_THRIFT_HOST = 'localhost'
    HBASE_METADATA_TABLE = 'metadata'
    HBASE_QUEUE_TABLE = 'queue'
    MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'

HOW I START FRONTERA

    python -m frontera.worker.db --config crawlFrontier.dbworker --no-incoming --partitions 0
    python -m frontera.worker.db --no-batches --config crawlFrontier.dbworker
    python -m frontera.worker.strategy --config crawlFrontier.sworker --partition-id 0
    scrapy crawl general -L INFO -s SPIDER_PARTITION_ID=0
    python -m frontera.utils.add_seeds --config crawlFrontier.sworker --seeds-file seeds.txt

b0tf4th3r commented 4 years ago

I fixed it. The problem was the setting name. The documentation shows "CRAWLING_STRATEGY = '' # path to the crawling strategy class", but the setting should be named STRATEGY.
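
For anyone hitting the same thing, here is a minimal sketch of why a misnamed setting can fail silently. It assumes the loader overlays upper-case names from the config module on top of defaults and that the worker only ever looks up STRATEGY. This is an illustration, not Frontera's actual loading code: `DEFAULTS`, `load_settings`, `unknown_names` and `make_module` are hypothetical names made up for the example.

```python
import types

# Hypothetical stand-in for a library's default settings.
# Only the STRATEGY default mentioned in this issue is included.
DEFAULTS = {
    "STRATEGY": "frontera.strategy.basic.BasicCrawlingStrategy",
}


def load_settings(module, defaults):
    """Overlay the upper-case attributes of `module` on top of `defaults`."""
    settings = dict(defaults)
    for name in dir(module):
        if name.isupper():
            settings[name] = getattr(module, name)
    return settings


def unknown_names(module, defaults):
    """Return upper-case names in `module` that no default defines."""
    return sorted(
        name for name in dir(module)
        if name.isupper() and name not in defaults
    )


def make_module(name, **attrs):
    """Build an in-memory config module so the example is self-contained."""
    module = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(module, key, value)
    return module


# Config as written in the original post: the name is not one the loader reads.
wrong = make_module(
    "sworker_wrong",
    CRAWLING_STRATEGY="frontera.strategy.test.TestStrategy",
)
# Config after the fix: the name matches what is looked up.
right = make_module(
    "sworker_right",
    STRATEGY="frontera.strategy.test.TestStrategy",
)

wrong_settings = load_settings(wrong, DEFAULTS)
right_settings = load_settings(right, DEFAULTS)

print(wrong_settings["STRATEGY"])       # the default is still in effect
print(right_settings["STRATEGY"])       # the custom strategy path
print(unknown_names(wrong, DEFAULTS))   # names that nothing will read
```

With the misnamed setting the lookup falls back to the default without any error, which matches the symptom described above. A check like `unknown_names` would flag the typo at startup.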