scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

ModuleNotFoundError: No module named 'frontera.contrib.scrapy.middlewares.seeds' #371


ghost commented 5 years ago

@sibiryakov Hi, thanks for your suggestion about Kafka. I have installed it on my machine; I intend to build a Kafka + HBase crawler.

I have a few questions. First, when I run these commands:

python -m frontera.utils.add_seeds --config tutorial.config.dbw --seeds-file seeds.txt
scrapy crawl tutorial -L INFO -s SPIDER_PARTITION_ID=0

I get this error (see screenshot):

ModuleNotFoundError: No module named 'frontera.contrib.scrapy.middlewares.seeds'

After I removed it, I can run Scrapy, but 0 pages are crawled (see screenshot):

SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 999,
    # removed: 'frontera.contrib.scrapy.middlewares.seeds.file.FileSeedLoader': 1,
}

Besides, my Kafka didn't consume any messages (see screenshot).

All my configuration follows the cluster setup guide in the documentation.

As for the Kafka problems: after I added the line MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus' and removed 'frontera.contrib.scrapy.middlewares.seeds.file.FileSeedLoader': 1, I got this problem when I started the DB worker, strategy worker and crawler (see screenshot).

My config, common.py:

from __future__ import absolute_import
from frontera.settings.default_settings import MIDDLEWARES

MAX_NEXT_REQUESTS = 512
SPIDER_FEED_PARTITIONS = 2 # number of spider processes
SPIDER_LOG_PARTITIONS = 2 # worker instances
MIDDLEWARES.extend([
    'frontera.contrib.middlewares.domain.DomainMiddleware',
    'frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware'
])

QUEUE_HOSTNAME_PARTITIONING = True
KAFKA_LOCATION = 'localhost:9092' 
URL_FINGERPRINT_FUNCTION='frontera.utils.fingerprint.hostname_local_fingerprint'
MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'
SPIDER_LOG_TOPIC = 'frontier-done'
SPIDER_FEED_TOPIC = 'frontier-todo'
SCORING_TOPIC = 'frontier-score'

dbw.py

from __future__ import absolute_import
from .worker import *
LOGGING_CONFIG='logging-db.conf' 

spider.py

from __future__ import absolute_import
from .common import *
BACKEND = 'frontera.contrib.backends.remote.messagebus.MessageBusBackend'
KAFKA_GET_TIMEOUT = 0.5
LOCAL_MODE = False  # by default Frontera is prepared for single process mode

sw.py

from __future__ import absolute_import
from .worker import *
CRAWLING_STRATEGY = 'frontera.strategy.basic.BasicCrawlingStrategy' # path to the crawling strategy class
LOGGING_CONFIG='logging-sw.conf' # if needed

worker.py

from __future__ import absolute_import
from .common import *
BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'
HBASE_DROP_ALL_TABLES = True
MAX_NEXT_REQUESTS = 2048
NEW_BATCH_DELAY = 3.0
HBASE_THRIFT_HOST = 'localhost' # HBase Thrift server host and port
HBASE_THRIFT_PORT = 9090

How I create the Kafka topics:

kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic frontier-done
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic frontier-todo
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 2 --topic frontier-score
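For reference, the same topics can also be created from Python with kafka-python's admin client. This is just a sketch, assuming a local broker at localhost:9092:

    from kafka.admin import KafkaAdminClient, NewTopic

    # Connect to the local broker (adjust bootstrap_servers to your setup).
    admin = KafkaAdminClient(bootstrap_servers='localhost:9092')

    # Two partitions per topic, matching SPIDER_FEED_PARTITIONS and
    # SPIDER_LOG_PARTITIONS in common.py above.
    topics = [NewTopic(name=name, num_partitions=2, replication_factor=1)
              for name in ('frontier-done', 'frontier-todo', 'frontier-score')]
    admin.create_topics(topics)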

I set the partition count to 2 in common.py:

SPIDER_FEED_PARTITIONS = 2 # number of spider processes
SPIDER_LOG_PARTITIONS = 2 # worker instances

How I run the Kafka console consumers:

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic frontier-done --from-beginning
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic frontier-todo --from-beginning
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic frontier-score --from-beginning

Versions of tools:

frontera 0.8.1
Scrapy 1.6.0
Python 3.7.3
Kafka 2.2.1

I think maybe the docs weren't updated for v0.8.1 and still describe v0.8.0.1. Should I downgrade Frontera to the stable version, v0.8? I would rather use the latest version, though.

Thanks in advance!

Gallaecio commented 5 years ago

Please use Stack Overflow to ask this type of question. See also https://stackoverflow.com/help/mcve and https://stackoverflow.com/help/how-to-ask

ghost commented 5 years ago

@Gallaecio I thought about asking this question on Stack Overflow, but it is less responsive than here. I believe there are bugs involved. Have you read my entire problem?

Also, I checked all the previous issues; @sibiryakov is very responsive in solving problems, which is why I am asking here.

I will try asking on Stack Overflow...


I have posted the question on Stack Overflow: https://stackoverflow.com/questions/56493245/modulenotfounderror-no-module-named-frontera-contrib-scrapy-middlewares-seeds

Sorry, I don't have enough reputation to post images on Stack Overflow, so I used imgur.com instead. I hope I can get an answer soon.

ghost commented 5 years ago

@sibiryakov I found a solution for this error:

  File "/home/liho/anaconda3/lib/python3.7/site-packages/frontera/contrib/messagebus/kafkabus.py", line 60, in __init__
    self._partitions = [TopicPartition(self._topic, pid) for pid in self._consumer.partitions_for_topic(self._topic)]
TypeError: 'NoneType' object is not iterable

You should add this line

            self._consumer.topics()

before

            self._partitions = [TopicPartition(self._topic, pid) for pid in self._consumer.partitions_for_topic(self._topic)]

It seems that partitions_for_topic does not request a metadata refresh, whereas topics does. No clue why this worked in kafka-python 1.4.4, as it seems the two functions have not changed. Maybe in 1.4.4 the metadata was always refreshed right away when the consumer was created?
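To illustrate the workaround outside Frontera, here is a minimal standalone sketch with kafka-python; the broker address and topic come from the config above, and the group id is arbitrary:

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                             group_id='frontier-test')
    topic = 'frontier-todo'

    # topics() forces a cluster metadata refresh, so the following
    # partitions_for_topic() call no longer returns None.
    consumer.topics()

    partitions = [TopicPartition(topic, pid)
                  for pid in consumer.partitions_for_topic(topic)]
    consumer.assign(partitions)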

Making partitions_for_topic call the same code as topics before returning the partitions also seems to solve the problem.

Have a look; they have been fixing this problem recently:

https://github.com/dpkp/kafka-python/issues/1789
https://github.com/dpkp/kafka-python/pull/1781
https://github.com/dpkp/kafka-python/issues/1774
https://github.com/Yelp/kafka-utils/pull/216/commits/607a5770b45d7abf41a5351c6575582e78064195

ghost commented 5 years ago

@sibiryakov After I successfully start the cluster:

python -m frontera.worker.db --config tutorial.config.dbw --no-incoming --partitions 0 1
python -m frontera.worker.strategy --config tutorial.config.sw --partition-id 0

When I inject the seeds file with the command below,

python -m frontera.utils.add_seeds --config tutorial.config.sw --seeds-file seeds.txt

in the meantime I get this error in the DB worker terminal (see screenshot), but after the seeds are injected it's gone...

scrapy crawl tutorial -L INFO -s SPIDER_PARTITION_ID=1

But I still get 0 pages crawled... (see screenshot)

Please help me when you are free. Thanks in advance!

sibiryakov commented 5 years ago

Hi @liho00, your seeds weren't injected because the strategy worker was unable to create the table crawler:queue. Check that it can connect to the HBase Thrift server and that the namespace crawler exists.
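A quick way to check both is a couple of lines of happybase (the client used by Frontera's HBase backend); host and port here match HBASE_THRIFT_HOST and HBASE_THRIFT_PORT from worker.py above:

    import happybase

    # Connect to the HBase Thrift server (autoconnects by default).
    connection = happybase.Connection(host='localhost', port=9090)

    # Tables in a non-default namespace are listed as b'crawler:queue', etc.
    print(connection.tables())

If the crawler namespace is missing, it can be created in the HBase shell with create_namespace 'crawler'.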

ghost commented 5 years ago

@sibiryakov Hi, I am sure I created the namespace crawler before, and I am also sure the queue table was created... I need to clarify that I'm using Frontera v0.8.1, since 'frontera.contrib.scrapy.middlewares.seeds' was removed in this version.


After trying again, the error still shows up after entering this command:

python -m frontera.utils.add_seeds --config tutorial.config.sw --seeds-file seeds.txt

DB worker terminal (see screenshot).

But after a few seconds it says the seeds were injected?

Seeds terminal (see screenshot).

I am still getting 0 pages crawled.

Besides that, can you tell me how to inject the seeds? If this module is not needed,

ModuleNotFoundError: No module named 'frontera.contrib.scrapy.middlewares.seeds'

should I inject the seeds into my strategy worker instead?

Lastly, I cannot force-close my crawler; it's trapped in an endless loop (see screenshot).

My Kafka, ZooKeeper, HBase and Hadoop are all started (see screenshots).

ghost commented 5 years ago

Solved by downgrading kafka-python to v1.4.4.

Gallaecio commented 5 years ago

If that’s the only fix, then we need to either update setup.py accordingly or add support for later versions of kafka-python.
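For illustration, the pin could look roughly like this in setup.py. This is only a sketch; the actual layout of Frontera's setup.py and its extras may differ:

    from setuptools import setup, find_packages

    setup(
        name='frontera',
        packages=find_packages(),
        extras_require={
            # Pin to the last version known to work until newer
            # kafka-python releases are supported.
            'kafka': ['kafka-python==1.4.4'],
        },
    )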

sibiryakov commented 5 years ago

@Gallaecio it should be a tiny PR https://github.com/scrapinghub/frontera/issues/371#issuecomment-500197551

ghost commented 5 years ago

Besides that, I cannot force-close the spiders; they're trapped in an endless loop of the [kafka client] warning "Unable to send to wakeup socket!" when using kafka-python v1.4.5 and v1.4.6 (latest).

kafka/client_async.py

            except socket.error:
                log.warning('Unable to send to wakeup socket!')

https://github.com/dpkp/kafka-python/issues/1837
https://github.com/dpkp/kafka-python/issues/1842


psdon commented 3 years ago

I also get the same problem. How can we solve this?

yenicelik commented 3 years ago

Getting the same issue here