scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

When I used the JSON codec, something went wrong #297

Closed EchoShoot closed 12 months ago

EchoShoot commented 6 years ago

2017-10-07 20:33:46 [kafka.coordinator] INFO: Discovered coordinator 0 for group fetchers-spider-feed
2017-10-07 20:33:51 [messagebus-backend] INFO: Consuming from partition id 0
2017-10-07 20:33:51 [manager] INFO: Frontier Manager Started!
2017-10-07 20:33:51 [manager] INFO: --------------------------------------------------------------------------------
2017-10-07 20:33:51 [frontera.contrib.scrapy.schedulers.FronteraScheduler] INFO: Starting frontier
2017-10-07 20:33:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Jumping into debugger for post-mortem of exception 'value must be bytes':

/usr/local/lib/python3.6/site-packages/kafka/protocol/message.py(42)__init__()
-> assert value is None or isinstance(value, bytes), 'value must be bytes'
(Pdb) value
'["dict", [[["other", "type"], ["other", "offset"]], [["other", "partition_id"], ["other", 0]], [["other", "offset"], ["other", 0]]]]'
(Pdb)

The problem occurs when I set MESSAGE_BUS_CODEC = 'frontera.contrib.backends.remote.codecs.json'.

sibiryakov commented 6 years ago

Hi @EchoShoot, could you provide a stack trace of the exception?

anjackson commented 6 years ago

I'm seeing the same issue:

INFO:strategy-worker:Seeds addition started from url file:///Users/andy/Documents/workspace/huntsman/huntsman/seeds.txt
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/worker/strategy.py", line 391, in <module>
    worker.run(seeds_url)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/worker/stats.py", line 45, in run
    super(StatsExportMixin, self).run(*args, **kwargs)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/worker/strategy.py", line 258, in run
    self.add_seeds(seeds_url)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/worker/strategy.py", line 224, in add_seeds
    strategy.read_seeds(fh)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/strategy/basic.py", line 10, in read_seeds
    self.schedule(r)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/strategy/__init__.py", line 122, in schedule
    self._scheduled_stream.send(request, score, dont_queue)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/core/manager.py", line 790, in send
    self._producer.send(None, encoded)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/frontera/contrib/messagebus/kafkabus.py", line 103, in send
    self._producer.send(self._topic, value=msg)
  File "/Users/andy/Documents/workspace/huntsman/vev/lib/python3.6/site-packages/kafka/producer/kafka.py", line 552, in send
    assert type(value_bytes) in (bytes, bytearray, memoryview, type(None))
AssertionError

The Kafka library appears to be expecting bytes but the JSON codec emits a str.
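
The mismatch can be reproduced without Kafka at all. The following is a minimal sketch, not Frontera's actual codec: plain json.dumps stands in for the JSON codec, the message dict is illustrative, and to_bytes is a hypothetical helper. It shows that JSON serialization yields a str on Python 3, and that encoding it gives the bytes the producer's assertion expects.

```python
import json


def to_bytes(payload, encoding="utf-8"):
    """Hypothetical helper: return payload as bytes, encoding it if it is a str."""
    if isinstance(payload, str):
        return payload.encode(encoding)
    return payload


# Illustrative message; Frontera's real codec uses its own nested layout.
message = {"type": "offset", "partition_id": 0, "offset": 0}

encoded = json.dumps(message)   # a str on Python 3, which the assertion rejects
wire_value = to_bytes(encoded)  # bytes, which the assertion accepts

print(type(encoded).__name__, type(wire_value).__name__)
```

Values that are already bytes pass through to_bytes unchanged, so the helper is safe to apply regardless of which codec produced the message.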

amitsing89 commented 5 years ago

It is a version issue. In Python 3 you can encode the string to UTF-8 before sending it.

For example: result = producer.send('topic-sentiments', objString.encode('utf-8'))

MuqadderIqbal commented 5 years ago

I got my sample code to work using the encode('utf-8') option suggested by @amitsing89, but is this problem ever going to be fixed within the package itself? Or has Python 3 changed how strings are handled by default, so that this package (and all others that use it, for example kafka-python) will force users to UTF-8 encode every single message sent to Kafka from an application?
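
On the question of encoding every message by hand: when kafka-python is used directly, its KafkaProducer accepts a value_serializer callable, so the encoding can be declared once at construction time. The sketch below assumes direct kafka-python use rather than Frontera's message bus, which builds its own producer. The broker address, the topic name, and the json_value_serializer function are placeholders for illustration.

```python
import json


def json_value_serializer(value):
    """Serialize a Python object to UTF-8 encoded JSON bytes."""
    return json.dumps(value).encode("utf-8")


# With kafka-python installed and a broker reachable, the serializer could be
# wired in once (address and topic below are placeholders):
#
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092",
#                            value_serializer=json_value_serializer)
#   producer.send("topic-sentiments", {"text": "hello"})

payload = json_value_serializer({"text": "hello"})
print(payload)
```

The underlying cause is that a Python 2 str is already a byte string, while a Python 3 str is Unicode text, so the conversion to bytes has to happen explicitly somewhere before the value reaches the producer.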