scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License
1.29k stars 215 forks

Decode spider logs self.body return none error #374

Closed: ghost closed this issue 5 years ago

ghost commented 5 years ago
from kafka import KafkaConsumer
from frontera.contrib.backends.remote.codecs.msgpack import Encoder as MsgPackEncoder, Decoder as MsgPackDecoder
from frontera.core.models import Request, Response

dec = MsgPackDecoder(Request, Response)

consumer = KafkaConsumer(
   'frontier-done',
   bootstrap_servers=['localhost:9092'],
   auto_offset_reset='smallest',
   value_deserializer=lambda x: dec.decode(x))

for message in consumer:
   message = message.value
   print('{}'.format(message))
   print('\n\n\n')

Running it fails with:
Traceback (most recent call last):
  File "temp.py", line 25, in <module>
    print('{}'.format(message))
  File "/opt/anaconda3/lib/python3.7/site-packages/frontera/core/models.py", line 168, in __str__
    str(self.body[:20]), str(self.headers))
TypeError: 'NoneType' object is not subscriptable

I have modified frontera/core/models.py as follows:

def __str__(self):
    if self.body is None:
        self._body = ''
    return "<%s at 0x%0x %s %s meta=%s body=%s... headers=%s>" % (type(self).__name__,
                                                                  id(self), self.status_code,
                                                                  self.url, str(self.meta),
                                                                  str(self.body[:20]), str(self.headers))

I did this because self.body is sometimes None, but I am not sure this is the best way to solve the problem.

sibiryakov commented 5 years ago

You're overwriting body with an empty string, which is not the same as None. So maybe a better way is to create a local variable body = None if body is not None else body[:20] and use it in the final result.
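
To make the side effect concrete, here is a minimal sketch. It assumes a hypothetical stand-in class (FakeResponse is not part of frontera; it only mimics a body property backed by _body) and applies the same "overwrite with empty string" patch inside __str__:

```python
class FakeResponse:
    """Hypothetical stand-in for a response object; NOT frontera's class."""

    def __init__(self, body=None):
        self._body = body

    @property
    def body(self):
        return self._body

    def __str__(self):
        # Same idea as the patch in the issue: state is mutated while formatting.
        if self.body is None:
            self._body = ''
        return "<FakeResponse body=%s...>" % str(self.body[:20])


resp = FakeResponse(body=None)
before = resp.body   # None: no body was stored
text = str(resp)     # formatting succeeds, but changes the object
after = resp.body    # now an empty string, no longer None
print(repr(before), repr(after))
```

After the object has been printed once, code that checks for a missing body with "is None" would no longer see it as missing, which is why a local variable is the safer choice.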

ghost commented 5 years ago

@sibiryakov I think this is a bug, what do you think?

The reason I overwrite the body with an empty string is that it should be an empty string.

https://github.com/scrapinghub/frontera/blob/5762a2658ee1d8c57a8edb3aa0cf4ac8df1b7a4e/frontera/core/models.py#L100

> You're overwriting body with an empty string, which is not the same as None. So maybe a better way is to create a local variable body = None if body is not None else body[:20] and use it in the final result.

It cannot be self._body = None if self.body is not None else self.body[:20], because the branches are swapped: when the body is not None (it contains a string), the string is replaced with None, and when the body is None, the else branch evaluates None[:20], which raises the same error:

Traceback (most recent call last):
  File "temp.py", line 44, in <module>
    print(message)
  File "/opt/anaconda3/lib/python3.7/site-packages/frontera/core/models.py", line 168, in __str__
    self._body = None if self.body is not None else self.body[:20]
TypeError: 'NoneType' object is not subscriptable
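
The behaviour of that expression can be checked in isolation. This is a small sketch; preview_inverted and try_preview are hypothetical helper names used only for illustration, not frontera functions:

```python
def preview_inverted(body):
    # Conditional exactly as quoted in the thread: the branches are swapped.
    return None if body is not None else body[:20]


def try_preview(body):
    # Return the result, or a description of the TypeError if slicing fails.
    try:
        return preview_inverted(body)
    except TypeError as exc:
        return "TypeError: %s" % exc


with_body = try_preview("some response body")  # a real body is discarded
without_body = try_preview(None)               # None gets sliced and fails
print(repr(with_body))
print(without_body)
```

A body that exists is turned into None, and a body that is None is the only case that reaches the slice, so the expression fails in exactly the situation it was meant to handle.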

I think it should be str(self.body[:20]) if self.body is not None else None, and I have made a PR: https://github.com/scrapinghub/frontera/pull/375/commits/13efd27cf8bbd7e2d1fdf24b05ba4524df11fe57
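
As a rough sketch of how that expression behaves inside __str__ (PreviewResponse is a hypothetical stand-in with a simplified format string, not the actual frontera class or the exact code in the PR):

```python
class PreviewResponse:
    """Hypothetical stand-in; only illustrates the None-safe preview."""

    def __init__(self, url, status_code=200, body=None):
        self.url = url
        self.status_code = status_code
        self.body = body

    def __str__(self):
        # Compute the preview in a local variable; self.body is never modified.
        body_preview = str(self.body[:20]) if self.body is not None else None
        return "<%s %s %s body=%s...>" % (type(self).__name__,
                                          self.status_code, self.url,
                                          body_preview)


empty = PreviewResponse("http://example.com")
full = PreviewResponse("http://example.com", body="a" * 30)
print(empty)   # body shown as None, no exception
print(full)    # body truncated to the first 20 characters
```

The object is left untouched by printing, so a missing body stays None for any later checks.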

Gallaecio commented 5 years ago

Fixed by #375