Closed opensemanticsearch closed 4 years ago
It seems response.headers as returned by Scrapy (<class 'scrapy.http.headers.Headers'>) cannot be serialized automatically for the Celery / RabbitMQ message queue, since it now (or sometimes?) uses a different format:
Example of response.headers:
{b'Date': [b'Tue, 19 Nov 2019 11:01:54 GMT'], b'Server': [b'Apache/2.4.38 (Debian)'], b'Vary': [b'Accept-Encoding'], b'Content-Type': [b'text/html;charset=UTF-8']}
Error:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/kombu/serialization.py", line 50, in _reraise_errors
yield
File "/usr/lib/python3/dist-packages/kombu/serialization.py", line 221, in dumps
payload = encoder(data)
File "/usr/lib/python3/dist-packages/kombu/utils/json.py", line 69, in dumps
dict(default_kwargs, **kwargs))
File "/usr/lib/python3.7/json/__init__.py", line 238, in dumps
**kw).encode(obj)
File "/usr/lib/python3.7/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python3.7/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
TypeError: keys must be str, int, float, bool or None, not bytes
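The error is reproducible without Scrapy or Celery at all: Python's json module refuses dictionaries whose keys are bytes, which is exactly the shape of the header dict shown above. A minimal sketch (the header values are copied from the example; no assumptions beyond the standard library):

```python
import json

# Same shape as Scrapy's response.headers: bytes keys, lists of bytes values.
headers = {b'Date': [b'Tue, 19 Nov 2019 11:01:54 GMT'],
           b'Content-Type': [b'text/html;charset=UTF-8']}

try:
    json.dumps(headers)
except TypeError as e:
    # keys must be str, int, float, bool or None, not bytes
    print(e)
```

This is why kombu's JSON encoder fails as soon as the raw headers object reaches the message queue payload.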
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python3.7/dist-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/lib/python3.7/dist-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
for r in iterable:
File "/usr/local/lib/python3.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in
The date and modification-time headers (in the case that all headers together exceed the max message size of the MQ, only these headers are used) are now decoded to an MQ-compatible format by https://github.com/opensemanticsearch/open-semantic-etl/commit/f6f757aae23f0b74429697eb7ebe92a9a2969bda
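The kind of conversion needed can be sketched as follows. This is a hypothetical helper for illustration, not the actual code from the commit; it assumes header names are ASCII and header values are UTF-8, which holds for the example headers above:

```python
def decode_headers(headers):
    """Decode a dict of bytes keys and lists of bytes values
    (the shape of Scrapy's response.headers) into plain str
    so it can be JSON-serialized for the message queue.

    Hypothetical helper, not the function used in the linked commit.
    """
    return {
        key.decode('ascii'): [value.decode('utf-8') for value in values]
        for key, values in headers.items()
    }


decoded = decode_headers({b'Date': [b'Tue, 19 Nov 2019 11:01:54 GMT']})
# decoded == {'Date': ['Tue, 19 Nov 2019 11:01:54 GMT']}
```

Note that newer Scrapy versions also offer a built-in conversion on the Headers class itself for getting a str-only mapping, which may be preferable to a hand-rolled helper.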
On newer installations (not sure whether this is caused by the upgrade to Debian 10, to Python, or to the Scrapy crawler framework), crawling a website fails because Scrapy returns headers in a different format.