istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

duplicate logging confusing #149

Closed: softwarevamp closed this issue 5 years ago

softwarevamp commented 6 years ago

https://github.com/istresearch/scrapy-cluster/blob/77129294a0764bf279ea6d5fc2ca26b6df7ab513/crawler/crawling/pipelines.py#L198

#                 future2.add_callback(self._kafka_success, datum, spider)
#                 future2.add_errback(self._kafka_failure, datum, spider)

Original log entries when SC_LOG_JSON=TRUE:
{"message": "Sent page to Kafka", "crawlid": "5997b408e16c872d68e28aab", "attrs": {"url": "http://www.mytarget.cn/cggg/dfgg/", "_id": "5997b408e16c872d68e28aab", "enabled": true, "name": "\u5730\u65b9\u6807\u8baf"}, "spiderid": "link", "logger": "sc-crawler", "response_url": "http://www.mytarget.cn/cggg/dfgg/?_=1510102443667073", "timestamp": "2017-11-08T00:54:08.446441", "level": "INFO", "url": "http://www.mytarget.cn/cggg/dfgg/?_=1510102443667073", "success": true, "appid": "bulletinboards", "action": "ack"}
{"message": "Sent page to Kafka", "crawlid": "5997b408e16c872d68e28aab", "attrs": {"url": "http://www.mytarget.cn/cggg/dfgg/", "_id": "5997b408e16c872d68e28aab", "enabled": true, "name": "\u5730\u65b9\u6807\u8baf"}, "spiderid": "link", "logger": "sc-crawler", "response_url": "http://www.mytarget.cn/cggg/dfgg/?_=1510102443667073", "timestamp": "2017-11-08T00:54:08.446441", "level": "INFO", "url": "http://www.mytarget.cn/cggg/dfgg/?_=1510102443667073", "success": true, "appid": "bulletinboards", "action": "ack"}

When KAFKA_APPID_TOPICS is set to True, _kafka_success or _kafka_failure is called twice, which generates duplicate log entries.

madisonb commented 6 years ago

That is the point, though: the pipeline generates two different Kafka messages, one for the main Kafka topic and one for the appid Kafka topic. Is different behavior expected here?
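
A minimal sketch of why two identical log lines appear, assuming each send registers the same success callback. It is not the project's pipeline code: `FakeFuture`, `FakeProducer`, `send_item`, and `run_demo` are hypothetical stand-ins, and the topic names and the `demo` prefix are assumptions made for illustration.

```python
class FakeFuture:
    """Hypothetical stand-in for the future returned by a Kafka producer's send()."""

    def __init__(self):
        self._callbacks = []

    def add_callback(self, fn, *args):
        # Remember the callback and its bound arguments until the send resolves.
        self._callbacks.append((fn, args))
        return self

    def succeed(self, metadata=None):
        # Fire every registered callback, passing the send result last.
        for fn, args in self._callbacks:
            fn(*args, metadata)


class FakeProducer:
    """Hypothetical in-memory producer; records sends instead of using a broker."""

    def __init__(self):
        self.sent = []  # list of (topic, message, future)

    def send(self, topic, message):
        future = FakeFuture()
        self.sent.append((topic, message, future))
        return future


def send_item(producer, item, log_lines, appid_topics, topic_prefix="demo"):
    """Send to the firehose topic and, optionally, to the per-appid topic."""

    def on_success(datum, metadata):
        # One log line per acknowledged send, with no topic in the message.
        log_lines.append("Sent page to Kafka")

    future = producer.send(f"{topic_prefix}.crawled_firehose", item)
    future.add_callback(on_success, item)

    if appid_topics:
        appid_topic = f"{topic_prefix}.crawled_{item['appid']}"
        future2 = producer.send(appid_topic, item)
        future2.add_callback(on_success, item)


def run_demo(appid_topics):
    """Send one item, acknowledge every send, and return (topics, log lines)."""
    producer = FakeProducer()
    log_lines = []
    item = {"appid": "bulletinboards", "url": "http://example.com/"}
    send_item(producer, item, log_lines, appid_topics)
    for _topic, _message, future in producer.sent:
        future.succeed()
    return [topic for topic, _, _ in producer.sent], log_lines
```

With `appid_topics=True` the item goes to two topics and the callback runs once per send, so the log holds two entries that differ only in the topic they acknowledge, which the message itself does not record.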

softwarevamp commented 6 years ago

Would it be reasonable to send conditionally instead?

if self.appid_topics:
    send to appid
else:
    send to firehose

madisonb commented 6 years ago

You are free to make a fork and change the behavior if you want, but the point of the current setup is to make a topic that "everyone" can subscribe to, and a topic that "only X app" can subscribe to.
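
For anyone who does fork, the conditional routing proposed earlier in the thread could look roughly like the sketch below. `select_topics` is a hypothetical helper, not part of scrapy-cluster, and the topic names and `demo` prefix are assumptions. Note the trade-off raised above: with this change, subscribers to the shared topic no longer see items that are routed to an appid topic.

```python
def select_topics(item, appid_topics, topic_prefix="demo"):
    """Return the single topic an item should be sent to (hypothetical helper).

    Unlike the stock behavior described in this thread, which sends to both
    topics, this picks exactly one, so each item yields one send and one log line.
    """
    if appid_topics:
        # Route only to the per-application topic.
        return [f"{topic_prefix}.crawled_{item['appid']}"]
    # Otherwise route only to the shared topic that everyone can subscribe to.
    return [f"{topic_prefix}.crawled_firehose"]


# Usage: iterate over the result and send once per returned topic.
item = {"appid": "bulletinboards", "url": "http://example.com/"}
for topic in select_topics(item, appid_topics=True):
    print(topic)
```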