istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Adding support for issue #182 #190

Open knirbhay opened 6 years ago

knirbhay commented 6 years ago

Adding support for:

1. Custom headers and cookies with the initial request
2. Shared cookies middleware to share cookies between crawl nodes

Linked Issue #182
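A shared cookies middleware could be sketched roughly like this: one cookie jar per `crawlid`, persisted in a store all crawl nodes can reach (Redis in scrapy-cluster; a plain dict stands in below so the sketch is self-contained). The class, method, and key names here are illustrative assumptions, not the PR's actual implementation.

```python
import json


class SharedCookiesMiddleware:
    """Sketch: share cookies between crawl nodes by keeping one
    jar per crawlid in a shared key-value store."""

    def __init__(self, store):
        # `store` needs get/set; a redis.StrictRedis client would fit
        self.store = store

    def _key(self, crawlid):
        return "cookies:%s" % crawlid

    def load_cookies(self, crawlid):
        # called before a request is sent on any node
        raw = self.store.get(self._key(crawlid))
        return json.loads(raw) if raw else {}

    def save_cookies(self, crawlid, new_cookies):
        # called after a response: merge into the shared jar so other
        # nodes pick the cookies up on their next request
        jar = self.load_cookies(crawlid)
        jar.update(new_cookies)
        self.store.set(self._key(crawlid), json.dumps(jar))


# in-memory stand-in for Redis, just for this sketch
class DictStore(dict):
    def get(self, k):
        return dict.get(self, k)

    def set(self, k, v):
        self[k] = v


mw = SharedCookiesMiddleware(DictStore())
mw.save_cookies("ABC123", {"device_id": "1"})   # set on node A
mw.save_cookies("ABC123", {"app_token": "guid"})  # set on node B
print(mw.load_cookies("ABC123"))  # both cookies, visible to every node
```

Keying the jar by `crawlid` (rather than by node) is what lets a session started on one crawl node continue on another.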

knirbhay commented 6 years ago

A custom feed will look like this:

```
python kafka_monitor.py feed '{
    "url": "http://dmoztools.net",
    "appid": "testapp",
    "crawlid": "ABC123",
    "spiderid": "myspiderid",
    "headers": {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "X-Requested-With": "dmoztools.net",
        "User-Agent": "My Custom User Agent"
    },
    "cookies": {
        "device_id": "1",
        "app_token": "guid"
    }
}'
```
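To show how the feed's new fields could reach the spider, here is a minimal sketch of mapping the feed JSON onto the keyword arguments of the initial request. The helper name `build_request_kwargs` and its defaults are assumptions for illustration, not code from this PR.

```python
import json


def build_request_kwargs(feed):
    # hypothetical helper: carry the feed's optional `headers` and
    # `cookies` fields onto the spider's initial request
    item = json.loads(feed)
    return {
        "url": item["url"],
        "headers": item.get("headers", {}),
        "cookies": item.get("cookies", {}),
        "meta": {"appid": item["appid"], "crawlid": item["crawlid"]},
    }


feed = json.dumps({
    "url": "http://dmoztools.net",
    "appid": "testapp",
    "crawlid": "ABC123",
    "headers": {"User-Agent": "My Custom User Agent"},
    "cookies": {"device_id": "1"},
})
kwargs = build_request_kwargs(feed)
print(kwargs["headers"]["User-Agent"])  # My Custom User Agent
# in a spider, these kwargs would feed something like scrapy.Request(**kwargs)
```

Omitting `headers` or `cookies` from the feed falls back to empty dicts, so existing feeds keep working unchanged.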

coveralls commented 6 years ago


Coverage decreased (-0.4%) to 69.99% when pulling 1cd7940d7b547dc4c50d60330e852afc644f8473 on knirbhay:dev into 2c2075a2b7f4b4bb4e7213ec12c9cf6133ac6fdc on istresearch:dev.

knirbhay commented 6 years ago

Due to the shared cookies middleware, coverage has decreased by 0.4%. @madisonb, do you think this can be managed? I improved coverage by a few percentage points in distributed_scheduler.