istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Support for custom header and cookies for the initial request from kafka_monitor.py feed #182

Open knirbhay opened 6 years ago

knirbhay commented 6 years ago

I needed to request a URL with custom headers and preset cookies, e.g.:

There is an API at https://xyz.com/test_api/_id which returns JSON. It should be called with a POST request, passing the API keys in custom headers along with a few preset cookies.

How do I get it working with scrapy-cluster?

With plain Scrapy I used to override the start_requests method to apply custom headers and cookies.
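For reference, this is roughly what that looks like in a standalone Scrapy spider (the URL, header names, and cookie values below are placeholders):

```python
import scrapy


class ApiSpider(scrapy.Spider):
    name = 'api_spider'

    def start_requests(self):
        # placeholder URL, API key header, and preset cookie
        yield scrapy.Request(
            url='https://xyz.com/test_api/_id',
            method='POST',
            headers={'X-Api-Key': 'my-api-key'},
            cookies={'sessionid': 'preset-session-value'},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info(response.text)
```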

Another problem looks like a cookie jar issue: cookies are stored on one node and cannot be passed to another node. This shows up when the server uses the Set-Cookie header to store session details.

madisonb commented 6 years ago

Cookie support is already provided via the cookie field in the Kafka Monitor API. This is a cookie string that is then deserialized into a Scrapy cookie object.
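For example, a feed request can carry the cookie string alongside the usual fields (the values here are placeholders; see the Kafka Monitor API docs for the full crawl request schema):

```bash
python kafka_monitor.py feed '{
    "url": "https://xyz.com/test_api/_id",
    "appid": "testapp",
    "crawlid": "abc123",
    "cookie": "sessionid=preset-session-value; apikey=my-api-key"
}'
```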

As for custom request methods, the custom scheduler is where you want to look, since that is what translates the incoming objects into Scrapy Requests. I think the scheduler should be able to handle POST requests yielded from the spider, thanks to Scrapy's request-to-dict serialization methods, but on the initial request that is something that could be improved.
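A minimal sketch of the direction that could take: if the feed dict carried method and header fields (hypothetical field names, not part of the current Kafka Monitor schema), the scheduler could build the initial Request along these lines:

```python
from scrapy import Request


def request_from_feed(item):
    # 'method' and 'headers' are hypothetical feed fields,
    # not part of the stock Kafka Monitor API
    return Request(
        url=item['url'],
        method=item.get('method', 'GET'),
        headers=item.get('headers'),
        meta={
            'appid': item.get('appid'),
            'crawlid': item.get('crawlid'),
            'spiderid': item.get('spiderid'),
        },
        dont_filter=True,
    )
```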

Scrapy Cluster purposefully does not store cookie information in each spider, because any single chain of requests might go to multiple spiders or machines. You would need to customize the setup a bit to pass those cookies through your calls so they are used in subsequent requests.

Scrapy Cluster is best suited for large-scale, on-demand crawling, and in its current form (because it is distributed) it has some of the limitations and assumptions I noted above. I am always happy to look at or review a PR if you think it would be worthwhile to add to the project!

knirbhay commented 6 years ago

Working towards it. I got the custom request working with headers and cookies. Now working on shared cookie instances, shared via Redis and keyed by crawl/spider IDs.

knirbhay commented 6 years ago

The custom cookie middleware below worked for me. I am not sure this is the right place to initialize redis_conn; I could not find a way to share the DistributedScheduler's redis_conn.

```python
import pickle

import redis
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware


class SharedCookiesMiddleware(CookiesMiddleware):

    def __init__(self, debug=True, server=None):
        CookiesMiddleware.__init__(self, debug)
        self.redis_conn = server
        self.debug = debug

    @classmethod
    def from_crawler(cls, crawler):
        server = redis.Redis(host=crawler.settings.get('REDIS_HOST'),
                             port=crawler.settings.get('REDIS_PORT'),
                             db=crawler.settings.get('REDIS_DB'))
        return cls(crawler.settings.getbool('COOKIES_DEBUG'), server)

    def process_request(self, request, spider):
        if 'dont_merge_cookies' in request.meta:
            return
        cookiejarkey = "{spiderid}:sharedcookies:{crawlid}".format(
            spiderid=request.meta.get("spiderid"),
            crawlid=request.meta.get("crawlid"))

        jar = self.jars[cookiejarkey]
        jar.clear()

        # load the shared jar for this crawl from Redis if it exists
        if self.redis_conn.exists(cookiejarkey):
            data = self.redis_conn.get(cookiejarkey)
            jar = pickle.loads(data)

        cookies = self._get_request_cookies(jar, request)
        for cookie in cookies:
            jar.set_cookie_if_ok(cookie, request)

        # set Cookie header
        request.headers.pop('Cookie', None)
        jar.add_cookie_header(request)

        self._debug_cookie(request, spider)
        self.redis_conn.set(cookiejarkey, pickle.dumps(jar))

    def process_response(self, request, response, spider):
        if request.meta.get('dont_merge_cookies', False):
            return response
        cookiejarkey = "{spiderid}:sharedcookies:{crawlid}".format(
            spiderid=request.meta.get("spiderid"),
            crawlid=request.meta.get("crawlid"))

        jar = self.jars[cookiejarkey]
        jar.clear()

        # load the shared jar for this crawl from Redis if it exists
        if self.redis_conn.exists(cookiejarkey):
            data = self.redis_conn.get(cookiejarkey)
            jar = pickle.loads(data)

        # extract cookies from Set-Cookie and drop invalid/expired cookies
        jar.extract_cookies(response, request)
        self._debug_set_cookie(response, spider)

        self.redis_conn.set(cookiejarkey, pickle.dumps(jar))
        return response
```
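To wire it in, the stock cookie middleware has to be disabled and the shared one enabled in its place; a minimal sketch of the settings (the module path is a placeholder for wherever the class lives in your project):

```python
# settings.py -- 'crawling.shared_cookies' is a placeholder module path
DOWNLOADER_MIDDLEWARES = {
    # disable the built-in cookie handling so it does not conflict
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    # enable the shared, Redis-backed version at the same priority
    'crawling.shared_cookies.SharedCookiesMiddleware': 700,
}
```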

madisonb commented 6 years ago

Thanks @knirbhay! This is a great start; I can try to incorporate this or leave it as a standalone file. If you make a PR I can review it and get it merged in. There are just a couple of things I would like changed, but otherwise it is great work.

NirbhayK commented 6 years ago

Sure. I will also include the custom request support with headers and cookies. I have enhanced the Kafka feed API, but I still need to check whether it works with the Scrapy Cluster REST API.