istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License
1.18k stars, 323 forks

Add login in LinkSpider #188

Closed josselinlbe closed 6 years ago

josselinlbe commented 6 years ago

Hello! I want to create my own spider that simulates a user login, but I don't understand this __init__ method:

    def __init__(self, *args, **kwargs):
        super(LinkSpider, self).__init__(*args, **kwargs)

How can I add a login system to LinkSpider? Here is the code I have so far. Do you think I can fit it in?

    # Temp
    login_page = 'https://url/'
    email = 'mail'
    password = 'pwd'

    def init_request(self):
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Submit the credential attributes, not the literal strings 'email' and 'password'.
        return FormRequest.from_response(response,
                                         formdata={'session_key': self.email, 'session_password': self.password},
                                         callback=self.check_login_response)

    def check_login_response(self, response):
        # response.text is the decoded body; response.body is bytes on Python 3.
        if "Sign Out" in response.text:
            self._logger.debug("Successfully logged in. Let's start crawling!")
            return self.initialized()
        else:
            self._logger.debug("Failed, bad times.")

    def parse(self, response):
        self._logger.debug("crawled url {}".format(response.request.url))
        cur_depth = 0

Thanks :+1:

madisonb commented 6 years ago

Without a persistent distributed cookie cache like the one discussed in this comment, crawls of your logged-in pages will not pass the same cookie token on different requests at depth > 0, because those requests can be handled by different spiders.

If you are comfortable having each spider log in upon startup, this may work. However, you should look at the cookie implementation in the project to make sure it behaves the way you expect.
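
To make the difference concrete, here is a rough, standard-library-only sketch of the two options above. It is a simulation, not Scrapy or scrapy-cluster code: `FakeSite`, `Worker`, and the plain dict standing in for a distributed cookie cache are all hypothetical names made up for illustration.

```python
import itertools


class FakeSite:
    """Hypothetical website that issues a new session token on every login."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.sessions = set()

    def login(self):
        token = "session-{}".format(next(self._counter))
        self.sessions.add(token)
        return token

    def fetch(self, cookies):
        # A page is only served when a known session cookie is presented.
        if cookies.get("session") in self.sessions:
            return "ok"
        return "login required"


class Worker:
    """Hypothetical spider process; `cookies` is its cookie store."""

    def __init__(self, site, cookies=None):
        self.site = site
        # Default: a private per-process jar. Passing a shared dict
        # simulates a distributed cookie cache (assumption, not a real API).
        self.cookies = {} if cookies is None else cookies

    def login(self):
        self.cookies["session"] = self.site.login()

    def crawl(self):
        return self.site.fetch(self.cookies)


site = FakeSite()

# Case 1: private jars, only worker A has logged in.
a, b = Worker(site), Worker(site)
a.login()
private_results = (a.crawl(), b.crawl())  # ("ok", "login required")

# Case 2: every worker logs in on startup; each ends up with its own token.
b.login()
tokens_differ = a.cookies["session"] != b.cookies["session"]  # True

# Case 3: a shared store; one login is visible to every worker.
shared = {}
c, d = Worker(site, shared), Worker(site, shared)
c.login()
shared_results = (c.crawl(), d.crawl())  # ("ok", "ok")
```

In case 2 every worker can reach the protected pages, but each holds a different session, which is why it is worth checking how the target site and the project's cookie handling treat that.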

I presume your use case is to have your entire cluster log in to a specific website, and then crawl pages within the site on demand? It is an interesting use case, but it would limit this "generalized" project, so I will have to think about whether it makes sense to have the capability here.

Otherwise, since this is a personal setup and not a bug or issue, can we close this and move to the Gitter chat room?

madisonb commented 6 years ago

Closing. I think this is best reserved for a personal/custom setup and not a generic setup as supported by this project.