scrapy-plugins / scrapy-splash

Scrapy+Splash for JavaScript integration
BSD 3-Clause "New" or "Revised" License

scrapy-splash recursive crawl using CrawlSpider not working #92

Open dijadev opened 7 years ago

dijadev commented 7 years ago

Hi !

I have integrated scrapy-splash into my CrawlSpider via process_request in the rules, like this:

    def process_request(self, request):
        request.meta['splash'] = {
            'args': {
                # set rendering arguments here
                'html': 1,
            }
        }
        return request

The problem is that the crawl renders only the URLs at the first depth. I also wonder how I can get the response even with a bad HTTP code or a redirected response.

Thanks in advance,
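
On the second part of the question: plain Scrapy, independently of Splash, exposes this through request meta flags; a minimal sketch, building on the process_request hook above:

    def process_request(self, request):
        # handle_httpstatus_list: deliver these non-200 responses to the callback
        # instead of letting HttpErrorMiddleware drop them.
        request.meta['handle_httpstatus_list'] = [404, 500]
        # dont_redirect: return the 3xx response itself rather than following it.
        request.meta['dont_redirect'] = True
        request.meta['splash'] = {'args': {'html': 1}}
        return request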

wattsin commented 7 years ago

I also have this issue.

NORMAL REQUEST - it will follow the rules with follow=True:

    yield scrapy.Request(url, callback=self.parse, dont_filter=True, errback=self.errback_httpbin)

USING SPLASH - it will only visit the first url:

    yield scrapy.Request(url, callback=self.parse, dont_filter=True, errback=self.errback_httpbin,
                         meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 0.5}}})

dijadev commented 7 years ago

Has anyone found a solution?

wattsin commented 7 years ago

I have not, unfortunately.


amirj commented 7 years ago

I have the same problem, any solution?

wattsin commented 7 years ago

Negative.

brianherbert commented 7 years ago

+1 over here. Encountering the same issue as described by @wattsin.

dwj1324 commented 7 years ago

I also ran into the same issue today and found that CrawlSpider does a response type check in the _requests_to_follow function:

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    ...

However, responses generated by Splash are SplashTextResponse or SplashJsonResponse. That check means a Splash response won't produce any requests to follow.

komuher commented 7 years ago

+1

ghost commented 7 years ago

+1

hieu-n commented 7 years ago

@dwj1324

I tried to debug my spider with PyCharm and set a breakpoint at if not isinstance(response, HtmlResponse):. That code was never reached when SplashRequest was used instead of scrapy.Request.

What worked for me is to add this to the callback parsing function:

def parse_item(self, response):
    """Parse response into item also create new requests."""

    page = RescrapItem()
    ...
    yield page

    if isinstance(response, (HtmlResponse, SplashTextResponse)):
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
            r = SplashRequest(url=link.url, callback=self._response_downloaded,
                              args=SPLASH_RENDER_ARGS)
                r.meta.update(rule=rule, link_text=link.text)
                yield rule.process_request(r)

NingLu commented 7 years ago

+1, any update for this issue?

NingLu commented 7 years ago

@hieu-n I used the code you pasted here and changed the SplashRequest to a plain Request since I need to use the headers, but it doesn't work; the spider still crawls only the first-depth content. Any suggestion will be appreciated.

hieu-n commented 7 years ago

@NingLu I haven't touched scrapy for a while. In your case, what I would do is set a few breakpoints and step through your code and Scrapy's code. Good luck!

Goles commented 6 years ago

+1 any updates here?

dijadev commented 6 years ago

Hello everyone! As @dwj1324 said, CrawlSpider does a response type check in its _requests_to_follow function, so I've just overridden that function to stop it from skipping SplashJsonResponse(s). (The override was originally posted as a screenshot.)
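
A reconstruction of that override (a sketch; the original was a screenshot, and the same code appears in full later in this thread):

    from scrapy.http import HtmlResponse
    from scrapy_splash import SplashJsonResponse, SplashTextResponse

    def _requests_to_follow(self, response):
        # Accept Splash response types in addition to plain HtmlResponse.
        if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)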

Hope this helps!

tf42src commented 6 years ago

Having the same issue. I have overridden _requests_to_follow as stated by @dwj1324 and @dijadev.

As soon as I start using splash by adding the following code to my spider:

    def start_requests(self):
        for url in self.start_urls:
            print('->', url)
            yield SplashRequest(url, self.parse_item, args={'wait': 0.5})

it no longer calls _requests_to_follow; Scrapy follows links again when I comment that code out. A likely reason: the SplashRequest above sends responses straight to self.parse_item, bypassing CrawlSpider's built-in parse callback, which is what invokes _requests_to_follow.

VictorXunS commented 6 years ago

Hi, I have found a workaround that works for me: instead of using a scrapy request, yield scrapy.Request(page_url, self.parse_page), simply prepend the Splash prefix to the URL: yield scrapy.Request("http://localhost:8050/render.html?url=" + page_url, self.parse_page). The localhost port may depend on how you built the Splash Docker image.
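
A caveat with this approach: if page_url carries its own query string, it should be percent-encoded before being embedded in the render.html URL, e.g. with the standard library:

    from urllib.parse import quote

    # Escape page_url so its own query string does not break the outer URL.
    yield scrapy.Request("http://localhost:8050/render.html?url=" + quote(page_url, safe=''),
                         self.parse_page)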

reg3x commented 5 years ago

Hi, I have found a workaround that works for me: instead of using a scrapy request, yield scrapy.Request(page_url, self.parse_page), simply prepend the Splash prefix to the URL: yield scrapy.Request("http://localhost:8050/render.html?url=" + page_url, self.parse_page). The localhost port may depend on how you built the Splash Docker image.

@VictorXunS this is not working for me, could you share all your CrawlSpider code?

victor-papa commented 5 years ago

Also had problems combining CrawlSpider with SplashRequests and Crawlera. Overriding the _requests_to_follow function by removing the whole isinstance condition worked for me. Thanks @dijadev and @hieu-n for the suggestions.

    def _requests_to_follow(self, response):
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response) if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def _build_request(self, rule, link):
        r = Request(url=link.url, callback=self._response_downloaded)
        r.meta.update(rule=rule, link_text=link.text)
        return r

JavierRuano commented 5 years ago

I am not an expert, but Scrapy has its own duplicate filter, doesn't it? (You use not seen.)

See http://doc.scrapy.org/en/latest/topics/link-extractors.html, class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor.


XamHans commented 5 years ago

Hi @Nick-Verdegem, thank you for sharing. My CrawlSpider is still not working with your solution. Do you use start_requests?

MontaLabidi commented 5 years ago

So I encountered this issue and solved it by overriding the type check as suggested:

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashTextResponse)):
            return
        ...

But you also have to avoid using SplashRequest in your process_request method to create the new Splash requests; just add splash to your scrapy.Request's meta. The scrapy.Request returned from _requests_to_follow carries attributes in its meta, like the index of the rule that generated it, which CrawlSpider uses for its own logic. So you don't want to generate a totally different request in your request wrapper by using SplashRequest; just add splash to the already-built request, like so:

    def use_splash(self, request):
        request.meta.update(splash={
            'args': {
                'wait': 1,
            },
            'endpoint': 'render.html',
        })
        return request

and add it to your Rule: process_request="use_splash". The _requests_to_follow method will apply process_request to every built request; that's what worked for my CrawlSpider. Hope that helps!
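
For reference, a minimal Rule wiring for this approach (the link extractor here is a placeholder):

    rules = (
        Rule(LinkExtractor(), callback='parse_item',
             process_request='use_splash', follow=True),
    )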

nciefeiniu commented 5 years ago

I use scrapy-splash and scrapy-redis.

RedisCrawlSpider can run if you rewrite the following:

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse_m, endpoint='execute', dont_filter=True, args={
                'url': url, 'wait': 5, 'lua_source': default_script
            })

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def _build_request(self, rule, link):
        # The 'meta' parameter is required!
        r = SplashRequest(url=link.url, callback=self._response_downloaded, meta={'rule': rule, 'link_text': link.text},
                          args={'wait': 5, 'url': link.url, 'lua_source': default_script})
        # This may be redundant, since 'rule' and 'link_text' are already set in meta above.
        r.meta.update(rule=rule, link_text=link.text)
        return r

Some parameters (for example default_script, which is my own Lua script) will need to be adjusted for your own project.

sp-philippe-oger commented 5 years ago

@MontaLabidi Your solution worked for me.

This is how my code looks:


from scrapy.http import HtmlResponse
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashJsonResponse, SplashTextResponse

class MySuperCrawler(CrawlSpider):
    name = 'mysupercrawler'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    rules = (
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//div/a'),
            follow=True
        ),
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//div[@class="pages"]/li/a'),
            process_request="use_splash",
            follow=True
        ),
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//a[@class="product"]'),
            callback='parse_item',
            process_request="use_splash"
        )
    )

    def _requests_to_follow(self, response):
        if not isinstance(
                response,
                (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def use_splash(self, request):
        request.meta.update(splash={
            'args': {
                'wait': 1,
            },
            'endpoint': 'render.html',
        })
        return request

    def parse_item(self, response):
        pass

This works perfectly for me.

digitaldust commented 5 years ago

@sp-philippe-oger could you please show the whole file? In my case the CrawlSpider won't call the redefined _requests_to_follow and as a consequence still stops after the first page...

sp-philippe-oger commented 5 years ago

@digitaldust pretty much the whole code is there. Not sure what is missing for you to make it work.

digitaldust commented 5 years ago

@sp-philippe-oger don't worry, I actually realized my problem is with the LinkExtractor, not the scrapy/splash combo... thanks!

MSDuncan82 commented 5 years ago

Anyone get this to work while running a Lua script for each pagination?

davisbra commented 5 years ago

@nciefeiniu hi... would you please give more information about integrating scrapy-redis with Splash? I mean, how do you send your URLs from Redis to Splash?

zhaicongrong commented 4 years ago

[Quotes @sp-philippe-oger's comment above in full, including the MySuperCrawler code.]

I use Python 3, but there's an error: _identity_process_request() missing 1 required positional argument. Is there something wrong?

Gallaecio commented 4 years ago

Since Scrapy 1.7.0, the process_request callback also receives a response parameter, so you need to change def use_splash(self, request): to def use_splash(self, request, response):
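
In code, the change described looks like this (a sketch of the use_splash helper from earlier comments):

    # Scrapy >= 1.7: process_request callbacks receive (request, response).
    def use_splash(self, request, response):
        request.meta.update(splash={
            'args': {'wait': 1},
            'endpoint': 'render.html',
        })
        return request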

janwendt commented 4 years ago

If someone runs into the same problem of needing to use Splash in a CrawlSpider (with Rule and LinkExtractor) BOTH for parse_item and the initial start_requests, e.g. to bypass Cloudflare, here is my solution:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest, SplashJsonResponse, SplashTextResponse
from scrapy.http import HtmlResponse

class Abc(scrapy.Item):
    name = scrapy.Field()

class AbcSpider(CrawlSpider):
    name = "abc"
    allowed_domains = ['abc.de']
    start_urls = ['https://www.abc.com/xyz']

    rules = (Rule(LinkExtractor(restrict_xpaths='//h2[@class="abc"]'), callback='parse_item', process_request="use_splash"),)

    def start_requests(self):        
        for url in self.start_urls:
            yield SplashRequest(url, args={'wait': 15}, meta={'real_url': url})

    def use_splash(self, request):
        request.meta['splash'] = {
                'endpoint':'render.html',
                'args':{
                    'wait': 15,
                    }
                }
        return request

    def _requests_to_follow(self, response):
        if not isinstance(
                response,
                (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def parse_item(self, response):
        item = Abc()
        item['name'] = response.xpath('//div[@class="abc-name"]/h1/text()').get()
        return item

vishalmry commented 3 years ago

Since Scrapy 1.7.0, the process_request callback also receives a response parameter, so you need to change def use_splash(self, request): to def use_splash(self, request, response):

It does not work; it throws an error: use_splash() missing 1 required positional argument: 'response'

Gallaecio commented 3 years ago

@vishKurama Which Scrapy version are you using? Can you share a minimal, reproducible example?

gingergenius commented 3 years ago

Since Scrapy 1.7.0, the process_request callback also receives a response parameter, so you need to change def use_splash(self, request): to def use_splash(self, request, response):

It does not work, throws an error use_splash() is missing 1 required positional argument: 'response'

I had this problem too. Just use yield rule.process_request(r, response) as the last line of the overridden method.
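
Applied to the overridden _requests_to_follow, the loop then ends with (sketch):

    for link in links:
        seen.add(link)
        r = self._build_request(n, link)
        # Scrapy >= 1.7 passes the response to rule.process_request as well.
        yield rule.process_request(r, response)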

JwanKhalaf commented 3 years ago

I am facing a similar problem and the solutions listed here aren't working for me, unless I've missed something!

import scrapy
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest, SplashJsonResponse, SplashTextResponse
import logging

class MainSpider(CrawlSpider):
    name = 'main'
    allowed_domains = ['www.somesite.com']

    script = '''
    function main(splash, args)
      splash.private_mode_enabled = false

      my_user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'

      headers = {
        ['User-Agent'] = my_user_agent,
        ['Accept-Language'] = 'en-GB,en-US;q=0.9,en;q=0.8',
        ['Referer'] = 'https://www.google.com'
      }

      splash:set_custom_headers(headers)

      url = args.url

      assert(splash:go(url))

      assert(splash:wait(2))

      -- username input
      username_input = assert(splash:select('#username'))
      username_input:focus()
      username_input:send_text('myusername')
      assert(splash:wait(0.3))

      -- password input
      password_input = assert(splash:select('#password'))
      password_input:focus()
      password_input:send_text('mysecurepass')
      assert(splash:wait(0.3))

      -- the login button
      login_btn = assert(splash:select('#login_btn'))
      login_btn:mouse_click()
      assert(splash:wait(4))

      return splash:html()
    end
    '''

    rules = (
        Rule(LinkExtractor(restrict_xpaths="(//div[@id='sidebar']/ul/li)[7]/a"), callback='parse_item', follow=True, process_request='use_splash'),
    )

    def start_requests(self):
        yield SplashRequest(url = 'https://www.somesite.com/login', callback = self.post_login, endpoint = 'execute', args = {
            'lua_source': self.script
        })

    def use_splash(self, request):
        request.meta.update(splash={
            'args': {
                'wait': 1,
            },
            'endpoint': 'render.html',
        })

        return request

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return

        seen = set()

        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response) if lnk not in seen]

            if links and rule.process_links:
                links = rule.process_links(links)

            for link in links:
                seen.add(link)
                r = self._build_request(n, link)

                yield rule.process_request(r)

    def post_login(self, response):
       logging.info('hey from post login!')

       with open('post_login_response.txt', 'w') as f:
           f.write(response.text)
           f.close()

    def parse_item(self, response):
        logging.info('hey from parse_item!')

        with open('post_search_response.txt', 'w') as f:
            f.write(response.text)
            f.close()

The parse_item function is never hit: in the logs I never see hey from parse_item!, but I do see hey from post login!. I'm not sure what I'm missing.
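
One possible explanation (an assumption, not a confirmed fix): CrawlSpider applies its rules only to responses handled by its built-in parse callback, and the SplashRequest in start_requests routes the login response to post_login instead, so _requests_to_follow never runs. A sketch of handing the post-login page back to the rules:

    def post_login(self, response):
        logging.info('hey from post login!')
        # Re-request the rendered page with CrawlSpider's built-in parse callback,
        # so the rules (and _requests_to_follow) are applied to it.
        yield SplashRequest(response.url, callback=self.parse,
                            args={'wait': 1}, dont_filter=True)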

InzamamAnwar commented 1 year ago

The following is a working crawler for scraping https://books.toscrape.com, tested with Scrapy version 2.9.0. For installing and configuring Splash, follow the README.


import scrapy
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest, SplashTextResponse, SplashJsonResponse

class FictionBookScrapper(CrawlSpider):
    _WAIT = 0.1

    name = "fiction_book_scrapper"
    allowed_domains = ['books.toscrape.com']
    start_urls = ["https://books.toscrape.com/catalogue/category/books_1/index.html"]

    le_book_details = LinkExtractor(restrict_css=("h3 > a",))
    rule_book_details = Rule(le_book_details, callback='parse_request', follow=False, process_request='use_splash')

    le_next_page = LinkExtractor(restrict_css='.next > a')
    rule_next_page = Rule(le_next_page, follow=True, process_request='use_splash')

    rules = (
        rule_book_details,
        rule_next_page,
    )

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, args={'wait': self._WAIT}, meta={'real_url': url})

    def use_splash(self, request, response):
        request.meta['splash'] = {
            'endpoint': 'render.html',
            'args': {
                'wait': self._WAIT
            }
        }
        return request

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashTextResponse, SplashJsonResponse)):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [
                lnk
                for lnk in rule.link_extractor.extract_links(response)
                if lnk not in seen
            ]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def parse_request(self, response: scrapy.http.Response):
        self.logger.info(f'Page status code = {response.status}, url = {response.url}')

        yield {
            'Title': response.css('h1 ::text').get(),
            'Link': response.url,
            'Description': response.xpath('//*[@id="content_inner"]/article/p/text()').get()
        }
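
For completeness, the Splash-related project settings the README describes (values as in the scrapy-splash README; adjust SPLASH_URL to your own instance):

    # settings.py
    SPLASH_URL = 'http://localhost:8050'
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'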