scrapy-plugins / scrapy-splash

Scrapy+Splash for JavaScript integration
BSD 3-Clause "New" or "Revised" License
3.15k stars · 450 forks

Splash not rendering this JavaScript page #84

Closed · pipkarma closed this issue 4 years ago

pipkarma commented 8 years ago

Could anybody please help me understand why Splash will not render https://www.flipkart.com/apple-iphone-6s-space-grey-64-gb/p/itmebysg5kgxugfk?pid=MOBEBY3VTD7ZHZQA at all?

Any help will be greatly appreciated.

Below is our sample spider. All I am trying to do is crawl a product URL and print its product title.

Note: the issue occurs on all product pages from this site, not just the page mentioned above. However, Splash is able to render the full HTML of the category page below, from the same site.

Category page link (which Splash renders fine): https://www.flipkart.com/mobiles/apple~brand/pr?sid=tyy,4io&otracker=product_breadCrumbs_Apple+Mobiles

Spider Code:

```python
import scrapy
import re
import os.path

from flip.items import AppleItem
from apple.commonfunctions import format_review_count

from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from urlparse import urljoin
from scrapy.loader import ItemLoader
from scrapy_splash import SplashRequest


class FlipSpider(CrawlSpider):
    name = "flip"
    allowed_domains = []
    start_urls = [
        "https://www.flipkart.com/apple-iphone-6s-space-grey-64-gb/p/itmebysg5kgxugfk?pid=MOBEBY3VTD7ZHZQA"
    ]

    def __init__(self, timeStamp='', outputFolder='', *args, **kwargs):
        super(FlipSpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_start_url,
                                endpoint='render.html',
                                args={'wait': 0.5})
            # Alternative form of the same request, using the raw 'splash' meta key:
            # yield scrapy.Request(url, self.parse_start_url,
            #                      meta={'splash': {'endpoint': 'render.html',
            #                                       'args': {'wait': 0.5}}})

    def parse_start_url(self, response):
        print "inside parse_detail_page"
        print "Product Title = " + response.xpath('//h1[contains(@class,"_3eAQiD")]/text()').extract()
```
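To check whether Splash itself can render the page, independent of Scrapy and the middleware stack, the render.html endpoint can be queried directly over HTTP. A minimal sketch (Python 3 here for `urllib.parse`, and assuming Splash is listening on localhost:8050 as in `SPLASH_URL`; the helper name and the 3-second wait are my own choices):

```python
from urllib.parse import urlencode

SPLASH_URL = 'http://localhost:8050/'  # same value as SPLASH_URL in settings.py


def splash_render_url(page_url, wait=3.0):
    """Build a direct render.html URL for debugging Splash outside Scrapy."""
    params = urlencode({'url': page_url, 'wait': wait})
    return SPLASH_URL + 'render.html?' + params


url = splash_render_url(
    'https://www.flipkart.com/apple-iphone-6s-space-grey-64-gb/p/'
    'itmebysg5kgxugfk?pid=MOBEBY3VTD7ZHZQA')
# Open `url` in a browser (or fetch it with urllib.request.urlopen) and compare
# the HTML Splash returns with what the spider's response contains.
```

If the page also comes back unrendered here, the problem is on the Splash side rather than in the spider or the scrapy-splash configuration.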

The Splash-related settings in my settings.py are below.

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

SPLASH_URL = 'http://localhost:8050/'

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```
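When render.html with a short wait returns an incomplete page, a common remedy is Splash's execute endpoint with a small Lua script that waits longer before returning the HTML. A minimal sketch (the 3-second wait is an arbitrary value to experiment with, not a known fix for this site):

```python
# Lua script for Splash's 'execute' endpoint: load the page, give its
# JavaScript time to run, then return the fully rendered HTML.
LUA_SOURCE = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(3.0))
  return splash:html()
end
"""

# Arguments for the request; scrapy_splash is not imported here so the
# snippet stays standalone. In the spider it would be used as:
#   yield SplashRequest(url, self.parse_start_url,
#                       endpoint='execute', args=splash_args)
splash_args = {'lua_source': LUA_SOURCE}
```

If a fixed wait is not enough, the script can be extended to poll for a specific element before returning, but that is site-specific tuning.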

Gallaecio commented 5 years ago

@pipkarma Could you close this report and open a new one in https://github.com/scrapinghub/splash?