Could anybody please help me understand why Splash does not render this product page at all? https://www.flipkart.com/apple-iphone-6s-space-grey-64-gb/p/itmebysg5kgxugfk?pid=MOBEBY3VTD7ZHZQA
Any help will be greatly appreciated.
Below is our sample spider. All I am trying to do is crawl a product URL and print the product title.
Note: The issue occurs with all product pages from this site, not just the page mentioned above. However, Splash is able to render the full HTML of the category page below, which is from the same site.
Category page link (which Splash renders fine): https://www.flipkart.com/mobiles/apple~brand/pr?sid=tyy,4io&otracker=product_breadCrumbs_Apple+Mobiles
Spider Code:
import scrapy
import re
import os.path

from flip.items import AppleItem
from apple.commonfunctions import format_review_count

from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from urlparse import urljoin
from scrapy.loader import ItemLoader
from scrapy_splash import SplashRequest


class FlipSpider(CrawlSpider):
    name = "flip"
    allowed_domains = []
    start_urls = [
        "https://www.flipkart.com/apple-iphone-6s-space-grey-64-gb/p/itmebysg5kgxugfk?pid=MOBEBY3VTD7ZHZQA"
    ]

    def __init__(self, timeStamp='', outputFolder='', *args, **kwargs):
        # The constructor body was not included in the post; chaining to
        # CrawlSpider here is an assumption so that the class is complete.
        super(FlipSpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        # Send every start URL through Splash's render.html endpoint.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_start_url,
                                endpoint='render.html',
                                args={'wait': 0.5})

    def parse_start_url(self, response):
        print "inside parse_detail_page"
        # extract() returns a list, which cannot be concatenated to a string;
        # extract_first() returns the first match or the given default.
        title = response.xpath('//h1[contains(@class,"_3eAQiD")]/text()').extract_first(default='')
        print "Product Title = " + title
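One way to tell an unrendered page apart from a selector that simply does not match is to run the returned HTML through a small standalone check. The sketch below is only an illustration: it uses the Python 3 standard library rather than Scrapy, `H1ClassFinder` and `find_h1_text` are hypothetical names, and the class fragment `_3eAQiD` is the one from the XPath in the spider. Unlike the XPath's `/text()`, it also collects text from elements nested inside the `<h1>`.

```python
from html.parser import HTMLParser


class H1ClassFinder(HTMLParser):
    """Collect the text of <h1> elements whose class contains a fragment."""

    def __init__(self, class_fragment):
        super().__init__()
        self.class_fragment = class_fragment
        self._inside = False   # True while inside a matching <h1>
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            classes = dict(attrs).get("class") or ""
            if self.class_fragment in classes:
                self._inside = True
                self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._inside = False

    def handle_data(self, data):
        # Accumulate all text seen while inside a matching <h1>.
        if self._inside:
            self.titles[-1] += data


def find_h1_text(html, class_fragment="_3eAQiD"):
    """Return the stripped text of every matching <h1> in the HTML string."""
    parser = H1ClassFinder(class_fragment)
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]
```

If this returns an empty list for the HTML that Splash sent back (for example `response.text` saved to a file), the title element was not present in the rendered page, which points at rendering rather than at the selector.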
The Splash-related settings in my settings.py are below.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

SPLASH_URL = 'http://localhost:8050/'

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
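To check rendering outside Scrapy, the same request can be sent to Splash's HTTP API directly, for instance from a browser or with curl. Below is a minimal sketch of how such a URL could be built, assuming the `SPLASH_URL` from the settings and the `render.html` endpoint with its `url`, `wait` and `timeout` arguments. `build_render_url` is a hypothetical helper name, and the code is Python 3, standard library only.

```python
from urllib.parse import urlencode, urljoin


def build_render_url(splash_url, target_url, wait=0.5, timeout=None):
    """Return a GET URL for Splash's render.html endpoint."""
    params = {"url": target_url, "wait": wait}
    if timeout is not None:
        # Optional: overall render timeout in seconds.
        params["timeout"] = timeout
    return urljoin(splash_url, "render.html") + "?" + urlencode(params)


# Values taken from the spider and settings in this post.
product_url = (
    "https://www.flipkart.com/apple-iphone-6s-space-grey-64-gb/p/"
    "itmebysg5kgxugfk?pid=MOBEBY3VTD7ZHZQA"
)
render_url = build_render_url("http://localhost:8050/", product_url, wait=0.5)
```

Opening `render_url` in a browser shows what Splash returns for the page, independent of the spider and its middlewares. A larger `wait` gives the page's scripts more time to run, although whether that is enough for this particular page is not something the sketch can tell.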