scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

Splash + Scrapy - #130

Closed by ericvalente 7 years ago

ericvalente commented 9 years ago

I am using Splash with Scrapy to render pages that rely on JavaScript.

It is all working fine and returning all the relevant fields:

    {'date': '12-02-2014',
     'description': u'ITEM',
     'month': u'December',
     'price': [u'549.00'],
     'productid': [u'171111236709'],
     'productid2': '1352322',
     'saleprice': [u'449'],
     'url': 'http://1.1.1.1:8050/render.html?wait=10&images=0&html=1&url=http://www.url.com',
     'year': '2014'}

In Scrapy, I have delay=.5 and concurrent_requests=10.

After about 5 minutes of scraping, the JavaScript-rendered element I am scraping (saleprice) comes back empty for every item:
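For reference, a minimal sketch of how those two values would usually be written in a Scrapy project's settings.py. The mapping of "delay" to DOWNLOAD_DELAY and "concurrent_requests" to CONCURRENT_REQUESTS is an assumption; the post does not show the actual settings file.

```python
# settings.py (sketch; the setting names are an assumed mapping of the
# "delay" and "concurrent_requests" values quoted in the post)

# Seconds Scrapy waits between consecutive requests to the same download slot.
# Every request here goes to the Splash host, so they all share one slot.
DOWNLOAD_DELAY = 0.5

# Upper bound on requests Scrapy keeps in flight at the same time.
CONCURRENT_REQUESTS = 10
```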

    {'date': '12-02-2014',
     'description': u'ITEM',
     'month': u'December',
     'price': [u'549.00'],
     'productid': [u'13535351351'],
     'productid2': '235352',
     'saleprice': [ ],
     'url': 'http://1.1.1.1:8050/render.html?wait=10&images=0&html=1&url=http://www.url.com',
     'year': '2014'}

    {'date': '12-02-2014',
     'description': u'ITEM2',
     'month': u'December',
     'price': [u'449.00'],
     'productid': [u'64654654'],
     'productid2': '79877',
     'saleprice': [ ],
     'url': 'http://1.1.1.1:8050/render.html?wait=10&images=0&html=1&url=http://www.url.com',
     'year': '2014'}

The element is definitely on the page when I navigate to it in a browser. The other elements are all present without JavaScript, so they come back fine. I tried both render.html and render.json and increased the wait time to 10, which seems to be the maximum.

I ran top on the Ubuntu 12.04 instance (which runs the Docker container) and CPU usage is high, but not 100%. I also tried different concurrent_requests values to make sure I was not overloading the Splash engine.

Any thoughts as to why the JavaScript stops rendering properly over time?

ericvalente commented 9 years ago

Is there a reason why wait=10 is the maximum? When Splash is rendering 10+ pages concurrently, the rendertime seems to go above 11 seconds:

    2014-12-02 16:37:56.006401 [stats] {"maxrss": 228224, "load": [0.17, 0.07, 0.12], "fds": 67, "qsize": 0, "rendertime": 11.102681875228882, "active": 18, "path": "/render.html", "args": {"images": ["0"],

I think the JavaScript element has not finished loading after 11 seconds. The site is in China and typically loads slowly.

kmike commented 7 years ago

I'm not sure what the problem was. Note that wait is not the same as timeout. There is now a library, https://github.com/scrapy-plugins/scrapy-splash, for integrating Splash and Scrapy.
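To illustrate the wait/timeout distinction, here is a minimal sketch that builds a render.html URL carrying both parameters. In Splash's HTTP API, wait is the time to keep waiting for updates after the page has loaded, while timeout bounds the whole render. The helper name splash_render_url, the default values, and the wait < timeout check are assumptions made for this example, not something taken from the thread.

```python
from urllib.parse import urlencode


def splash_render_url(splash_base, target_url, wait=5.0, timeout=30.0, images=0):
    """Build a Splash render.html URL (hypothetical helper for illustration).

    wait:    seconds to wait for updates after the page has loaded
    timeout: seconds allowed for the whole render
    """
    # Sanity check of this sketch only: a wait that is not shorter than the
    # overall timeout could never finish in time.
    if wait >= timeout:
        raise ValueError("wait must be smaller than timeout")
    query = urlencode(
        {"url": target_url, "wait": wait, "timeout": timeout, "images": images}
    )
    return f"{splash_base.rstrip('/')}/render.html?{query}"


if __name__ == "__main__":
    # Host and target URL are the placeholders used earlier in the thread.
    print(splash_render_url("http://1.1.1.1:8050", "http://www.url.com",
                            wait=10, timeout=60))
```

Note that urlencode percent-encodes the target URL, which keeps any query string on the target page from being mixed up with Splash's own parameters.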