scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

Splash Only Returns Timeout When Using Crawlera #770

Open daVinciCEB opened 6 years ago

daVinciCEB commented 6 years ago

Problem

I've been looking at using Splash to render JS-centric pages for scraping.

I am also using Crawlera as a proxy so that I don't have to worry about getting banned from pages.

Unfortunately, the two services do not work together for me at all: every request returns a timeout error, regardless of how high I increase the timeout.

This is extremely problematic, as it means I cannot use the two services together even for something as simple as scraping google.com.

Any help would be appreciated here!

Example Code

The following is my Python code, which performs a POST request against a Splash instance running in Docker on my machine:

import requests

def get_lua_proxy_script():
    with open('proxied_splash_request.lua') as f:
        return f.read()

def get_json_args(url, lua_source, timeout):
    return {
        'url': url,
        'lua_source': lua_source,
        'timeout': timeout
    }

print('Reading Lua Script...')
proxy_script = get_lua_proxy_script()
print('Lua Script Read!')
print('Creating JSON Arguments...')
json_args = get_json_args('https://google.com/', proxy_script, 300.0)
print('JSON Arguments Created!')

splash_url = 'http://localhost:8050/execute'
print('Starting Post Request...')
response = requests.post(splash_url, json=json_args)
print('Post Request Completed!')
print(response.text)
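
Note: by default Splash caps the timeout argument at 90 seconds and rejects larger values with an error instead of honouring them, so timeout=300.0 only takes effect if the container was started with the --max-timeout option. It also helps to inspect the status code and Splash's JSON error body rather than only response.text. A minimal sketch of that check (not part of the original report; it reuses the splash_url and json_args defined above):

import requests

def post_to_splash(splash_url, json_args):
    # Send the render request and surface Splash's JSON error body on failure.
    response = requests.post(splash_url, json=json_args)
    if response.status_code != 200:
        # Splash reports bad arguments, Lua errors and render failures as JSON.
        try:
            print('Splash error:', response.json())
        except ValueError:
            print('Non-JSON error:', response.status_code, response.text)
    return response

response = post_to_splash(splash_url, json_args)

A 400 whose description mentions the timeout points at the --max-timeout setting rather than at the proxy.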

And here is the Lua script that I am using; it is exactly the one from the example that ScrapingHub provides:

function use_crawlera(splash)
    -- Make sure you pass your Crawlera API key in the 'crawlera_user' arg.
    -- Have a look at the file spiders/quotes-js.py to see how to do it.
    -- Find your Crawlera credentials in https://app.scrapinghub.com/
    local user = '<CRAWLERA-API-KEY>'

    local host = 'proxy.crawlera.com'
    local port = 8010
    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'

    splash:on_request(function (request)
        -- The rules below can be used to speed up the crawling process.
        -- They filter out requests to undesired domains and useless
        -- resources. Keep the ones that make sense for your use case
        -- and add your own rules.

        -- Discard requests to advertising and tracking domains.
        if string.find(request.url, 'doubleclick%.net') or
           string.find(request.url, 'analytics%.google%.com') then
            request.abort()
            return
        end

        -- Avoid using Crawlera for subresource fetching to increase crawling
        -- speed. The example below avoids using Crawlera for URLs starting
        -- with 'static.' and ones ending with '.png'.
        if string.find(request.url, '://static%.') ~= nil or
           string.find(request.url, '%.png$') ~= nil then
            return
        end

        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_header(session_header, session_id)
        request:set_proxy{host, port, username=user, password=''}
    end)

    splash:on_response_headers(function (response)
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    splash:go(splash.args.url)
    return splash:html()
end
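
A useful intermediate check is to confirm that Crawlera itself is reachable from the same machine by fetching the target through the proxy directly, without Splash. A rough sanity check, assuming the standard way of pointing requests at Crawlera (the <CRAWLERA-API-KEY> placeholder is the same one as in the script above; verify=False is used because Crawlera intercepts HTTPS unless its CA certificate is installed):

import requests

# Hypothetical sanity check: fetch the page through Crawlera directly,
# bypassing Splash, to see whether the timeouts come from the proxy itself.
proxy = 'http://<CRAWLERA-API-KEY>:@proxy.crawlera.com:8010'
proxies = {'http': proxy, 'https': proxy}

response = requests.get(
    'https://google.com/',
    proxies=proxies,
    verify=False,  # or install Crawlera's CA certificate and drop this
    timeout=60,
)
print(response.status_code)

If this direct request also times out, the problem is on the Crawlera side rather than in the Splash script.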
chexenia commented 5 years ago

any solution so far?

Gallaecio commented 5 years ago

Have you seen https://support.scrapinghub.com/support/solutions/articles/22000188428-using-crawlera-with-splash-scrapy ?

chexenia commented 5 years ago

I'm afraid I've tried it all... no luck so far.

johndavidsimmons commented 5 years ago

I am using Crawlera and a Splash instance on Scrapinghub and am having the same issue. Roughly 50% of Crawlera requests time out for seemingly no reason. Changing the speed of requests and/or the number of requests has had no effect on the issue in my experience...

amcquistan commented 4 years ago

Having the same issues as @johndavidsimmons and getting pretty frustrated. It seems like this service is just being left to die, which is odd given all the marketing push saying "No need to manage your own scraping and proxy infrastructure ... use ScrapingHub and pay us a whole bunch of money ..."

teocns commented 3 years ago

Has anyone found a solution? If so, would you mind sharing?

psdon commented 3 years ago

urghh, still no solution for this?

aurishhammadhafeez commented 3 years ago

I would recommend opening a support ticket from https://app.zyte.com/ -> Contact support

ccady commented 3 years ago

I was using Zyte's Splash instance with Zyte's proxy, and I found that I could access HTTP URLs without any problem, but I could only access HTTPS pages through the proxy server's port 8010. Ports 8011 (which the documentation suggested) and 8014 (which a tech support person suggested) did not work for HTTPS; the tech support person's working example used port 8010. (All three ports worked fine for HTTP pages.)

However -- your problem may be that a request to "https://google.com/" produces 33 different web requests, and if you are using the Zyte (Crawlera) proxy, it pauses 12 seconds between requests, so your single GET takes roughly 33 * 12 = 396 seconds, far above the 60-second timeout. Try your example with a GET that triggers no other web requests, like "https://duckduckgo.com/p103.js". My browser shows only 3 web requests when I do that, so it should come in under the timeout.
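
For anyone who wants to try that quickly, the single-resource test can be run with the helpers from the first post. A sketch, assuming get_lua_proxy_script and get_json_args from the Python snippet above are defined in the same file:

import requests

# Single-resource test suggested above: a URL that triggers only a couple of
# sub-requests, so Crawlera's per-request delay cannot push the render past
# the timeout.
proxy_script = get_lua_proxy_script()
json_args = get_json_args('https://duckduckgo.com/p103.js', proxy_script, 90.0)

response = requests.post('http://localhost:8050/execute', json=json_args)
print(response.status_code)
print(response.text[:200])

If this succeeds while the google.com render still times out, the subresource filtering in use_crawlera is the place to look.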