daVinciCEB opened this issue 6 years ago
Any solution so far?
I'm afraid I've tried it all; no luck so far.
I am using Crawlera and a Splash instance on Scrapinghub and am having the same issue. Roughly 50% of Crawlera requests time out for no apparent reason. Changing the request rate and/or the number of requests has had no effect on the issue in my experience...
Having the same issues as @johndavidsimmons and getting pretty frustrated. It seems like this service is just being left to die, which is odd given all the marketing push saying "No need to manage your own scraping and proxy infrastructure ... use ScrapingHub and pay us a whole bunch of money ..."
Has anyone found a solution? If so, would you mind sharing?
Ugh, still no solution for this?
I would recommend opening a support ticket from https://app.zyte.com/ -> Contact support
I was using Zyte's Splash instance with Zyte's proxy, and I found that I could access HTTP URLs without any problem, but HTTPS pages only worked through the proxy server's port 8010. Ports 8011 (suggested by the documentation) and 8014 (suggested by a tech support person) did not work, yet the tech support person's own working example used port 8010. (All three ports worked fine for HTTP pages.)
However, your problem may be that a request to "https://google.com/" triggers 33 separate web requests, and if you are using the Zyte (Crawlera) proxy, it pauses 12 seconds between web requests, so your single GET takes 33 × 12 = 396 seconds, far above the 60-second timeout. Try your example with a GET that triggers few or no additional web requests, like "https://duckduckgo.com/p103.js". My browser shows only 3 web requests when I load that, so it should stay under the timeout.
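One way to keep the subrequest count down is to abort resource requests from inside the Splash script itself, using Splash's `splash:on_request` hook and `request:abort()`. This is a minimal sketch, not the poster's script; it assumes you only need the main document's HTML and can live without scripts, images, and stylesheets:

```lua
function main(splash)
    -- Abort every request except the first (the main document), so the
    -- proxy's per-request delay cannot multiply past the render timeout.
    local first = true
    splash:on_request(function(request)
        if first then
            first = false
        else
            request:abort()
        end
    end)
    assert(splash:go(splash.args.url))
    return splash:html()
end
```

Note that aborting subresources will break pages whose content is rendered by JavaScript, so this is only appropriate when the initial HTML already contains what you need.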
Problem
I've been looking at using Splash to render JS-centric pages for scraping.
I am also using Crawlera as a proxy so that I don't have to worry about getting banned from pages.
Unfortunately, these two services do not work together at all and only return timeout errors, regardless of how high I set the timeout.
This is extremely problematic, as it means I cannot use the two services together, even for something as simple as scraping google.com.
Any help would be appreciated here!
Example Code
The following is my Python code, which performs a POST request against a Splash instance running in Docker on my machine:
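The original code block was not preserved in this thread; the following is a minimal sketch of what such a client typically looks like, assuming Splash's `/execute` HTTP endpoint on the default port 8050. The host URL, API key placeholder, and timeout value are illustrative, not the poster's actual values:

```python
import json

SPLASH_URL = "http://localhost:8050/execute"   # Splash running in Docker
CRAWLERA_APIKEY = "<YOUR_CRAWLERA_APIKEY>"     # placeholder, not a real key


def build_splash_payload(url, lua_source, timeout=90):
    """Build the JSON body for Splash's /execute endpoint.

    'lua_source' is the Lua script Splash will run; 'timeout' is the
    overall render timeout in seconds (it must not exceed the value
    Splash was started with via --max-timeout).
    """
    return {
        "lua_source": lua_source,
        "url": url,
        # Passed through to the script as splash.args.crawlera_apikey
        "crawlera_apikey": CRAWLERA_APIKEY,
        "timeout": timeout,
    }


payload = build_splash_payload("https://www.google.com", lua_source="...")
print(json.dumps(payload, indent=2))

# Sending the request requires a running Splash instance:
# import requests
# resp = requests.post(SPLASH_URL, json=payload)
# print(resp.status_code, resp.json())
```

If the render exceeds the `timeout` value (or Splash's `--max-timeout`), the `/execute` endpoint returns HTTP 504, which matches the symptom described in this issue.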
And here is the Lua script that I am using; it is the exact one from the example that ScrapingHub provides:
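The script itself was not preserved in this thread. For context, ScrapingHub's published Splash-plus-Crawlera example follows roughly this shape; the reconstruction below is approximate and should be checked against the official documentation (in particular the proxy port, per the comment above about 8010 vs. 8011):

```lua
function use_crawlera(splash)
    local user = splash.args.crawlera_apikey
    local host = 'proxy.crawlera.com'
    local port = 8010

    splash:on_request(function(request)
        -- Let Crawlera manage cookies, and route the request through it.
        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_proxy{host, port, username=user, password=''}
    end)
end

function main(splash)
    use_crawlera(splash)
    assert(splash:go(splash.args.url))
    return splash:html()
end
```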