scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

Proxy not being used in Splash or scrapy-splash #927

Open fkhan6601 opened 5 years ago

fkhan6601 commented 5 years ago

I have a proxy running on localhost:8090 that works with Selenium. I am trying to get Splash to work, but the proxy is not being used at all. When the proxy is running, I can see all traffic through it. By setting scrapy to proxy its traffic, I can see the IP Splash is running on, so I know the proxy works. I need Splash to route its traffic through the proxy so I can reach the external page. Setting the proxy does not seem to work in any way.

  1. Using Splash through the browser at port 8050 in a Docker container, per the docs, renders the page, but no traffic goes through the proxy, and the page still renders when the proxy is not running:

    function main(splash, args)
      splash:on_request(function(request)
        request:set_proxy{
          host = "127.0.0.1",
          port = 8090,
          username = "",
          password = "",
          type = "HTTP"
        }
      end)
      assert(splash:go(args.url))
      assert(splash:wait(0.5))

      return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
      }
    end
  2. Using a Lua script with scrapy, the page renders with or without the proxy running. spider.py:

    def start_requests(self):
        script = """
            function main(splash, args)
    
                assert(splash:go(args.url))
                assert(splash:wait(0.5))
                splash:on_request(function(request)
                    request:set_proxy{
                        host = "127.0.0.1",
                        port = 8090,
                        username = "",
                        password = "",
                        type = "HTTP"
                    }
                end)

                return {
                    html = splash:html(),
                    png = splash:png(),
                    har = splash:har(),
                }
            end
            """
        req = SplashRequest("http://mysite/home", self.log_in,
                            endpoint='execute', args={'lua_source': script})
        # req.meta['proxy'] = 'http://127.0.0.1:8090'
        yield req

    settings.py:

        SPLASH_URL = 'http://0.0.0.0:8050'
        DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
        HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
        DOWNLOAD_DELAY = 0.5
        AUTOTHROTTLE_ENABLED = True
        AUTOTHROTTLE_START_DELAY = 2
        AUTOTHROTTLE_TARGET_CONCURRENCY = 1
        SPLASH_COOKIES_DEBUG = True
        COOKIES_ENABLED = True
        COOKIES_DEBUG = True
        CONCURRENT_REQUESTS_PER_DOMAIN = 1
        CONCURRENT_REQUESTS_PER_IP = 1

        #####################################################################

        BOT_NAME = 'recspider'

        SPIDER_MODULES = ['recspider.spiders']
        NEWSPIDER_MODULE = 'recspider.spiders'

        DEFAULT_REQUEST_HEADERS = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
        }

        # Enable or disable spider middlewares
        # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
        SPIDER_MIDDLEWARES = {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
            'recspider.middlewares.RecspiderSpiderMiddleware': 543,
        }

        # Enable or disable downloader middlewares
        # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
        DOWNLOADER_MIDDLEWARES = {
            'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
            'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
            # 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
            'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500,
            'recspider.middlewares.RecspiderDownloaderMiddleware': 543,
            'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
            # 'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
            'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
            'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
            'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
            # 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': 830,
            # 'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
            'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
        }

  3. Using a proxy profiles path, I get status 502. I only tried this once, though (a sketch of the layout the docs describe follows the command below):

    file located at ~/documents/proxy-profile:

        [proxy]
        ; required
        host=127.0.0.1
        port=8090

    shell:

        docker run -it -p 8050:8050 -v ~/Documents/proxy-profile:/etc/splash/proxy-profiles scrapinghub/splash --proxy-profiles-path=/etc/splash/proxy-profiles
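
    For reference, a minimal sketch of the layout the Splash docs describe for proxy profiles, if I recall them correctly: a directory of .ini files mounted into the container, with default.ini applied when no proxy argument is given. The directory and file names below are illustrative; the host/port are copied from the profile above:

        # ~/Documents/proxy-profiles/default.ini (note the .ini extension):
        #     [proxy]
        #     host=127.0.0.1
        #     port=8090
        docker run -it -p 8050:8050 \
            -v ~/Documents/proxy-profiles:/etc/splash/proxy-profiles \
            scrapinghub/splash --proxy-profiles-path=/etc/splash/proxy-profiles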



Does the proxy feature not work?

I have confirmed the proxy can be used by Firefox, Chrome, Selenium (with both browsers), and scrapy. I need requests to go scrapy -> Splash -> proxy -> website; it only works as scrapy -> proxy -> Splash -> website.
Sengxian commented 5 years ago

I think the proxy feature works properly. Inside a Docker container, 127.0.0.1 does not refer to your machine's localhost by default; it refers to the container itself. Your proxy is running outside the container, so you may use host.docker.internal to reach it from inside the container.
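
For example, a minimal sketch of the script from the issue with that change applied. Two assumptions worth flagging: splash:on_request has to be registered before splash:go for the handler to affect the request, and host.docker.internal must resolve inside the container (it does by default on Docker Desktop for Mac/Windows; see the note at the end of the thread for Linux):

    function main(splash, args)
      -- register the proxy handler before navigating, or it never applies
      splash:on_request(function(request)
        request:set_proxy{
          host = "host.docker.internal",  -- the Docker host, not the container's own 127.0.0.1
          port = 8090,
          type = "HTTP"
        }
      end)
      assert(splash:go(args.url))
      assert(splash:wait(0.5))
      return {html = splash:html()}
    end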

rubmz commented 3 years ago

ScrapySplash + proxy profiles == headache!

It would be very nice if someone could provide a simple example with one proxy IP, plus what should be set in the scrapy-splash request's args['proxy'].

I hate the guessing game; it takes a long time. If not an example, could you please add some proper docs?

Thanks for the great plugin nevertheless.
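
For what it's worth, a minimal sketch of the args['proxy'] route: if I read the docs right, Splash's render endpoints accept a proxy argument (a proxy profile name, or a full proxy URL since Splash 2.1), and scrapy-splash forwards args verbatim. The spider name, target URL, and proxy address below are placeholders:

    from scrapy import Spider
    from scrapy_splash import SplashRequest

    class ExampleSpider(Spider):
        name = "example"

        def start_requests(self):
            # args are passed straight through to Splash's HTTP API;
            # 'proxy' may be a profile name or a proxy URL (Splash 2.1+)
            yield SplashRequest(
                "http://example.com",
                self.parse,
                endpoint="render.html",
                args={"proxy": "http://host.docker.internal:8090"},
            )

        def parse(self, response):
            self.logger.info("rendered %s", response.url)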

hfxben commented 1 year ago

> I think the proxy feature works properly. Inside a Docker container, 127.0.0.1 does not refer to your machine's localhost by default; it refers to the container itself. Your proxy is running outside the container, so you may use host.docker.internal to reach it from inside the container.

Where is host.docker.internal supposed to be set up?
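
In case it helps later readers: host.docker.internal is not configured inside Splash at all. It is a DNS name Docker itself provides so a container can reach the host machine. It resolves out of the box on Docker Desktop for Mac and Windows; on Linux (Docker Engine 20.10+) you have to add the mapping yourself when starting the container, roughly:

    docker run -it -p 8050:8050 \
        --add-host=host.docker.internal:host-gateway \
        scrapinghub/splash

Any Lua script or proxy profile inside the container can then use host.docker.internal where 127.0.0.1 would otherwise point at the container itself.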