fkhan6601 opened this issue 5 years ago
I think the proxy feature works properly. In Docker containers, by default, 127.0.0.1 is not your machine's localhost. Your proxy is running outside of the container, so you may use host.docker.internal to access your proxy from inside the container.
ScrapySplash + proxy profiles == headache!
It would be very nice if someone could provide a simple example with one proxy IP, plus what should be set in the scrapy-splash request args['proxy']. I hate the guessing game, as it takes a long time. If not an example, could you please add some proper docs?
Thanks for the great plugin nevertheless.
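Not an authoritative answer, but here is a minimal sketch of the kind of example requested. The proxy address and target URL are placeholders (assuming a proxy reachable from the Splash container at host.docker.internal:8090). It builds the same 'proxy' argument two ways: as the args dict a scrapy-splash SplashRequest would take, and as the equivalent raw call to Splash's render.html endpoint.

```python
from urllib.parse import urlencode

# Placeholder values -- substitute your own proxy and target URL.
PROXY = 'http://host.docker.internal:8090'
TARGET = 'https://httpbin.org/ip'

# The args dict you would pass to
# scrapy_splash.SplashRequest(url=TARGET, callback=..., args=splash_args)
splash_args = {'proxy': PROXY}

# The equivalent raw HTTP API call: Splash's render endpoints accept a
# 'proxy' argument, either a proxy URL or a proxy-profile name.
render_url = 'http://localhost:8050/render.html?' + urlencode({'url': TARGET, **splash_args})
print(render_url)
```

If proxy profiles are mounted instead, `args={'proxy': 'profile-name'}` should select a profile by its .ini filename rather than passing a proxy URL.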
> I think the proxy feature works properly. In Docker containers, by default, 127.0.0.1 is not your machine's localhost. Your proxy is running outside of the container, so you may use host.docker.internal to access your proxy from inside the container.
Where is 'host.docker.internal' to be set up?
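Based on the earlier comment, host.docker.internal is not configured in Splash itself; it would replace 127.0.0.1 inside the proxy profile, since the profile is read from inside the container. A sketch, assuming the proxy listens on the host at port 8090:

```ini
[proxy]
; required
host=host.docker.internal
port=8090
```

Note that on Linux, host.docker.internal may need to be enabled explicitly (e.g. with `--add-host=host.docker.internal:host-gateway` on newer Docker versions), or the host's bridge IP used instead.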
I have a proxy running on localhost:8090 that works with Selenium. I am trying to get Splash to work, and the proxy is not being used at all. When the proxy is running, I can see all traffic through it. By setting Scrapy to proxy traffic, I can see the IP Splash is running on, so I know the proxy works. I need Splash to route its traffic through the proxy so I can reach the external page, but setting the proxy does not seem to have any effect.
Using Splash through the browser at port 8050 in a Docker container, per the docs, renders the page, but no traffic goes through the proxy, and the page still renders when the proxy is not running.
Using a Lua script with Scrapy, the page renders with or without the proxy running. spider.py:
settings.py:

```python
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 1

#####################################################################

BOT_NAME = 'recspider'

SPIDER_MODULES = ['recspider.spiders']
NEWSPIDER_MODULE = 'recspider.spiders'

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    'recspider.middlewares.RecspiderSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
}
```
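One thing that stands out in the settings above: the scrapy-splash README also requires its own downloader middlewares and a few extra settings, and without scrapy_splash.SplashMiddleware, SplashRequests are not routed through Splash at all, which would match the "proxy is never used" symptom. A sketch of the additional settings per the scrapy-splash README (the SPLASH_URL value is an assumption for a local container):

```python
# Additions to settings.py per the scrapy-splash README (a sketch, not
# verified against this project). SPLASH_URL assumes Splash runs locally
# on port 8050.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Required for correct request deduplication with Splash requests:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

These would need to be merged with the existing DOWNLOADER_MIDDLEWARES dict rather than replacing it.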
```ini
[proxy]
; required
host=127.0.0.1
port=8090
```
```shell
docker run -it -p 8050:8050 -v ~/Documents/proxy-profile:/etc/splash/proxy-profiles scrapinghub/splash --proxy-profiles-path=/etc/splash/proxy-profiles
```
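To check whether the mounted profile is actually picked up, one option is to hit Splash's HTTP API directly. This is a sketch: `proxy` here assumes the profile file is named proxy.ini, since Splash selects a profile by its filename without the extension.

```shell
# Ask Splash to render through the proxy profile named 'proxy'
# (i.e. /etc/splash/proxy-profiles/proxy.ini inside the container).
curl 'http://localhost:8050/render.html?url=https%3A%2F%2Fhttpbin.org%2Fip&proxy=proxy'
```

If this request still succeeds with the proxy stopped, the profile is not being applied.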