psf / requests-html

Pythonic HTML Parsing for Humans™
http://html.python-requests.org
MIT License

Make pyppeteer use proxies #266

Open oldani opened 5 years ago

oldani commented 5 years ago

If you're using proxies with requests-html, everything works fine as long as you don't render the page. Once you render a website, pyppeteer doesn't know about these proxies and will expose your IP. This is undesired behavior when scraping with proxies.

The idea is that whenever someone passes in proxies to the session object or any method call, make pyppeteer also use these proxies. #265
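A minimal sketch of that idea, assuming the proxies arrive as a requests-style dict whose URLs include a scheme, and that they would be handed to Chromium through its `--proxy-server` flag. The helper name `proxies_to_chromium_arg` is hypothetical and is not part of requests-html or pyppeteer. Credentials are dropped on purpose, since Chromium does not take a username and password through this flag.

```python
from urllib.parse import urlsplit


def proxies_to_chromium_arg(proxies):
    """Build a Chromium --proxy-server flag from a requests-style proxies dict.

    Hypothetical helper for illustration only. Assumes every proxy URL
    carries a scheme, e.g. "http://10.10.1.10:3128".
    """
    rules = []
    for scheme in ("http", "https"):
        url = proxies.get(scheme)
        if not url:
            continue
        parts = urlsplit(url)
        # Keep only host and port; any user:password part is discarded.
        host = parts.hostname
        if parts.port:
            host = f"{host}:{parts.port}"
        rules.append(f"{scheme}={parts.scheme}://{host}")
    # No usable entries means there is no flag to add.
    return f"--proxy-server={';'.join(rules)}" if rules else None


arg = proxies_to_chromium_arg({
    "http": "http://user:secret@10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
})
print(arg)
```

The resulting string could then be appended to the list given to `HTMLSession(browser_args=[...])`. Whether that alone is enough for a particular proxy setup is not something this sketch verifies.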

Bobspadger commented 5 years ago

This would be a good item to get fixed. Currently, when rendering, I have to stop using proxy servers.

oldani commented 5 years ago

I will take this on.

Bobspadger commented 5 years ago

cool thanks, I was going to take a look later but I'm not up on the whole async thing yet :)

ep4devops commented 5 years ago

I am in a very restrictive corporate network and have been experiencing many issues with Python and proxies since I started using requests-html. My goal is to scrape a Cisco site that returns a lot of its HTML via JS, so I have to use the render functionality.

1st (solved manually) The initial Chromium download by pyppeteer does not use proxies, so I had to download it manually and check where pyppeteer expects it to be:

python -c 'import pyppeteer; print(pyppeteer.chromium_downloader.chromiumExecutable)'

>>'win64': WindowsPath('C:/Users/XXX/AppData/Local/pyppeteer/pyppeteer/local-chromium/575458/chrome-win32/chrome.exe'

2nd (solved manually) Chromium does not accept Auth+Password given to --proxy-server="XXX" arg, see here

Now I am starting chromium with session = HTMLSession(browser_args=['--no-sandbox', '--proxy-pac-url="http://XXX/XXX.pac"']) while using the Proxy Auto Auth addon for chromium...

Start chrome.exe with the --proxy-pac-url="http://XXX/XXX.pac" argument, enter your credentials and install the Proxy Auto Auth addon. Restart chrome.exe with the arguments and check if you can use it without any proxy auth.

3rd (not solved yet) The render function does not use my proxy:

req = session.get(url=url, proxies=proxyDict, verify=False)
req.html.render()

pyppeteer.errors.PageError: net::ERR_NAME_NOT_RESOLVED at <URL>

I would be very happy if this can be solved ...

FlyingZebra1 commented 5 years ago

+1 On this being an amazing thing to get resolved.

predicador37 commented 5 years ago

Is there any news about this issue? Scraping behind corporate proxies is impossible right now... Is any progress planned on this? Thank you

lauevrar77 commented 4 years ago

Is there any news on this? I saw this commit but don't know if it is the expected patch: https://github.com/psf/requests-html/pull/396

In my opinion, the best solution would be to be able to use proxies in the same way as requests does (from env or dict). Is this possible at this time?
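The "from env or dict" lookup can be sketched with the standard library alone. This assumes an explicitly passed dict should win over the `*_proxy` environment variables on a per-scheme basis. The function name `resolve_proxies` is hypothetical and does not exist in requests-html.

```python
from urllib.request import getproxies_environment


def resolve_proxies(explicit=None):
    """Merge proxies from the environment with an explicit dict.

    Hypothetical helper: values in `explicit` override entries read from
    environment variables such as http_proxy / https_proxy.
    """
    merged = dict(getproxies_environment())
    merged.update(explicit or {})
    return merged


# With no argument, only the environment is consulted.
print(resolve_proxies())
# An explicit entry replaces the environment value for that scheme.
print(resolve_proxies({"https": "http://10.10.1.10:1080"}))
```

The merged dict would be the single source of truth for both the HTTP request and the rendering step, so the two cannot disagree about which proxy is in use.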

MrIdjit commented 4 years ago

How is this going? I would like to know how I can use socks5 proxies with requests-html... and the .render() function.

Bobspadger commented 3 years ago

bump? any updates?

kiriharu commented 3 years ago

bump

W-Booth commented 2 years ago

bump

killerdevildog11 commented 2 years ago

any updates?

andrewshrout commented 2 years ago

any updates?

killerdevildog11 commented 2 years ago

I have used Selenium as an alternative; however, it is a lot slower.