seleniumbase / SeleniumBase

📊 Python's all-in-one framework for web crawling, scraping, testing, and reporting. Supports pytest. UC Mode provides stealth. Includes many tools.
https://seleniumbase.io
MIT License

SeleniumBase open_page func is leaking real IP. #2148

Closed OpsecGuy closed 1 year ago

OpsecGuy commented 1 year ago

Hello, I was running some tests with a proxied SeleniumBase Chrome driver and ran into an issue where my real IP was leaked when calling the open_page function.

Here is the code I use to create the Chrome driver (self.proxy returns e.g. 1.1.1.1:80):

def spawn_browser(self) -> Any:
    try:
        driver = Driver(
            browser="chrome",
            headless=None,  # DEF: False
            headless2=self.headless,
            proxy=self.proxy,  # None means no proxy
            agent=self.user_agent,
            incognito=True,
            dark_mode=True,
            devtools=False,
            uc=True,
            extension_dir=self.extension_dir,
        )
        return driver
    except Exception:
        Log().warn(self.identifier, self.get_class_name(), 'Failed to spawn browser', 72)

When I call open_page() or get(), an additional request is made with the requests library from my non-proxied IP address.

NGINX LOGS: My IP starts with 31 and ends with 19. The proxy IP starts with 172 and ends with 138.

MY SCRIPT LOGS:

I looked through the SeleniumBase code a bit, and it looks like open_page() or get() uses this function: https://github.com/seleniumbase/SeleniumBase/blob/1e219cef8ae5d0ca3570b17197d4d0c96c8ee801/seleniumbase/core/browser_launcher.py#L173

FIX

While writing this issue report I managed to find out what's wrong: the problem is in the get_proxy_info() function. Here is the version that fixed the issue for me:

def get_proxy_info():
    use_proxy = None
    protocol = "http"
    proxy_string = None
    user_and_pass = None
    if "--proxy=" in str(*sys.argv):
        from seleniumbase.core import proxy_helper
        for arg in sys.argv:
            if arg.startswith("--proxy="):
                proxy_string = arg.split("--proxy=")[1]
                if "@" in proxy_string:
                    # Format => username:password@hostname:port
                    try:
                        user_and_pass = proxy_string.split("@")[0]
                        proxy_string = proxy_string.split("@")[1]
                    except Exception:
                        raise Exception(
                            "The format for using a proxy server with auth "
                            'is: "username:password@hostname:port". If not '
                            'using auth, the format is: "hostname:port".'
                        )
                if proxy_string.endswith(":443"):
                    protocol = "https"
                elif "socks4" in proxy_string:
                    protocol = "socks4"
                elif "socks5" in proxy_string:
                    protocol = "socks5"
                proxy_string = proxy_helper.validate_proxy_string(proxy_string)
                if user_and_pass:
                    proxy_string = "%s@%s" % (user_and_pass, proxy_string)
                use_proxy = True
                break
    return (use_proxy, protocol, proxy_string)

This worked for me, and it took about 5 minutes to fix, so someone should check whether the change affects other parts of the project's code.

// I am not sure why there was an additional leading space in sys.argv (' --proxy=1.1.1.1:22'), but if it matters, just add it back.
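
One detail of the credential handling in the function above: split("@")[0] and split("@")[1] assume the proxy string contains exactly one "@". A minimal sketch of the same step using rpartition, which splits on the last "@" only, might look like this. The helper name split_proxy_auth is hypothetical and is not part of SeleniumBase.

```python
def split_proxy_auth(proxy_string):
    """Split 'username:password@hostname:port' into (user_and_pass, host_port).

    Returns (None, proxy_string) when no credentials are present.
    """
    if "@" not in proxy_string:
        return None, proxy_string
    # rpartition splits on the LAST "@", so an "@" inside the password
    # stays with the credentials instead of being cut off.
    user_and_pass, _, host_port = proxy_string.rpartition("@")
    return user_and_pass, host_port


print(split_proxy_auth("1.1.1.1:80"))
print(split_proxy_auth("user:pass@1.1.1.1:80"))
print(split_proxy_auth("user:p@ss@1.1.1.1:80"))
```

With a password such as "p@ss", split("@")[1] would return "ss" as the host part, while the rpartition version keeps "1.1.1.1:80".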

mdmintz commented 1 year ago

Your changes led to errors:

 103  ->     if "--proxy=" in str(*sys.argv):
 104             from seleniumbase.core import proxy_helper
 105             for arg in sys.argv:
 106                 if arg.startswith("--proxy="):
 107                     proxy_string = arg.split("--proxy=")[1]
 108                     if "@" in proxy_string:
 109                         # Format => username:password@hostname:port
 110                         try:
 111                             user_and_pass = proxy_string.split("@")[0]
 112                             proxy_string = proxy_string.split("@")[1]
 113                         except Exception:
 114                             raise Exception(
 115     ...
TypeError: str() takes at most 3 arguments (4 given)
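
The TypeError comes from str(*sys.argv): the star unpacks every command-line argument into a separate positional argument of str(), which accepts at most three. With a single-element sys.argv the call succeeds, which may be why the change appeared to work when the script was run without extra arguments. The sketch below reproduces both cases and shows one possible check that works for any argument count. The function names and the sample argument lists are illustrative only.

```python
def has_proxy_arg(argv):
    """Return True if any command-line argument starts with '--proxy='."""
    return any(arg.startswith("--proxy=") for arg in argv)


def proxy_check_from_issue(argv):
    """The condition as written in the proposed change."""
    return "--proxy=" in str(*argv)


one_arg = ["script.py"]
many_args = ["pytest", "test_x.py", "--proxy=1.1.1.1:80", "--uc"]

# One argument: equivalent to str("script.py"), so no error is raised.
print(proxy_check_from_issue(one_arg))

# Four arguments: str() receives four positional arguments and raises.
try:
    proxy_check_from_issue(many_args)
except TypeError as e:
    print("TypeError:", e)

# Checking each argument individually works regardless of the count.
print(has_proxy_arg(many_args))
```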

mdmintz commented 1 year ago

Also, the browser proxy is working as expected. The requests library may be used to check whether a website is returning a 403, but in UC Mode the browser then reaches the site through the proxy, with the proxy's IP.
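
For context on why such a direct check shows the real IP: a requests call only goes through a proxy when one is supplied, either through its proxies argument or through proxy environment variables. The sketch below builds a mapping in the shape that argument accepts from the same hostname:port string. The helper build_requests_proxies is hypothetical and not part of SeleniumBase, and the sketch does not claim this is how SeleniumBase should make its status check.

```python
def build_requests_proxies(proxy_string, protocol="http"):
    """Build a mapping in the shape accepted by the `proxies` argument of requests.

    proxy_string is 'hostname:port' or 'username:password@hostname:port'.
    """
    proxy_url = "%s://%s" % (protocol, proxy_string)
    # Both plain-HTTP and HTTPS targets are sent through the same proxy.
    return {"http": proxy_url, "https": proxy_url}


proxies = build_requests_proxies("1.1.1.1:80")
print(proxies)
# With requests installed, the mapping could be passed as:
#     requests.get(url, proxies=proxies, timeout=10)
```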

mdmintz commented 1 year ago

For that part of the code: if you use --proxy-driver, the proxy settings from --proxy=PROXY are used to download chromedriver / uc_driver, etc. That section is not related to the proxy used by the Selenium browser. If you use --proxy=PROXY without --proxy-driver, the driver is downloaded without the proxy, but the browser still uses the proxy settings.
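
The rule described above can be written as a small decision function. This is only an illustration of the described behavior, not SeleniumBase's actual implementation; driver_download_proxy is a hypothetical name, and only the --proxy=PROXY and --proxy-driver flags come from the discussion.

```python
def driver_download_proxy(argv):
    """Return the proxy to use for downloading the driver, or None.

    The proxy from --proxy=PROXY applies to the driver download only
    when --proxy-driver is also present. The browser's own proxy
    setting is a separate matter and is not modelled here.
    """
    proxy = None
    for arg in argv:
        if arg.startswith("--proxy="):
            proxy = arg.split("--proxy=", 1)[1]
            break
    if proxy and "--proxy-driver" in argv:
        return proxy
    return None


print(driver_download_proxy(["pytest", "--proxy=1.1.1.1:80"]))
print(driver_download_proxy(["pytest", "--proxy=1.1.1.1:80", "--proxy-driver"]))
```

The first call returns None (the driver is downloaded directly), and the second returns the proxy string.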