spinlud / py-linkedin-jobs-scraper

MIT License

add support for proxy auth - timeout exception #18

Open dudehacker opened 3 years ago

dudehacker commented 3 years ago

I tried using your library with proxy auth, which creates an extension and adds it to the Chrome options.

However, I get a TimeoutException because of this call in linkedin_scrappy.py:

driver = build_driver(
    executable_path=self.chrome_executable_path,
    options=self.chrome_options,
    headless=self.headless,
    timeout=120  # what I need to add to make the proxy work; the default of 20 is not enough
)

So it would be nice if the LinkedinScraper constructor accepted a WebDriver argument (which would allow using the Selenium-Wire library for proxy auth) or a timeout argument.

Additionally, it would be nice to let the user pass the per-job timeout used by anonymous_strategy.py:

def __load_job_details(driver: webdriver, selectors: Selectors, job_id: str, timeout=2) -> object:
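A minimal sketch of what the two requests could look like. This is not the library's actual API: `ScraperSettings`, `driver_timeout`, `job_timeout`, and `resolve_driver` are hypothetical names, and `default_builder` stands in for the `build_driver` call quoted above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ScraperSettings:
    """Hypothetical settings object; these names are not part of the library."""
    driver_timeout: float = 20  # driver timeout, matching the default of 20 mentioned above
    job_timeout: float = 2      # per-job timeout, matching the timeout=2 default above


def resolve_driver(
    settings: ScraperSettings,
    default_builder: Callable[..., Any],
    driver: Optional[Any] = None,
) -> Any:
    """Return the caller's driver if one was passed, else build one with the configured timeout."""
    if driver is not None:
        # A caller-supplied driver (for example a Selenium-Wire one) is used untouched.
        return driver
    return default_builder(timeout=settings.driver_timeout)


if __name__ == "__main__":
    # Stand-in builder that only records the timeout it received.
    fake_builder = lambda timeout: {"timeout": timeout}
    print(resolve_driver(ScraperSettings(driver_timeout=120), fake_builder))
```

With something like this, a slow authenticated proxy would only need `driver_timeout=120`, and `job_timeout` could be forwarded to `__load_job_details` instead of the hard-coded default.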

dudehacker commented 3 years ago

One way to use proxy auth with Chrome is through an extension, which does not work in headless mode: chrome_options.add_extension(createProxyZip(PROXY_HOST, PROXY_PORT, PROXY_USER, PROXY_PASS))

import zipfile

def createProxyZip(PROXY_HOST, PROXY_PORT, PROXY_USER, PROXY_PASS):
    manifest_json = """
            {
                "version": "1.0.0",
                "manifest_version": 2,
                "name": "Chrome Proxy",
                "permissions": [
                    "proxy",
                    "tabs",
                    "unlimitedStorage",
                    "storage",
                    "<all_urls>",
                    "webRequest",
                    "webRequestBlocking"
                ],
                "background": {
                    "scripts": ["background.js"]
                },
                "minimum_chrome_version":"22.0.0"
            }
            """

    background_js = """
    var config = {
            mode: "fixed_servers",
            rules: {
              singleProxy: {
                scheme: "http",
                host: "%(host)s",
                port: parseInt(%(port)d)
              },
              bypassList: []
            }
          };
    chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
    function callbackFn(details) {
        return {
            authCredentials: {
                username: "%(user)s",
                password: "%(pass)s"
            }
        };
    }
    chrome.webRequest.onAuthRequired.addListener(
                callbackFn,
                {urls: ["<all_urls>"]},
                ['blocking']
    );
        """ % {
            "host": PROXY_HOST,
            "port": PROXY_PORT,
            "user": PROXY_USER,
            "pass": PROXY_PASS,
        }

    pluginfile = 'proxy_auth_plugin.zip'

    with zipfile.ZipFile(pluginfile, 'w') as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)

    return pluginfile

The other way, which I prefer, is to use selenium-wire: https://github.com/wkeeling/selenium-wire
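A rough sketch of the selenium-wire route, assuming its documented `seleniumwire_options` proxy format. The helper name `build_seleniumwire_options` and the example host and credentials are made up for illustration. Percent-encoding the credentials and using one scheme for both the `http` and `https` entries are assumptions that may need adjusting for a given proxy.

```python
from urllib.parse import quote


def build_seleniumwire_options(host: str, port: int, user: str, password: str,
                               scheme: str = "http") -> dict:
    """Build a `seleniumwire_options` dict for an authenticated upstream proxy."""
    # Percent-encode the credentials so characters such as '@' or ':' do not break the URL
    # (assumption: the proxy URL is parsed as a standard URL).
    auth = f"{quote(user, safe='')}:{quote(password, safe='')}"
    proxy_url = f"{scheme}://{auth}@{host}:{port}"
    return {
        "proxy": {
            "http": proxy_url,
            "https": proxy_url,
            "no_proxy": "localhost,127.0.0.1",
        }
    }


if __name__ == "__main__":
    # Hypothetical proxy details, for illustration only.
    print(build_seleniumwire_options("proxy.example.com", 8080, "user", "p@ss"))
```

The resulting dict would be passed as `webdriver.Chrome(seleniumwire_options=...)` after `from seleniumwire import webdriver`. Using that driver with this library still depends on LinkedinScraper accepting a WebDriver, as requested above.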