typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html
Other
95 stars 35 forks source link

Upgrade selenium to 4.15.2 #54

Open CodeSandwich opened 8 months ago

CodeSandwich commented 8 months ago

Change Summary

Upgrades Selenium.

The change to how CHROMEDRIVER_PATH is handled comes from a weird behavior of Chrome. Setting chrome_options.binary_location to an actual path, e.g. the default /usr/bin/chromedriver causes this error:

E       selenium.common.exceptions.SessionNotCreatedException: Message: session not created: Chrome failed to start: exited normally.
E         (session not created: DevToolsActivePort file doesn't exist)
E         (The process started from chrome location /usr/bin/chromedriver is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

Based on this Selenium issue: https://github.com/SeleniumHQ/selenium/issues/12973 it seems to be coming from this unsolved issue in Chromium: https://bugs.chromium.org/p/chromedriver/issues/detail?id=4403, I've tried the proposed workarounds and they didn't work. Weirdly, this issue doesn't occur when chrome_options.binary_location is not set or is set to a file that is not a chromedriver binary. The default value didn't matter that much anyway, because Selenium somehow finds the binary under usr/bin automatically. This will work as expected as soon as Chromium fixes the issue.

PR Checklist

CodeSandwich commented 8 months ago

(Sorry for the rambly comment and the chaotic PR, I'm learning about Selenium as I'm going :sweat_smile:, I hope that it's going to be useful)

It actually seems to be working fine, but CHROMEDRIVER_PATH should be used differently. I pointed it at the binary installed on my machine named chromedriver (doing I'm not sure what), but I should point it just at chrome or chromium, it then works fine, all the tests are passing.

The change to how lack of CHROMEDRIVER_PATH is handled is still valid, Selenium picks chrome from /usr/bin/ without any configuration.

I don't know if Selenium changed the browser flavor it works with, they don't seem to keep the selenium-manager changelog. (It seems like the drivers are now built-in into Selenium, there probably no longer is any need to use a driver path: https://www.selenium.dev/documentation/legacy/selenium_2/faq/#q-what-is-selenium-20) The --help for the current version of selenium-manager does clearly demand a browser path and not a driver in --browser-path:

      --browser-path <BROWSER_PATH>
          Browser path (absolute) for browser version detection (e.g., /usr/bin/google-chrome, "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome", "C:\Program Files\Google\Chrome\Application\chrome.exe")

This is the option Selenium sets when binary_location is used: https://github.com/SeleniumHQ/selenium/blob/a24a189764316832503851c8f0e58d77d3779a6d/py/selenium/webdriver/common/selenium_manager.py#L92-L95, the Python wrapper doesn't seem to set --driver anywhere in its code except for a test setup. This is the only option in selenium-manager that could require a driver, but it still wouldn't be a path, just the name:

      --driver <DRIVER>
          Driver name (chromedriver, geckodriver, msedgedriver, IEDriverServer, or safaridriver)

Should CHROMEDRIVER_PATH be changed to just CHROME_PATH? It would be a breaking change.