ultrafunkamsterdam / undetected-chromedriver

Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / CloudFlare IUAM)
https://github.com/UltrafunkAmsterdam/undetected-chromedriver
GNU General Public License v3.0
9.26k stars 1.1k forks

On Windows, multiprocess sometimes won't let drivers start due to a PermissionError in the patcher #479

Closed opqpop closed 2 years ago

opqpop commented 2 years ago

I use multiprocess to start an undetected_chromedriver instance in each process so that I can parallelize the work.

However, on my Windows computer (it never happens on Mac), some processes sometimes fail to start the Chrome driver. I believe this is due to a PermissionError: the patcher cannot open the downloaded and unzipped file. Here's a stack trace:

  File "C:\Users\Mark\scraper\venv\lib\site-packages\undetected_chromedriver\__init__.py", line 208, in __init__
    patcher.auto()
  File "C:\Users\Mark\scraper\venv\lib\site-packages\undetected_chromedriver\patcher.py", line 121, in auto
    self.unzip_package(self.fetch_package())
  File "C:\Users\Mark\scraper\venv\lib\site-packages\undetected_chromedriver\patcher.py", line 175, in unzip_package
    zf.extract(self.exe_name, os.path.dirname(self.executable_path))
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.2800.0_x64__qbz5n2kfra8p0\lib\zipfile.py", line 1616, in extract
    return self._extract_member(member, path, pwd)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.2800.0_x64__qbz5n2kfra8p0\lib\zipfile.py", line 1687, in _extract_member
    open(targetpath, "wb") as target:
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\Mark\\appdata\\roaming\\undetected_chromedriver\\chromedriver.exe'
2022-02-03 01:10:09 [twisted] CRITICAL:

The first process that calls this always goes through and its Chrome window pops up, but subsequent processes sometimes never open a window and exit with this error.

Any ideas what I can do to fix this? I've tried setting browser_executable_path, but the patcher still seems to keep trying to patch. Should I try to disable the patcher and always have it use the latest chromedriver downloaded in my local folder? Let me know if there's anything else I can provide or help out with. Love this project, and it's critical to some of our workload!

Code


import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument("--disable-extensions")
options.add_argument("--disable-popup-blocking")
options.add_argument("--profile-directory=Default")
options.add_argument("--ignore-certificate-errors")
options.add_argument("--disable-plugins-discovery")

dr = uc.Chrome(
    options=options,
    # Tried adding this, but it didn't do anything. I think that's because it only sets
    # the browser binary option in the uc init, while the patcher continues to run.
    # browser_executable_path=cfgutil.get_chrome_driver_path(),
)

matukaking commented 2 years ago

Hello, have you solved the multiprocessing problem? I still need some help with it :/

sebdelsol commented 2 years ago

You should create the drivers before launching your threads. Creating a driver is not thread-safe: until one chromedriver is actually running, every driver instantiation deletes chromedriver.exe, downloads it again from googleapis.com, and patches it with random values.

from concurrent.futures import ThreadPoolExecutor, as_completed
import undetected_chromedriver as webdriver

def do_something_with(driver):
    print("use the driver...")
    return "result"

if __name__ == "__main__":
    N_JOBS = 3
    # build a list (not a lazy generator) so every driver exists before any thread starts
    drivers = [webdriver.Chrome() for _ in range(N_JOBS)]

    with ThreadPoolExecutor(max_workers=N_JOBS) as executor:
        futures = {
            executor.submit(do_something_with, driver): driver 
            for driver in drivers
        }
        for future in as_completed(futures):
            print(future.result())
            futures[future].quit()  # driver.quit()
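
If the drivers cannot all be created up front, a different way to avoid the race is to serialize only the creation step with a lock. This is not what the comment above proposes; it is a minimal sketch of that alternative, assuming all workers are threads in the same process. The names `create_serialized` and `run_jobs` are hypothetical, and the factory is a stand-in rather than a real driver.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Guards the non-thread-safe creation step (hypothetical name).
_creation_lock = threading.Lock()


def create_serialized(factory):
    """Call factory() while holding the lock, so only one creation runs at a time."""
    with _creation_lock:
        return factory()


def run_jobs(factory, work, n_jobs):
    """Create one resource per worker thread (serialized), then do the work in parallel."""

    def job(index):
        resource = create_serialized(factory)
        return work(resource, index)

    with ThreadPoolExecutor(max_workers=n_jobs) as executor:
        # executor.map returns results in submission order.
        return list(executor.map(job, range(n_jobs)))


if __name__ == "__main__":
    # Stand-in factory; with the library this might be `webdriver.Chrome`
    # (an assumption, not exercised here).
    results = run_jobs(factory=object, work=lambda res, i: f"job {i} done", n_jobs=3)
    print(results)
```

A `threading.Lock` only coordinates threads inside one process. Separate processes would need some cross-process mechanism instead, which this sketch does not cover.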
sebdelsol commented 2 years ago

If you really need to instantiate drivers in your threads or processes, you want to patch chromedriver.exe first, then make it read-only so that any further driver instantiations won't patch it:

import os
import stat
import undetected_chromedriver as webdriver

patcher = webdriver.Patcher()
# add write permission to chromedriver.exe (if it already exists) and patch it
if os.path.exists(patcher.executable_path):
    os.chmod(patcher.executable_path, stat.S_IREAD | stat.S_IWRITE)
patcher.auto()
# make chromedriver.exe read-only to prevent any further patch
os.chmod(patcher.executable_path, stat.S_IREAD)

# then launch your threads that instantiate drivers
opqpop commented 2 years ago

Thanks. I've since found another problem on the Mac I'm using with 12 processes: out of the 12, sometimes 0-4 drivers start and just sit there doing nothing because they ran into this issue:

Traceback (most recent call last):
  File "/Users/markx/undetected-chromedriver/undetected_chromedriver/__init__.py", line 375, in __init__
    super(Chrome, self).__init__(
  File "/Users/markx/scraper/venv/lib/python3.9/site-packages/selenium/webdriver/chrome/webdriver.py", line 70, in __init__
    super(WebDriver, self).__init__(DesiredCapabilities.CHROME['browserName'], "goog",
  File "/Users/markx/scraper/venv/lib/python3.9/site-packages/selenium/webdriver/chromium/webdriver.py", line 90, in __init__
    self.service.start()
  File "/Users/markx/scraper/venv/lib/python3.9/site-packages/selenium/webdriver/common/service.py", line 98, in start
    self.assert_process_still_running()
  File "/Users/markx/scraper/venv/lib/python3.9/site-packages/selenium/webdriver/common/service.py", line 110, in assert_process_still_running
    raise WebDriverException(
selenium.common.exceptions.WebDriverException: Message: Service /Users/markx/Library/Application Support/undetected_chromedriver/chromedriver unexpectedly exited. Status code was: -9

I was able to fix this by modifying the uc driver init to skip the patching and just use the path of an already patched chromedriver.

That was before I saw your solution above of doing an initial patch and locking the file to prevent the processes from doing their own patches. However, I just tried that solution and for some reason I'm still getting this error code. Any ideas?

I will keep digging to check whether making it read-only truly prevents the patcher from doing anything. Thanks for the help so far.

opqpop commented 2 years ago

Re: passing the drivers to the processes, I tried that in the past and it failed; I ended up concluding, based on some folks' suggestions, that webdrivers aren't picklable. It always just crashes when I try to pass a driver as a parameter. However, I haven't tried your suggested solution yet to see whether it gets around this problem. I'll give it a shot soon.
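
For context on the pickling failure: arguments sent to another process are serialized with pickle, and objects that own live OS-level resources generally cannot be serialized. The sketch below illustrates this with a hypothetical `FakeDriver` class that holds a lock. It is an assumption that real webdriver objects fail for a comparable reason; the sketch does not touch the library itself.

```python
import pickle
import threading


class FakeDriver:
    """Hypothetical stand-in for an object that owns a live OS-level resource."""

    def __init__(self):
        # A lock cannot be pickled, much like an open socket or a process handle.
        self._lock = threading.Lock()


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


if __name__ == "__main__":
    print(can_pickle({"url": "https://example.com"}))  # plain data
    print(can_pickle(FakeDriver()))  # holds an unpicklable lock
```

Threads share memory, so nothing is pickled when an object is handed to a thread. That is why the ThreadPoolExecutor example earlier in this thread can pass drivers directly to its workers.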

opqpop commented 2 years ago

"if you really need to instantiate drivers in your threads or processes, you want to patch chromedriver.exe first then make it read-only so that any further driver instantiations won't patch it"

Hmm, it seems that making it read-only doesn't prevent the unlink from happening (it doesn't raise a PermissionError; it just deletes the file and keeps going):

(Pdb) os.stat(patcher.executable_path)
os.stat_result(st_mode=33024, st_ino=249507007, st_dev=16777222, st_nlink=1, st_uid=501, st_gid=20, st_size=16498136, st_atime=1644116290, st_mtime=1644116290, st_ctime=1644116290)
(Pdb) os.unlink(patcher.executable_path)
(Pdb)

EDIT: fixed. Deletes are controlled by write permission on the parent directory (https://unix.stackexchange.com/a/451427), so I need to do this:

(Pdb) par = os.path.dirname(patcher.executable_path)
(Pdb) os.chmod(par, stat.S_IREAD)
(Pdb) os.unlink(patcher.executable_path)
*** PermissionError: [Errno 13] Permission denied: '/Users/markx/Library/Application Support/undetected_chromedriver/chromedriver'

Confirmed, this also fixed the Windows issue, thanks!
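
The pdb finding above can be reproduced without the library. The following is a minimal sketch using a temporary directory and a dummy file; the function name `unlink_blocked_by_parent` and the file name are hypothetical. It keeps the search (execute) bit on the directory, which differs from the `stat.S_IREAD`-only call in the pdb session, so that files inside the directory stay reachable.

```python
import os
import stat
import tempfile


def unlink_blocked_by_parent(lock_parent):
    """Create a read-only file in a temp dir and report whether unlink is refused.

    If lock_parent is True, write permission is removed from the directory first.
    """
    with tempfile.TemporaryDirectory() as parent:
        target = os.path.join(parent, "fake_driver")
        with open(target, "wb") as fh:
            fh.write(b"binary")
        os.chmod(target, stat.S_IRUSR)  # read-only file, like st_mode 33024
        if lock_parent:
            # Keep read + search (execute) so files inside remain reachable.
            os.chmod(parent, stat.S_IRUSR | stat.S_IXUSR)
        try:
            os.unlink(target)
            blocked = False
        except PermissionError:
            blocked = True
        finally:
            # Restore permissions so the temporary directory can be cleaned up.
            os.chmod(parent, stat.S_IRWXU)
            if os.path.exists(target):
                os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
        return blocked


if __name__ == "__main__":
    print(stat.filemode(33024))  # decodes the st_mode from the pdb session
    print("blocked with writable parent:", unlink_blocked_by_parent(False))
    print("blocked with locked parent:", unlink_blocked_by_parent(True))
```

On a POSIX system, run as a non-root user, the first call should report False and the second True. Root bypasses these permission checks, and Windows handles the read-only attribute differently, so results there can differ.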

sebdelsol commented 2 years ago

Good catch. But on Windows, changing the parent directory's permissions doesn't prevent the driver from being deleted. Here is a more generic solution that checks the OS:

import os
import stat
import undetected_chromedriver as webdriver

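def lock_file(filename, lock=True):
    permissions = stat.S_IREAD
    if not lock:
        permissions |= stat.S_IWRITE

    if webdriver.IS_POSIX:
        # on POSIX, deleting a file is governed by its parent directory's permissions
        filename = os.path.dirname(filename)
        # a directory also needs the execute (search) bit, or the files inside become unreachable
        permissions |= stat.S_IEXEC
    if os.path.exists(filename):
        os.chmod(filename, permissions)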

if __name__ == "__main__":
    patcher = webdriver.Patcher()
    # unlock chromedriver.exe and patch it
    lock_file(patcher.executable_path, lock=False)
    patcher.auto()
    # lock chromedriver.exe and monkey patch Patcher
    # to prevent the patcher from reading or writing it
    lock_file(patcher.executable_path)
    webdriver.Patcher.is_binary_patched = lambda self: True

    # now you can instantiate and use drivers in threads without random PermissionError;
    # instantiation is way faster since there are no more download and patch operations.

The only caveat is that if you run a lot of chromedriver.exe concurrently with the same signature you might end up being detected. In that case you would need to patch a different chromedriver.exe for each thread, and I don't see a solution for that without modifying patcher.py or monkey patching Patcher.

EDIT: added a monkey patch of Patcher.is_binary_patched to prevent any further reads of chromedriver.exe. The reason is explained in my later comment below.

opqpop commented 2 years ago

"The only caveat is that if you run a lot of chromedriver.exe concurrently with the same signature you might end up being detected. In that case you would need to patch a different chromedriver.exe for each thread, and I don't see a solution for that without modifying patcher.py or monkey patching Patcher."

That's a very interesting point! I'd definitely want this so will give it a try. Just to make sure I understand, if I'm doing say 12 processes, I'd patch 12 times, but make sure they put the chromedriver in a different path to not overwrite each other. Now they will each use a different signature because they are using a different chromedriver.

And because they no longer share the file, there would not be a need to do the locking solution above anymore, since hopefully the "chromedriver unexpectedly exited. Status code was: -9" and "PermissionError: [Errno 13] Permission denied:" errors would no longer happen.

Does my understanding sound correct? I'll play around with this approach today.

sebdelsol commented 2 years ago

That's exactly the idea, and it would be even better to download the zip only once. I'm not sure it's worth it though :sweat_smile:
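
The file-layout half of this idea can be sketched independently of the library: extract the driver once, then give every worker its own copy in its own directory so that no two processes touch the same file. The helper name `make_worker_copies` and the directory naming are hypothetical. How each copy is then patched and handed to a driver depends on the undetected_chromedriver version and is deliberately not shown.

```python
import os
import shutil
import tempfile


def make_worker_copies(source_binary, n_workers, base_dir):
    """Copy source_binary into one sub-directory per worker and return the paths."""
    paths = []
    for index in range(n_workers):
        worker_dir = os.path.join(base_dir, f"worker_{index}")
        os.makedirs(worker_dir, exist_ok=True)
        destination = os.path.join(worker_dir, os.path.basename(source_binary))
        # copy2 also copies the permission bits, so the copy stays executable.
        shutil.copy2(source_binary, destination)
        paths.append(destination)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        # Stand-in for a chromedriver extracted once from the downloaded zip.
        source = os.path.join(base, "chromedriver")
        with open(source, "wb") as fh:
            fh.write(b"placeholder")
        for path in make_worker_copies(source, 3, base):
            print(path)
```

With one path per worker, the shared-file locking discussed above would no longer be needed, at the cost of one patch operation per worker.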

sebdelsol commented 2 years ago

Oh, I've found another PermissionError. When it fails to delete chromedriver.exe, the patcher calls Patcher.is_binary_patched() to check whether it has already been patched, and only then does it return without further ado. But that method has to open chromedriver.exe to read it, and doing so concurrently randomly raises a PermissionError. The patcher then thinks it has to patch the driver again and fails because chromedriver.exe is now read-only. To avoid that, Patcher.is_binary_patched() has to be monkey patched to always return True after the first patch has been applied:

Patcher.is_binary_patched = lambda self: True

I’ve edited this answer’s code to fix it.