Closed opqpop closed 2 years ago
Hello, have you solved the multiprocessing problem? I still need some help with it :/
You should create the drivers before launching your threads. Creating a driver is not thread-safe: until one chromedriver is actually running, every driver instantiation deletes chromedriver.exe, downloads it again from googleapis.com, and patches it with random values.
from concurrent.futures import ThreadPoolExecutor, as_completed
import undetected_chromedriver as webdriver

def do_something_with(driver):
    print("use the driver...")
    return "result"

if __name__ == "__main__":
    N_JOBS = 3
    drivers = (webdriver.Chrome() for _ in range(N_JOBS))
    with ThreadPoolExecutor(max_workers=N_JOBS) as executor:
        futures = {
            executor.submit(do_something_with, driver): driver
            for driver in drivers
        }
        for future in as_completed(futures):
            print(future.result())
            futures[future].quit()  # driver.quit()
If you really need to instantiate drivers in your threads or processes, patch chromedriver.exe first, then make it read-only so that further driver instantiations won't patch it again:
import os
import stat
import undetected_chromedriver as webdriver
patcher = webdriver.Patcher()
# add write permission to chromedriver.exe and patch it
os.chmod(patcher.executable_path, stat.S_IREAD + stat.S_IWRITE)
patcher.auto()
# make chromedriver.exe read-only to prevent any further patch
os.chmod(patcher.executable_path, stat.S_IREAD)
# then launch your threads that instantiate drivers
Thanks, I've since found another problem on the 12-process Mac I'm using: out of the 12, sometimes 0-4 drivers start and then just sit there doing nothing, because they ran into this error:
Traceback (most recent call last):
  File "/Users/markx/undetected-chromedriver/undetected_chromedriver/__init__.py", line 375, in __init__
    super(Chrome, self).__init__(
  File "/Users/markx/scraper/venv/lib/python3.9/site-packages/selenium/webdriver/chrome/webdriver.py", line 70, in __init__
    super(WebDriver, self).__init__(DesiredCapabilities.CHROME['browserName'], "goog",
  File "/Users/markx/scraper/venv/lib/python3.9/site-packages/selenium/webdriver/chromium/webdriver.py", line 90, in __init__
    self.service.start()
  File "/Users/markx/scraper/venv/lib/python3.9/site-packages/selenium/webdriver/common/service.py", line 98, in start
    self.assert_process_still_running()
  File "/Users/markx/scraper/venv/lib/python3.9/site-packages/selenium/webdriver/common/service.py", line 110, in assert_process_still_running
    raise WebDriverException(
selenium.common.exceptions.WebDriverException: Message: Service /Users/markx/Library/Application Support/undetected_chromedriver/chromedriver unexpectedly exited. Status code was: -9
I was able to fix this by modifying uc driver init to skip the patching, and just use an already patched chromedriver path.
This was before I saw your solution above of doing an initial patch and then locking the file so the processes can't apply their own patches. However, I just tried that solution and for some reason I'm still getting this error code. Any ideas?
I will continue digging to check whether making it read-only truly prevents the patcher from doing anything. Thanks for the help so far.
Re: passing the drivers in to the processes, I've tried that in the past and failed, and ended up concluding that webdrivers aren't picklable, per some folks' suggestions. It always just crashes when trying to pass a driver as a parameter. However, I haven't tried your suggested solution yet to see whether it gets around this problem; I will give it a shot soon.
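For context on the pickling failure, here is a minimal sketch of why an object holding live OS resources cannot be sent to another process. `FakeDriver` is a hypothetical stand-in, not the Selenium class; the assumption is that a real driver holds comparable handles (a service subprocess, locks, sockets).

```python
import pickle
import threading


class FakeDriver:
    """Hypothetical stand-in for a webdriver that owns a live OS resource."""

    def __init__(self):
        # A lock is used only because it is a simple unpicklable object.
        self._lock = threading.Lock()


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, which multiprocessing relies on."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


if __name__ == "__main__":
    print(is_picklable({"url": "https://example.com"}))  # plain data
    print(is_picklable(FakeDriver()))  # object holding a lock
```

If that assumption holds, the practical consequence is to pass only plain data (URLs, options) to each process and build the driver inside the process itself. Threads share memory, so they are not affected by this.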
Hmm, it seems like making the file read-only doesn't prevent the unlink from happening (it doesn't raise PermissionError, it just deletes the file and keeps going):
(Pdb) os.stat(patcher.executable_path)
os.stat_result(st_mode=33024, st_ino=249507007, st_dev=16777222, st_nlink=1, st_uid=501, st_gid=20, st_size=16498136, st_atime=1644116290, st_mtime=1644116290, st_ctime=1644116290)
(Pdb) os.unlink(patcher.executable_path)
(Pdb)
EDIT: fixed, deletes are controlled by write permissions on parent dir (https://unix.stackexchange.com/a/451427), so need to do this:
(Pdb) par = os.path.dirname(patcher.executable_path)
(Pdb) os.chmod(par, stat.S_IREAD)
(Pdb) os.unlink(patcher.executable_path)
*** PermissionError: [Errno 13] Permission denied: '/Users/markx/Library/Application Support/undetected_chromedriver/chromedriver
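The pdb session above can be reproduced without a real driver. This is a minimal sketch on a throwaway temp directory with a dummy file named `chromedriver`; `unlink_blocked` is a hypothetical helper. One assumption differs from the session: the directory keeps its execute (search) bit, so files inside it stay reachable while deletion is still refused.

```python
import os
import stat
import tempfile


def unlink_blocked(lock_parent):
    """Create a read-only file, optionally lock its directory, then try to delete it.

    Returns True if os.unlink raised PermissionError.
    """
    parent = tempfile.mkdtemp()
    path = os.path.join(parent, "chromedriver")  # dummy file, not a real driver
    with open(path, "wb") as fh:
        fh.write(b"dummy")
    os.chmod(path, stat.S_IREAD)  # read-only file
    if lock_parent:
        # r-x on the directory: entries can be opened but not removed
        os.chmod(parent, stat.S_IREAD | stat.S_IEXEC)
    try:
        os.unlink(path)
        blocked = False
    except PermissionError:
        blocked = True
    finally:
        # restore permissions so the temp directory can be cleaned up
        os.chmod(parent, stat.S_IRWXU)
        if os.path.exists(path):
            os.chmod(path, stat.S_IREAD | stat.S_IWRITE)
            os.unlink(path)
        os.rmdir(parent)
    return blocked


if __name__ == "__main__":
    print("file read-only only:", unlink_blocked(lock_parent=False))
    print("parent dir locked:  ", unlink_blocked(lock_parent=True))
```

On a POSIX system, run as a regular user, the first call reports that the delete went through and the second that it was refused. Root bypasses these permission checks, and Windows handles read-only files differently, so results there will not match.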
Confirmed, this also fixed windows issue, thanks!
good catch. But on Windows changing the parent directory permissions doesn't prevent the driver from being deleted. Here is a more generic solution that checks the OS:
import os
import stat
import undetected_chromedriver as webdriver

def lock_file(filename, lock=True):
    permissions = stat.S_IREAD
    if not lock:
        permissions |= stat.S_IWRITE
    if webdriver.IS_POSIX:
        # on POSIX, deletion is controlled by the parent directory
        filename = os.path.dirname(filename)
        # a directory needs its execute (search) bit to stay usable
        permissions |= stat.S_IEXEC
    if os.path.exists(filename):
        os.chmod(filename, permissions)

if __name__ == "__main__":
    patcher = webdriver.Patcher()
    # unlock chromedriver.exe and patch it
    lock_file(patcher.executable_path, lock=False)
    patcher.auto()
    # lock chromedriver.exe & monkey patch Patcher
    # to prevent the patcher from reading or writing it
    lock_file(patcher.executable_path)
    webdriver.Patcher.is_binary_patched = lambda self: True
    # now you can instantiate and use drivers in threads without random PermissionError
    # instantiation is way faster since there are no more download and patch ops.
The only caveat is that if you run a lot of chromedriver.exe concurrently with the same signature you might end up being detected. In that case you would need to patch a different chromedriver.exe for each thread, and I don't see a solution for that without modifying patcher.py or monkey patching Patcher.
EDIT: monkey patch of Patcher.is_binary_patched to prevent any further chromedriver.exe reading… Check why.
The only caveat is that if you run a lot of chromedriver.exe concurrently with the same signature you might end up being detected. In that case you would need to patch a different chromedriver.exe for each thread, and I don't see a solution for that without modifying patcher.py or monkey patching Patcher.
That's a very interesting point! I'd definitely want this so will give it a try. Just to make sure I understand, if I'm doing say 12 processes, I'd patch 12 times, but make sure they put the chromedriver in a different path to not overwrite each other. Now they will each use a different signature because they are using a different chromedriver.
And because they no longer share the file, there would not be a need to do the locking solution above anymore, since hopefully the "chromedriver unexpectedly exited. Status code was: -9" and "PermissionError: [Errno 13] Permission denied:" errors would no longer happen.
Does my understanding sound correct? I'll play around with this approach today.
that's exactly the idea and it would be even better to download the zip only once. I'm not sure it's worth it though :sweat_smile:
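A rough sketch of the "one file per worker" part of that idea, using only the standard library. `make_worker_copies` and the `worker_<i>` directory layout are hypothetical. How each copy then gets patched and handed to a driver depends on the library version, so that step is deliberately left out.

```python
import os
import shutil
import tempfile


def make_worker_copies(source_binary, n_workers, base_dir):
    """Copy source_binary into n_workers separate subdirectories of base_dir.

    Returns the list of per-worker paths. Patching each copy is left to the caller.
    """
    paths = []
    for i in range(n_workers):
        worker_dir = os.path.join(base_dir, f"worker_{i}")  # hypothetical layout
        os.makedirs(worker_dir, exist_ok=True)
        target = os.path.join(worker_dir, os.path.basename(source_binary))
        shutil.copy2(source_binary, target)  # copy2 keeps the permission bits
        paths.append(target)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        src = os.path.join(base, "chromedriver")
        with open(src, "wb") as fh:
            fh.write(b"dummy")  # stand-in for the downloaded driver
        for p in make_worker_copies(src, 3, base):
            print(p)
```

Copies made this way are byte-identical, so they only get different signatures if each one is patched separately afterwards. Starting from a single downloaded file is what avoids fetching the zip more than once.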
Oh, I've found another PermissionError: when it fails to delete chromedriver.exe, the patcher calls Patcher.is_binary_patched() to check whether the file has already been patched, and only then does it return without further ado. But that check has to open chromedriver.exe to read it, and it randomly raises a PermissionError when done concurrently. The patcher then thinks it has to patch the driver again and fails, because chromedriver.exe is now read-only. To avoid that, Patcher.is_binary_patched() has to be monkey patched to always return True after the first patch has been applied:
Patcher.is_binary_patched = lambda self: True
I’ve edited this answer’s code to fix it.
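To illustrate what that one-liner does, here is a small sketch using a hypothetical `FakePatcher` class in place of the library's Patcher. It shows that assigning to the class attribute changes the method for every instance, including ones created before the patch.

```python
class FakePatcher:
    """Hypothetical stand-in for the library's Patcher class."""

    def is_binary_patched(self):
        # Pretend this opens the driver file and fails under concurrency.
        raise PermissionError("file is locked")


existing = FakePatcher()  # instance created before the patch

# Replace the method on the class itself, so every instance is affected.
FakePatcher.is_binary_patched = lambda self: True

print(existing.is_binary_patched())
print(FakePatcher().is_binary_patched())
```

One thing to keep in mind with processes rather than threads: under the `spawn` start method, child processes re-import the module and do not run code under `if __name__ == "__main__":`, so a patch applied there is not inherited and would need to be applied in each worker as well.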
I use multiprocessing to start an undetected_chromedriver instance in each process, so that I can parallelize the work.
However, on my Windows computer (it never happens on Mac), sometimes a process will just not start the Chrome driver, and I believe it's due to a PermissionError when it fails to open the downloaded and unzipped file. Here's a stack trace:
The first process that calls this always goes through and the Chrome window pops up, but subsequent processes sometimes never pop a window and exit with this error.
Any ideas what I can do to try to fix this? I've tried setting browser_executable_path, but that still seems to let the patcher keep trying to patch. Should I try to disable the patcher and always have it use the latest chromedriver downloaded in my local folder? Let me know if there's anything else I can provide or help out with. Love this project, and it's critical to some of our workload!