Closed: AdilMughal2126 closed this 3 weeks ago
There's an example of using `ThreadPoolExecutor` in the UC Mode docs:
```python
import sys
from concurrent.futures import ThreadPoolExecutor
from seleniumbase import Driver

sys.argv.append("-n")  # Tell SeleniumBase to do thread-locking as needed


def launch_driver(url):
    driver = Driver(uc=True)
    try:
        driver.get(url=url)
        driver.sleep(2)
    finally:
        driver.quit()


urls = ['https://seleniumbase.io/demo_page' for i in range(3)]
with ThreadPoolExecutor(max_workers=len(urls)) as executor:
    for url in urls:
        executor.submit(launch_driver, url)
```
You didn't activate thread-locking in your script. You used headless mode, which prevents PyAutoGUI from working. To make the special virtual display work on Ubuntu, you need to use the `SB()` format instead of the `Driver()` format, so that things can run in a GUI-less display.

Since you're doing mass-multithreading, I'd consider adding some process cleanup with `psutil`, as well as cleaning up the `downloaded_files` folder after every 1000 runs or so, before it crashes.

SeleniumBase has its own automatic-waiting methods, so you shouldn't need to use `WebDriverWait` or `EC.presence_of_element_located` at all.
Thank you so much for your help, @mdmintz. I have a question: does PyAutoGUI come bundled with SeleniumBase, or do I need to install it separately? Also, I'm curious about cleaning up the folder during code execution. Could that potentially cause issues for the script?

If possible, could you share a simple example of how to clean up processes, clean up downloaded files, and utilize SB in a multithreaded environment? That would really help simplify things for me. If you could also consider adding such examples to the documentation, that would be amazing.

Thank you in advance, Michael. You're doing an excellent job for the community, and I truly appreciate your responsiveness and all your hard work.
PS: By the way, I am using thread-locking. I watched your YouTube video where you explained it (UC Mode video 2):
```python
import sys

from sbvirtualdisplay import Display

sys.argv.append("-n")
display = Display(visible=0, size=(1440, 1880))
display.start()
```
`PyAutoGUI` will get automatically installed by SeleniumBase at runtime if needed (if it isn't already installed). You may choose to install it in advance.
There aren't too many multithreading examples for `ThreadPoolExecutor`. You've pretty much seen "the" example. I'm mostly using multithreading with the `pytest` formats, as pytest does much of the work for you. There are plenty of examples of that.

See https://github.com/seleniumbase/SeleniumBase/issues/2860#issuecomment-2176340503 for examples of the above.
For process cleanup, I would experiment to see what works best for your environment. You might be able to clear up the `downloaded_files` folder in the middle of your script run, but you may want to pause things briefly if using `psutil` for process-cleanup before resuming.
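For example, one way to prune the folder mid-run (a generic sketch; the folder name and the keep-threshold are assumptions to adjust for your setup):

```python
import os
import shutil


def prune_folder(folder, max_entries=1000):
    """Delete the oldest entries in 'folder' (by mtime) once it holds
    more than 'max_entries' items, keeping the newest ones."""
    if not os.path.isdir(folder):
        return 0
    entries = sorted(
        (os.path.join(folder, name) for name in os.listdir(folder)),
        key=os.path.getmtime,
    )
    removed = 0
    for path in entries[: max(0, len(entries) - max_entries)]:
        if os.path.isdir(path):
            shutil.rmtree(path, ignore_errors=True)  # e.g. proxy ext dirs
        else:
            os.remove(path)
        removed += 1
    return removed
```

You'd call something like `prune_folder("downloaded_files", max_entries=1000)` between batches, after pausing the workers.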
Yeah, I have read almost all of the multithreading-related issues on your GitHub. But as I mentioned, I am using around 100 proxies. Is it feasible to use that number with pytest? Will pytest automatically handle all of the cleanup? I am sure my script crashed because of too many processes and the downloads folder. My script will run for around 10 hours daily.

PS: Even with the Driver format, it works great at the start with sbvirtualdisplay. So the main issue comes from the processes or the downloads folders.
The `SB()` and `pytest` formats do more cleanup than the `Driver()` format, but that part is more specific to folder-cleanup than process-cleanup. You'll have to experiment to find out what works best.
Thanks Michael! If you add more examples related to folder cleanup and multithreading to the docs in the future, that would be great.
I noticed the issue I was facing was because of processes. I somehow managed to clean the downloads folder, but now I don't know how to clean up the Chrome processes, or how to know which processes to remove. Basically, if we kill all Chrome processes, our script will crash. So any idea how we can figure out which processes to remove?
You can use `for proc in psutil.process_iter():` to iterate through processes, as described in https://stackoverflow.com/q/26627176/7058266. (Also see https://stackoverflow.com/a/20292161/7058266 for figuring out which processes are still alive and which are gone, so that you can perform additional actions as necessary.) From that, you can terminate all Chrome processes with a script like this:
```python
import time

import psutil

# ...
proc_list = []
for proc in psutil.process_iter():
    if "chrome" in proc.name().lower():
        proc_list.append(proc)
        proc.terminate()
        time.sleep(3)  # Adjust timing as needed
        if proc.is_running():
            proc.kill()
gone, alive = psutil.wait_procs(proc_list, timeout=3)  # Adjust timing as needed
for proc in alive:
    proc.kill()  # In case the process wasn't terminated the first time
```
The part with `gone, alive` may be unnecessary. Also, there's a chance that a new process will spin up with the same ID as a previously terminated process, which may cause an unrelated process to get terminated by accident during the `gone, alive` section.
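One way to reduce the risk of touching unrelated processes is to only target ones spawned after your script started, using `psutil`'s `create_time` (a sketch; the name filters match the examples above):

```python
import time

import psutil

RUN_STARTED = time.time()  # record once, when your scraper starts


def find_scraper_chrome_procs():
    """Collect Chrome/uc_driver processes that started after this
    script did, so pre-existing browsers survive the cleanup."""
    procs = []
    for proc in psutil.process_iter(["name", "create_time"]):
        try:
            name = (proc.info["name"] or "").lower()
            started = proc.info["create_time"]
            if ("chrome" in name or "uc_driver" in name) and started >= RUN_STARTED:
                procs.append(proc)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # process vanished or is off-limits; skip it
    return procs
```

You'd then pass the result through the same `terminate()` / `wait_procs()` flow as above.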
This code works fine on Windows and kills the processes, but on Ubuntu it's not working: it only prints the first and last print output to the console. However, when I run it in a python3 shell on Ubuntu, it works. I gave permissions to the file as well, but it still doesn't work.
By the way, can you give me some suggestions related to my project? I have been scraping a site which is very restrictive: it uses reCAPTCHA after every search and also blocks IPs frequently, so for every single record I have to bypass the captcha using a captcha service. I am using an AWS EC2 server with 4 CPUs and 8 GB of RAM. Is it better to fully automate it, or should I monitor it and scrape data in chunks? I need 15k records daily. Your suggestions would be very helpful.
```python
import time

import psutil


def kill_processes():
    print('Process killing started...')
    proc_list = []
    for proc in psutil.process_iter():
        if "chrome" in proc.name().lower() or "uc_driver" in proc.name().lower():
            proc_list.append(proc)
            proc.terminate()
            print(proc)
            time.sleep(3)  # Adjust timing as needed
            if proc.is_running():
                proc.kill()
    gone, alive = psutil.wait_procs(
        proc_list, timeout=3)  # Adjust timing as needed
    for proc in alive:
        proc.kill()
```
reCAPTCHA? UC Mode is only good for CF Turnstile and a few other sites that have bot checks.

I haven't used UC Mode under heavy load like that, so you may have to experiment. (15k is a lot.)
I am using a third-party capsolver API which solves the reCAPTCHA for me. Yeah, I think I need to talk with my client. It's better to scrape data in chunks rather than 15k daily when the site is this restrictive. Thanks a lot for the help.
I am working on building a large scraper where I need to interact with the target website, search for some filenames, solve captchas using the capsolver API, and append data to Google Sheets. My target is to pull 15k records daily. I am using concurrent.futures for multithreading. When I run the code, it initially works fine, but once it exceeds 1500 records it keeps slowing down and eventually stops with the following exception:
[selenium.common.exceptions.WebDriverException: Message: invalid session id using Selenium with ChromeDriver and Chrome through Python](https://stackoverflow.com/questions/56483403/selenium-common-exceptions-webdriverexception-message-invalid-session-id-using)
My downloads_files folder gets polluted with a lot of proxy files, but that's okay, as @mdmintz explained that it will generate proxy_ext_dir_0 (etc.) folders. But I don't know why it eventually stopped. Here is my code:
Code explanation: I input a starting_file_number and an ending_file_number, and the script divides the number range among the instances. If an IP gets blocked, it calls the _get_newdriver() function to get a fresh driver with a new proxy. I want to spin up around 10 to 15 instances, and I have 100 datacenter proxies from webshare.io. If someone has faced this issue before, please help me.
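The range-splitting I'm doing looks roughly like this (simplified; the function name is hypothetical, not the exact code from my script):

```python
def split_range(start, end, num_workers):
    """Split the inclusive file-number range [start, end] into up to
    num_workers contiguous (lo, hi) chunks, as evenly as possible."""
    total = end - start + 1
    base, extra = divmod(total, num_workers)
    chunks = []
    lo = start
    for i in range(num_workers):
        size = base + (1 if i < extra else 0)  # spread the remainder
        if size == 0:
            break  # more workers than numbers; stop early
        chunks.append((lo, lo + size - 1))
        lo += size
    return chunks


# Example: 10 instances over file numbers 1..100
# split_range(1, 100, 10) -> [(1, 10), (11, 20), ..., (91, 100)]
```

Each chunk then goes to one instance, which calls something like _get_newdriver() whenever its proxy gets blocked.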