seleniumbase / SeleniumBase

📊 Python's all-in-one framework for web crawling, scraping, testing, and reporting. Supports pytest. UC Mode provides stealth. Includes many tools.
https://seleniumbase.io
MIT License

SeleniumBase stopped while using multiple authenticated proxies with multi-threading on Ubuntu. #3044

Closed AdilMughal2126 closed 3 weeks ago

AdilMughal2126 commented 3 weeks ago

I am working on building a large scraper where I need to interact with the target website, search for some file numbers, solve captchas using the Capsolver API, and append data to Google Sheets. My target is to pull 15k records daily. I am using concurrent.futures for multithreading. When I run the code, it works fine initially, but once it exceeds 1500 records it keeps slowing down and eventually stops with the following exception: [selenium.common.exceptions.WebDriverException: Message: invalid session id using Selenium with ChromeDriver and Chrome through Python](https://stackoverflow.com/questions/56483403/selenium-common-exceptions-webdriverexception-message-invalid-session-id-using)

My downloaded_files folder gets polluted with a lot of proxy files, but that's okay, as @mdmintz explained that it will generate proxy_ext_dir_0 (and similar) folders. But I don't know why it eventually stopped. Here is my code:

Code explanation: I input the starting and ending file numbers, the script divides that range into sub-ranges and gives one to every instance, and if an IP gets blocked, it calls the get_new_driver() function to get a fresh driver with a new proxy. I want to spin up around 10 to 15 instances, and I have 100 datacenter proxies from webshare.io. If someone has faced this issue before, please help me.

import time
from concurrent.futures import ThreadPoolExecutor

from colorama import Fore
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from seleniumbase import Driver

# proxy_rotator, logger, SITE_URL, SITE_KEY, CAPSOLVER_API_KEY, and the
# helper functions (search_entity, handle_recaptcha, scrape_entity_details,
# append_to_csv, get_file_numbers) are defined elsewhere in the project.


def get_new_driver():
    proxy = proxy_rotator.get_next_proxy()
    return Driver(uc=True, headless=True, multi_proxy=True,
                  proxy=proxy, no_sandbox=True, block_images=True)


def run_scraper(starting_file_number, end_file_number):

    current_file_number = starting_file_number
    driver = get_new_driver()

    # Only for Debugging
    # timestamp = time.strftime("%Y%m%d-%H%M%S")
    # filename = f"screenshot_{timestamp}.png"

    try:
        driver.uc_open_with_reconnect(SITE_URL, 10)

        csv_file = 'entity_details2.csv'

        while current_file_number <= end_file_number:
            try:
                search_entity(driver, current_file_number)

                form_element = WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located((By.TAG_NAME, 'form'))
                )
                form_text = form_element.text

                if 'Please complete the reCAPTCHA' in form_text:
                    handle_recaptcha(driver, CAPSOLVER_API_KEY,
                                     SITE_KEY, SITE_URL)
                    search_entity(driver, current_file_number)

                result = WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located(
                        (By.CSS_SELECTOR, '#tblResults'))
                )
                WebDriverWait(driver, 10).until(
                    EC.presence_of_all_elements_located(
                        (By.CSS_SELECTOR, '#tblResults tr'))
                )

                # After ensuring rows are present, get them
                rows = result.find_elements(By.CSS_SELECTOR, 'tr')[1:]  # Exclude the header row

                for row in rows:
                    td = row.find_element(By.CSS_SELECTOR, 'a')
                    td.click()

                    entity_details = scrape_entity_details(driver)
                    print(entity_details)
                    if entity_details:
                        append_to_csv(entity_details, csv_file)

                    driver.back()

                    WebDriverWait(driver, 10).until(
                        EC.presence_of_element_located(
                            (By.CSS_SELECTOR, '#tblResults'))
                    )
                current_file_number += 1
            except Exception as e:

                # Only for Debugging
                # timestamp = time.strftime("%Y%m%d-%H%M%S")
                # filename = f"screenshot_{timestamp}.png"
                # source_filename = f"screenshot_{timestamp}"
                # Take a screenshot and save it with the dynamic filename
                # driver.save_screenshot(filename)

                print(f"An error occurred for file number {current_file_number}: {e}")

                msg = driver.title + ' Custom Message'
                logger.warning(msg=msg)
                # if '502 Bad Gateway' in driver.title:
                driver.quit()
                driver = get_new_driver()
                print('Getting new driver...')
                driver.uc_open_with_reconnect(SITE_URL, 10)
                continue

    finally:
        driver.quit()

if __name__ == "__main__":
    # Number of concurrent scrapers you want to run
    num_workers = 3

    # Define ranges for different instances
    # start_file_number = 4618981
    # end_file_number = 4623981

    start_file_number, end_file_number = get_file_numbers()
    range_size = (end_file_number - start_file_number + 1) // num_workers

    file_number_ranges = [
        (start_file_number + i * range_size,
         start_file_number + (i + 1) * range_size - 1)
        for i in range(num_workers)
    ]

    # Ensure the last range extends to the end file number
    file_number_ranges[-1] = (file_number_ranges[-1][0], end_file_number)

    print(Fore.CYAN + "File number ranges for each worker:", file_number_ranges)

    # Execute the scrapers
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(run_scraper, start, end)
                   for start, end in file_number_ranges]

        for future in futures:
            future.result()

    print("All scraping instances are done!")
mdmintz commented 3 weeks ago

There's an example of using ThreadPoolExecutor in the UC Mode docs:

import sys
from concurrent.futures import ThreadPoolExecutor
from seleniumbase import Driver
sys.argv.append("-n")  # Tell SeleniumBase to do thread-locking as needed

def launch_driver(url):
    driver = Driver(uc=True)
    try:
        driver.get(url=url)
        driver.sleep(2)
    finally:
        driver.quit()

urls = ['https://seleniumbase.io/demo_page' for i in range(3)]
with ThreadPoolExecutor(max_workers=len(urls)) as executor:
    for url in urls:
        executor.submit(launch_driver, url)

You didn't activate thread-locking in your script. You used headless mode, which prevents PyAutoGUI from working. To make the special virtual display work on Ubuntu, you need to use the SB() format instead of the Driver() format, so that things run in a GUI-less display.
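
For reference, a minimal sketch of that SB() format with a virtual display on Linux (the demo URL is just a placeholder):

from seleniumbase import SB

# xvfb=True starts a virtual display on Linux (GUI-less, but not headless),
# so PyAutoGUI can still operate where headless=True would break it.
with SB(uc=True, xvfb=True) as sb:
    sb.uc_open_with_reconnect("https://seleniumbase.io/demo_page", 10)
    sb.sleep(2)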

Since you're doing mass-multithreading, I'd consider adding some process cleanup with psutil, as well as cleaning up the downloaded_files folder after every 1000 runs or so, before it crashes.

SeleniumBase has its own automatic-waiting methods, so you shouldn't need to use WebDriverWait or EC.presence_of_element_located at all.
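
For example, the explicit waits in the script above could be replaced with a sketch like this (selectors taken from the posted code; the timeout value is illustrative):

# SeleniumBase methods wait for elements automatically before acting:
result = driver.wait_for_element("#tblResults", timeout=10)
driver.click("#tblResults tr a")  # waits for the link, then clicks it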

AdilMughal2126 commented 3 weeks ago

Thank you so much for your help, @mdmintz. I have a question: does PyAutoGUI come bundled with SeleniumBase, or do I need to install it separately? Also, I'm curious about cleaning up the folder during code execution: could that potentially cause issues for the script?

If possible, could you share a simple example of how to clean up processes, download files, and how to utilize SB in a multithreaded environment? That would really help simplify things for me. If you could also consider adding such examples to the documentation, that would be amazing.

Thank you in advance, Michael. You're doing an excellent job for the community, and I truly appreciate your responsiveness and all your hard work.

PS: By the way, I am using thread-locking. I watched your video on YouTube (UC Mode video 2) where you explained it:

import sys

from sbvirtualdisplay import Display

sys.argv.append("-n")
display = Display(visible=0, size=(1440, 1880))
display.start()
mdmintz commented 3 weeks ago

PyAutoGUI will get automatically installed by SeleniumBase at runtime if it isn't already installed. You may choose to install it in advance.

There aren't too many multithreading examples for ThreadPoolExecutor. You've pretty much seen "the" example. I'm mostly using multithreading with the pytest formats, as pytest does much of the work for you. There are plenty of examples of that.

See https://github.com/seleniumbase/SeleniumBase/issues/2860#issuecomment-2176340503 for examples of the above.
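
For orientation, a minimal sketch of that pytest format (the class, test name, and URL are placeholders; the -n flag uses pytest-xdist, which ships with SeleniumBase):

# Run with: pytest test_scrape.py -n 3 --uc
from seleniumbase import BaseCase
BaseCase.main(__name__, __file__)

class ScrapeTest(BaseCase):
    def test_scrape(self):
        self.open("https://seleniumbase.io/demo_page")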

For process cleanup, I would experiment around to see what works best for your environment. You might be able to clear up the downloaded_files folder in the middle of your script run, but you may want to pause things briefly if using psutil for process-cleanup before resuming.
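
For instance, a minimal sketch of mid-run folder cleanup (an assumption here: every driver must be paused or quit while it runs):

import os
import shutil

def clean_downloaded_files(folder="downloaded_files"):
    # Deletes leftover proxy extension folders between batches.
    # Only call this while no driver is actively using the folder.
    if os.path.isdir(folder):
        shutil.rmtree(folder, ignore_errors=True)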

AdilMughal2126 commented 3 weeks ago

Yeah, I have read almost all of the GitHub issues related to this multithreading. But as I mentioned, I am using around 100 proxies. Is it feasible to use that number with pytest? Will pytest automatically do all of the cleanup? I am sure my script crashed because of too many processes and the downloads folder. The running time of my script will be around 10 hours daily.

PS: Even with the Driver() format, it works great at the start with sbvirtualdisplay. So the main issue comes from the processes or the downloads folder.

mdmintz commented 3 weeks ago

The SB() and pytest formats do more cleanup than the Driver() format, but that part is more specific to folder-cleanup than process-cleanup. You'll have to experiment to find out what works best.

AdilMughal2126 commented 3 weeks ago

Thanks, Michael. If you add more examples related to folder cleanup and multithreading to the docs in the future, that would be great.

AdilMughal2126 commented 3 weeks ago

I confirmed the issue I was facing was because of processes. I somehow managed to clean the downloads folder, but now I don't know how to clean up the Chrome processes, or how to know which process to remove; if we kill all processes, our script will crash. So, any idea how we can figure out which process to remove?

mdmintz commented 3 weeks ago

You can use for proc in psutil.process_iter(): to iterate through processes, as described in https://stackoverflow.com/q/26627176/7058266. (Also see https://stackoverflow.com/a/20292161/7058266 for figuring out which processes are gone and which are still alive, so that you can perform additional actions as necessary.) From that, you can terminate all Chrome processes with a script like this:

import time

import psutil

# ...

proc_list = []
for proc in psutil.process_iter():
    if "chrome" in proc.name().lower():
        proc_list.append(proc)
        proc.terminate()
        time.sleep(3)  # Adjust timing as needed
        if proc.is_running():
            proc.kill()
gone, alive = psutil.wait_procs(proc_list, timeout=3)  # Adjust timing as needed
for proc in alive:
    proc.kill()  # In case the process wasn't terminated the first time

The part with gone, alive may be unnecessary. Also, there's a chance that a new process will spin up with the same ID as a previously terminated process, which may cause an unrelated process to get terminated by accident during the gone, alive section.

AdilMughal2126 commented 3 weeks ago

This code works fine on Windows and killed the processes, but on Ubuntu it's not working; it just prints the first and last print statements to the console. But when I run it in a python3 shell on Ubuntu, it works. I gave permissions to the file as well, but it still doesn't work.

By the way, can you give me some suggestions for my project? I have been scraping a site which is very restrictive: it uses reCAPTCHA after every search and also blocks IPs frequently, so for every single record I have to bypass the captcha using a captcha service. I am using an AWS EC2 server with 4 CPUs and 8 GB of RAM. Is it better to fully automate it, or should I monitor it and scrape the data in chunks, given that I need 15k records daily? Your suggestions would be very helpful.

import time

import psutil


def kill_processes():
    print('Process killing started...')
    proc_list = []
    for proc in psutil.process_iter():
        try:
            name = proc.name().lower()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # The process vanished or is off-limits; skip it
        if "chrome" in name or "uc_driver" in name:
            proc_list.append(proc)
            proc.terminate()
            print(proc)
            time.sleep(3)  # Adjust timing as needed
            if proc.is_running():
                proc.kill()
    gone, alive = psutil.wait_procs(
        proc_list, timeout=3)  # Adjust timing as needed
    for proc in alive:
        proc.kill()
mdmintz commented 3 weeks ago

reCAPTCHA? UC Mode is only good for CF Turnstile and a few other sites that have bot checks.

I haven't used UC Mode under a heavy load like that, so you may have to experiment around. (15k is a lot)

AdilMughal2126 commented 3 weeks ago

I am using the third-party Capsolver API, which solves the reCAPTCHA for me. Yeah, I think I need to talk with my client. It's better to scrape the data in chunks than to pull 15k daily when the site is this restrictive. Thanks a lot for the help.