seleniumbase / SeleniumBase

📊 Python's all-in-one framework for web crawling, scraping, testing, and reporting. Supports pytest. UC Mode provides stealth. Includes many tools.
https://seleniumbase.io
MIT License
4.46k stars, 909 forks

When using ThreadPoolExecutor, a minor problem occurs. #2759

Closed hwk06023 closed 1 month ago

hwk06023 commented 1 month ago

Currently, I'm using it as shown below.

def setup_driver() -> Driver:
    driver = Driver(browser="chrome", headless=True, no_sandbox=True)
    return driver

def crawling():  # takes no arguments: each task creates its own driver
    driver = setup_driver()
    driver.get(url)
    ...

with ThreadPoolExecutor(max_workers=4) as executor:
    results[A] = executor.submit(crawling)
    results[B] = executor.submit(crawling)
    results[C] = executor.submit(crawling)
    results[D] = executor.submit(crawling)

This doesn't work well in parallel. How can I fix it?

It worked fine with Selenium's WebDriver, but I want to use SeleniumBase because of its automatic chromedriver download feature.

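from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def setup_driver() -> webdriver.Chrome:
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Chrome(service=Service(), options=options)
    return driver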

The version above worked fine locally, but when running in environments like AWS Lambda, the Chrome WebDriver was not installed automatically.

Thank you.

mdmintz commented 1 month ago

A few things: If using ThreadPoolExecutor instead of pytest-xdist for multithreading, be sure to add sys.argv.append("-n") to activate SeleniumBase thread-locking as needed. Example:

import sys
from concurrent.futures import ThreadPoolExecutor
from seleniumbase import Driver
sys.argv.append("-n")  # Tell SeleniumBase to do thread-locking as needed

def launch_driver(url):
    driver = Driver()
    try:
        driver.get(url=url)
        driver.sleep(2)
    finally:
        driver.quit()

urls = ['https://seleniumbase.io/demo_page' for i in range(4)]
with ThreadPoolExecutor(max_workers=len(urls)) as executor:
    for url in urls:
        executor.submit(launch_driver, url)
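
If you also need a return value from each thread (as the `results[...]` assignments in the question suggest), a minimal sketch might look like the following. The names `get_title` and `run_in_parallel` are hypothetical helpers made up for this illustration, not part of SeleniumBase. The `Driver` import sits inside the worker only so that the pool helper can be tried with any callable, without a browser.

```python
import sys
from concurrent.futures import ThreadPoolExecutor

sys.argv.append("-n")  # Tell SeleniumBase to do thread-locking as needed


def get_title(url):
    """Hypothetical worker: open the URL in its own driver and return the page title."""
    from seleniumbase import Driver  # imported here so the pool helper below needs no browser

    driver = Driver()
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()  # always release the browser, even on errors


def run_in_parallel(worker, urls, max_workers=4):
    """Run worker(url) for each URL; return (url, result_or_exception) pairs in input order."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit everything first so the tasks run concurrently.
        futures = [executor.submit(worker, url) for url in urls]
        for url, future in zip(urls, futures):
            try:
                results.append((url, future.result()))
            except Exception as exc:  # record the failure and keep going
                results.append((url, exc))
    return results
```

Calling `run_in_parallel(get_title, urls)` with the `urls` list from the example above would then give one `(url, title)` pair per page, or `(url, exception)` for any page that failed.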

For headless Linux environments, you may want to use xvfb=True in your Driver() declaration. Also note that there is a newer Chromium headless mode, which is activated by using headless2=True in your Driver() declaration.
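
As a rough sketch of how those options could be switched per environment: the helper names `driver_kwargs` and `make_driver` and the `mode` strings are assumptions made up for this illustration. The keyword arguments `xvfb`, `headless2`, and `headless` are the ones mentioned in this thread; check that your installed SeleniumBase version accepts them in `Driver()`.

```python
def driver_kwargs(mode):
    """Map a hypothetical mode name to the Driver() keyword arguments named in this thread."""
    options = {
        "xvfb": {"xvfb": True},            # virtual display on headless Linux
        "headless2": {"headless2": True},  # newer Chromium headless mode
        "headless": {"headless": True},    # headless mode used in the question
    }
    if mode not in options:
        raise ValueError(f"unknown mode: {mode!r}")
    return dict(options[mode])  # copy so callers can modify it safely


def make_driver(mode="headless2"):
    """Create a SeleniumBase driver for the given mode (requires seleniumbase and Chrome)."""
    from seleniumbase import Driver  # imported here so driver_kwargs works without a browser

    return Driver(browser="chrome", **driver_kwargs(mode))
```

Keeping the option mapping separate from driver creation makes it possible to check the chosen settings without launching a browser.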