When using ThreadPoolExecutor, a minor problem occurs.

seleniumbase / SeleniumBase

📊 Python's all-in-one framework for web crawling, scraping, testing, and reporting. Supports pytest. UC Mode provides stealth. Includes many tools.

Currently, I'm using it like this:

def setup_driver() -> Driver:
    driver = Driver(browser="chrome", headless=True, no_sandbox=True)
    return driver

# def crawling(driver):
#     driver = setup_driver()
#     driver.get(url)
#     ...

with ThreadPoolExecutor(max_workers=4) as executor:
    results[A] = executor.submit(crawling)
    results[B] = executor.submit(crawling)
    results[C] = executor.submit(crawling)
    results[D] = executor.submit(crawling)

This doesn't work well in parallel. How can I fix it?

It worked fine with Selenium's WebDriver, but I want to use SeleniumBase because of its automatic chromedriver download feature.

def setup_driver() -> webdriver.Chrome:
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Chrome(service=Service(), options=options)
    return driver

The code above worked fine locally, but when running in environments like AWS Lambda, the Chrome WebDriver was not installed automatically.

Thank you.

import sys
from concurrent.futures import ThreadPoolExecutor
from seleniumbase import Driver

sys.argv.append("-n")  # Tell SeleniumBase to do thread-locking as needed


def launch_driver(url):
    driver = Driver()
    try:
        driver.get(url=url)
        driver.sleep(2)
    finally:
        driver.quit()


urls = ['https://seleniumbase.io/demo_page' for i in range(4)]
with ThreadPoolExecutor(max_workers=len(urls)) as executor:
    for url in urls:
        executor.submit(launch_driver, url)
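
The question keeps the futures in a `results` mapping, so here is a rough sketch of how the return values could be collected afterwards. `executor.submit` stores any exception on the returned future instead of raising it right away, so a failing worker can look like "nothing happened" until `result()` is called. The names `crawl` and `crawl_all` are hypothetical, and `crawl` is only a stand-in that avoids launching a browser; a real job would create the `Driver()` inside the worker and quit it in a `finally` block, as in the snippet above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl(url):
    # Hypothetical stand-in for the real job. A real worker would create
    # its own Driver(), call driver.get(url), and quit in a finally block.
    if not url.startswith("https://"):
        raise ValueError(f"unsupported URL: {url}")
    return len(url)


def crawl_all(urls, max_workers=4):
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the URL it was submitted for.
        futures = {executor.submit(crawl, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                # result() re-raises any exception raised in the worker.
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors
```

Calling `crawl_all(urls)` returns one dict of successful results and one of exceptions, both keyed by URL, which makes it easier to see which worker failed and why.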

When using ThreadPoolExecutor, a minor problem occurs. #2759