ultrafunkamsterdam / undetected-chromedriver

Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / CloudFlare IUAM)
https://github.com/UltrafunkAmsterdam/undetected-chromedriver
GNU General Public License v3.0
9.96k stars 1.16k forks

MultiThreading when enable_cdp_events causes urllib3.connectionpool:Retrying and urllib3.connectionpool:Connection pool is full #1384

Closed: pedro-peixot0 closed this issue 1 year ago

pedro-peixot0 commented 1 year ago

Here is some sample code to reproduce the issue. I am using a Mac and haven't tried other operating systems.

from selenium import webdriver
import undetected_chromedriver as uc
from socket import socket
import concurrent.futures
import json
import time

PRODUCT_REQUEST_TIMEOUT = 30

def generate_webdriver(
    headless: bool,
    proxies: dict = None, # mapping with an 'https' key holding the proxy address
    enable_cdp_events: bool = False, # if set to True, allows network interception
    enable_port_selection: bool = False # this option increases the number of possible open webdrivers, but can have a few bugs
):
    def get_port():
        # Bind to port 0 so the OS picks a free port, then release the socket
        with socket() as sock:
            sock.bind(('', 0))
            return sock.getsockname()[1]

    chrome_options = webdriver.ChromeOptions() # Create driver options object
    chrome_options.add_argument('--incognito') # opens driver in incognito mode
    #chrome_options.add_argument("--disable-geolocation") # disable automatic geolocation
    chrome_options.add_argument("--disable-extensions") # disable other extensions
    chrome_options.add_argument("--lang=en-US") # Change Chrome language

    if headless:
        chrome_options.add_argument("--headless")

    if proxies:
        proxy = proxies['https']
        chrome_options.add_argument(f'--proxy-server={proxy}')
        chrome_options.add_argument('--timeout=120')

    browser = uc.Chrome(
        options = chrome_options,
        enable_cdp_events=enable_cdp_events,
        remotePort = get_port() if enable_port_selection else None,
        use_subprocess=False
    )

    return browser

def _get_product_raw_data(
    item_url: str,
):
    json_data_output = None
    def response_handler(event):
        nonlocal json_data_output  # Use nonlocal to refer to the outer variable
        try:
            request_id = event['params']['requestId']
            response_data = browser.execute_cdp_cmd('Network.getResponseBody', {'requestId': request_id})
            response_json = json.loads(response_data['body'])

            if response_json.get('data', {}).get('itemid'):
                json_data_output = response_json
                browser.quit()

        except Exception:
            pass

    browser = generate_webdriver(
        headless=True,
        enable_cdp_events=True,

    )

    # adding feature that will monitor the network
    browser.add_cdp_listener(
        event_name='Network.dataReceived',  # filtering event type for data received
        callback=response_handler # parsing function that will deal with the data filtered
    )

    browser.get(url=item_url)

    wait_start = time.time()
    while not json_data_output:
        if (time.time() - wait_start) > PRODUCT_REQUEST_TIMEOUT:
            browser.quit()
            raise TimeoutError(f"Took more than {PRODUCT_REQUEST_TIMEOUT} seconds to intercept product information request")
        time.sleep(0.5)

    return json_data_output

with concurrent.futures.ThreadPoolExecutor(
    max_workers=10
) as executor:
    futures = [
        executor.submit(
            _get_product_raw_data,
            "https://shopee.com.br/Energ%C3%A9tico-Red-Bull-Tropical-Lata-250ml-i.466923583.8767304790?sp_atk=b41e5366-51a4-4ae2-b313-7a1b9f99fbe5&xptdk=b41e5366-51a4-4ae2-b313-7a1b9f99fbe5"
        ) for _ in range(100)
    ]

    for future in concurrent.futures.as_completed(futures):
        pass
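
To check whether a change actually removes the warnings named in the title, one option is to count what the urllib3.connectionpool logger emits while the reproduction runs. The sketch below uses only the standard logging module; WarningCounter and watch_urllib3_pool are hypothetical helper names, and it assumes the messages arrive at WARNING level or above.

```python
import logging


class WarningCounter(logging.Handler):
    """Collects WARNING-or-higher messages from the logger it is attached to."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record):
        # Store the fully formatted message text
        self.messages.append(record.getMessage())


def watch_urllib3_pool():
    """Attach a counter to the urllib3.connectionpool logger and return it."""
    counter = WarningCounter()
    logging.getLogger("urllib3.connectionpool").addHandler(counter)
    return counter


counter = watch_urllib3_pool()
# ... run the ThreadPoolExecutor reproduction here ...
print(f"{len(counter.messages)} connection pool warnings captured")
```

Comparing the count before and after a candidate fix gives a quick signal without reading the console output by hand.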
pedro-peixot0 commented 1 year ago

thought it was a fix, but it did not work

dylankeep commented 1 year ago

I ran into this issue too. Did you solve it?

dylankeep commented 1 year ago

pedro-peixot0 wrote: "thought it was a fix, but it did not work"

I solved it. You need to shut down the browser with browser.close() to release all resources; browser.quit() does not release them. This is the reverse of how Selenium behaves, and I don't know why. You can also use browser.clear_cdp_listeners() or browser.execute_cdp_cmd("Network.disable", {}) to remove the CDP listener.

pedro-peixot0 commented 1 year ago

Hey @dylankeep, this did not work in my case. I actually found what the issue was:

When enable_cdp_events is set to True, a Reactor object is created, and with it, the object's listen function is called, starting a loop that invokes the driver.get_log function. This function uses a PoolManager / ConnectionPool object from urllib3 to monitor the network. By default, these objects handle a maximum of 1 connection, which is already being used by the aforementioned loop.

At some point during the driver.quit() call that closes the webdriver (I haven't tried to find where), this same object seems to be accessed again, which exceeds the connection limit and triggers those errors.
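
The mechanism described above can be illustrated with a toy model. This is not urllib3's code: TinyPool and its counters are hypothetical names, loosely modeled on a non-blocking pool with a single slot. It only shows why a second concurrent user of a one-slot pool ends up with a discarded connection.

```python
import queue


class TinyPool:
    """Toy stand-in for a connection pool with a fixed number of slots."""

    def __init__(self, maxsize=1):
        self._slots = queue.LifoQueue(maxsize)
        for _ in range(maxsize):
            self._slots.put(None)  # None marks a slot with no open connection yet
        self.created = 0
        self.discarded = 0

    def get(self):
        try:
            conn = self._slots.get(block=False)
        except queue.Empty:
            conn = None  # every slot is in use; fall through and open a new one
        if conn is None:
            self.created += 1
            conn = f"conn-{self.created}"
        return conn

    def put(self, conn):
        try:
            self._slots.put(conn, block=False)
        except queue.Full:
            # This is the point where a "pool is full" warning would be logged
            self.discarded += 1


pool = TinyPool(maxsize=1)
loop_conn = pool.get()  # the reactor loop's request takes the only slot
quit_conn = pool.get()  # a second caller arrives meanwhile and opens a new connection
pool.put(loop_conn)     # the first connection goes back into the slot
pool.put(quit_conn)     # the slot is occupied, so this connection is discarded
print(pool.created, pool.discarded)
```

With one caller at a time the discard counter stays at zero, which is consistent with the idea of stopping the loop before quitting.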

The solution is quite simple; we just need to stop the loop started by the listen function from the Reactor object before calling the driver.quit() function. It can be done like this:

import undetected_chromedriver as uc
import time

driver = uc.Chrome(
    enable_cdp_events=True,
    headless=True
)

print(f"is reactor loop closed? {driver.reactor.loop.is_closed()}")
# >>> is reactor loop closed? False

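# Signal the reactor thread to stop, then retry until its event loop can be closed
while not driver.reactor.loop.is_closed():
    try:
        driver.reactor.loop.close()
    except RuntimeError:  # raised while the loop is still running
        driver.reactor.event.set()
        time.sleep(0.5)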

print(f"is reactor loop closed? {driver.reactor.loop.is_closed()}")
# >>> is reactor loop closed? True

driver.quit()
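
For use inside worker threads, the same idea could be wrapped in a helper with an upper bound on the wait. This is a sketch, not part of undetected_chromedriver: stop_reactor_then_quit and its timeout and poll parameters are hypothetical, and it assumes driver.reactor exposes event (a threading.Event) and loop (an asyncio event loop) as in the snippet above.

```python
import time


def stop_reactor_then_quit(driver, timeout=10.0, poll=0.5):
    """Stop the CDP reactor loop, if there is one, then quit the driver."""
    reactor = getattr(driver, "reactor", None)
    if reactor is not None:
        deadline = time.monotonic() + timeout
        reactor.event.set()  # ask the listen loop to finish
        while not reactor.loop.is_closed() and time.monotonic() < deadline:
            try:
                reactor.loop.close()
            except RuntimeError:
                # The loop is still running; wait a little and retry
                time.sleep(poll)
    driver.quit()
```

Calling stop_reactor_then_quit(browser) in place of browser.quit() keeps the shutdown order in one place. If the loop has not closed by the deadline, the helper still calls quit() rather than blocking the worker forever.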
dylankeep commented 1 year ago

Thanks for your help, @pedro-peixot0. This is the real cause. I tried it after getting your reply and it works fine. Nice~

pedro-peixot0 commented 1 year ago

I edited my previous comment after testing a bit more and reading what @avasilkov wrote in issue #458. Now I think things are fixed.