ultrafunkamsterdam / undetected-chromedriver

Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / CloudFlare IUAM)
https://github.com/UltrafunkAmsterdam/undetected-chromedriver
GNU General Public License v3.0
9.57k stars 1.14k forks

UC got detected #696

Open realdronos opened 2 years ago

realdronos commented 2 years ago

Hello. I have an Ubuntu server running Python, UC, and Google Chrome. Until the latest browser update, the script ran fine on the server. After the update, it started failing with this error:

selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"/html/body/div[6]/div[3]/div/div/div/div/main/div/div[2]/div[3]/div/div/div/ul/li[3]"} (Session info: headless chrome=103.0.5060.53)

I set headless to False and took a screenshot: it shows a white screen. I suspect that the site has started detecting UC. As a workaround I want to downgrade Chrome to a previous version, so my question is: where can I find previous versions of UC, and how do I install the correct one?
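A white screenshot alone does not say whether the page was blocked, failed to load, or simply changed layout. The sketch below is one way to check whether the returned HTML carries any visible text at all. The helper names `visible_text` and `looks_blank` and the 20-character threshold are hypothetical choices for illustration, not part of UC or Selenium.

```python
from html.parser import HTMLParser

# Tags whose text content is never rendered in the page body.
SKIPPED_TAGS = ("script", "style", "title")


class _VisibleText(HTMLParser):
    """Collects text that sits outside <script>, <style> and <title>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(page_source: str) -> str:
    """Return the collected text chunks joined by single spaces."""
    parser = _VisibleText()
    parser.feed(page_source)
    parser.close()
    return " ".join(parser.chunks)


def looks_blank(page_source: str, min_chars: int = 20) -> bool:
    """Treat a page as blank when it has fewer than min_chars of visible text."""
    return len(visible_text(page_source)) < min_chars
```

With a live session this could be called as `looks_blank(driver.page_source)` right after `driver.get(url)`. A blank result points at the page load itself rather than at the XPath selectors.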

Current versions:
undetected-chromedriver: 3.1.5.post4
selenium: 4.1.3
Google Chrome: 103.0.5060.53
chromedriver: https://chromedriver.storage.googleapis.com/index.html?path=103.0.5060.53/

P.S. On my local computer the script starts and runs without any problems.

Code for test:

import pandas as pd
import numpy as np
import re
import time
import random
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.service import Service
from selenium.common.exceptions import InvalidSelectorException
import undetected_chromedriver as uc
from selenium.webdriver.support.ui import Select
from datetime import datetime

# useragent = UserAgent()
user_agent_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36'
]

# options
options = webdriver.ChromeOptions()

# user-agent
options.add_argument(f"user-agent={random.choice(user_agent_list)}")

# disable webdriver
options.add_argument("--disable-blink-features=AutomationControlled")

# headless mode
options.add_argument("--headless")

# added modes
# options.add_argument("start-maximized")
options.add_argument("--no-sandbox")
options.add_argument("--disable-infobars")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--window-size=1920,1080")  # width,height are comma-separated
# options.add_argument("--disable-browser-side-navigation")
options.add_argument("--disable-notifications")
options.add_argument("--disable-gpu")
options.add_argument('--verbose')

if __name__ == '__main__':
    driver = uc.Chrome(executable_path= '/root/venv/bin/parse/chromedriver', options=options)

    url = 'https://spb.vseinstrumenti.ru/instrument/shurupoverty/akkumulyatornye-dreli/'
    page = 1
    goods_len = 1
    df_vi1 = pd.DataFrame(columns=['id', 'good', 'url', 'price'])

    driver.get(url=url)
    driver.implicitly_wait(1)

    catalog_wrapper = driver.find_elements(By.XPATH,
                                           '/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[6]/div[2]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[6]/div[3]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[7]/div[2]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[4]/div[*]/div[6]/div[2]/div[2]/a|/html/body/div[6]/div[2]/div/div/div/div/main/div/div[6]/div[*]/div[6]/div[2]/div[2]/a|/html/body/div[6]/div[2]/div/div/div/div/main/div/div[6]/div[*]/div[7]/div[3]/div[2]/a|/html/body/div[6]/div[2]/div/div/div/div/main/div/div[6]/div[*]/div[7]/div[2]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[7]/div[3]/div[2]/a')
    goods_len = len(catalog_wrapper)
    print(f'URL: {url}, Goods_available_1: {goods_len} \n')

    # Click the "plitka" button once, but only if the listing elements were found.
    number_count = driver.find_elements(By.XPATH, '/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[2]/div[4]/p[1]')
    if number_count:
        button_plitka = driver.find_element(By.XPATH, '/html/body/div[6]/div[3]/div/div/div/div/main/div/div[2]/div[3]/a')
        driver.execute_script("arguments[0].click();", button_plitka)

    driver.implicitly_wait(2)
    goods_count = driver.find_element(By.XPATH,
                                      '/html/body/div[6]/div[3]/div/div/div/div/main/div/div[2]/div[3]/div/div/div/ul/li[3]')
    driver.execute_script("arguments[0].click();", goods_count)

    driver.implicitly_wait(2)
    catalog_wrapper = driver.find_elements(By.XPATH,
                                           '/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[6]/div[2]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[6]/div[3]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[7]/div[2]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[4]/div[*]/div[6]/div[2]/div[2]/a|/html/body/div[6]/div[2]/div/div/div/div/main/div/div[6]/div[*]/div[6]/div[2]/div[2]/a|/html/body/div[6]/div[2]/div/div/div/div/main/div/div[6]/div[*]/div[7]/div[3]/div[2]/a|/html/body/div[6]/div[2]/div/div/div/div/main/div/div[6]/div[*]/div[7]/div[2]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[7]/div[3]/div[2]/a')
    goods_len = len(catalog_wrapper)
    print(f'URL: {url}, Goods_available_2: {goods_len} \n')

    driver.close()
    driver.quit()

    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    print('Done at:', current_time)

sebdelsol commented 2 years ago

No, it's not:

EDIT:

My test code scraped those 1000 items without a hitch:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

import undetected_chromedriver as uc

if __name__ == "__main__":
    options = uc.ChromeOptions()
    options.headless = True  # you're lucky headless works for this site... for now
    driver = uc.Chrome(options=options)
    wait = WebDriverWait(driver, 5)

    url = "https://spb.vseinstrumenti.ru/instrument/shurupoverty/akkumulyatornye-dreli/"
    x_items = '//div[@class="listing-grid"][1]//div[contains(@class, "product-tile grid-item")]'
    x_item_infos = '//div[@class="column-right"]'
    x_item_available = './/ul[contains(@class, "product-delivery")]/li[1]/span/span[1]'
    x_item_href = './/div[@class="image"]/a'
    x_item_name = './/div[@class="title"]'
    no_page = 1
    page = ""

    while True:
        driver.get(f"{url}{page}")
        try:
            items = wait.until(EC.presence_of_all_elements_located((By.XPATH, x_items)))
        except TimeoutException:
            break

        for i, (href, name) in enumerate(
            [
                (
                    item.find_element(By.XPATH, x_item_href).get_attribute("href"),
                    item.find_element(By.XPATH, x_item_name).text,
                )
                for item in items
            ]
        ):
            driver.get(href)
            infos = wait.until(EC.presence_of_element_located((By.XPATH, x_item_infos)))
            available = infos.find_elements(By.XPATH, x_item_available)
            available = available[0].text if available else "not available"
            print(f"{no_page}-{i} - {name} - {available}")

        no_page += 1
        page = f"page{no_page}/"

    driver.quit()
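
The test code above locates elements by class name instead of by absolute position. The toy example below illustrates why that tends to survive layout changes. It uses the standard library's `xml.etree.ElementTree`, which supports only a small XPath subset, and both page snippets are invented for illustration rather than taken from the real site.

```python
import xml.etree.ElementTree as ET

# Two invented versions of a page: the second gains a banner <div> at the top.
OLD_PAGE = "<html><body><div><ul><li class='item'>drill</li></ul></div></body></html>"
NEW_PAGE = (
    "<html><body><div class='banner'/>"
    "<div><ul><li class='item'>drill</li></ul></div></body></html>"
)

ABSOLUTE = "./body/div[1]/ul/li"    # depends on the exact position of each div
RELATIVE = ".//li[@class='item']"   # depends only on the element's class


def find_texts(markup: str, path: str) -> list:
    """Return the text of every element matching path in the parsed markup."""
    root = ET.fromstring(markup)
    return [element.text for element in root.findall(path)]


if __name__ == "__main__":
    for label, page in (("old", OLD_PAGE), ("new", NEW_PAGE)):
        print(label, "absolute:", find_texts(page, ABSOLUTE))
        print(label, "relative:", find_texts(page, RELATIVE))
```

The absolute path stops matching as soon as the banner shifts the `div` indexes, while the class-based path still finds the item.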

realdronos commented 2 years ago

Thanks for the XPath hints, but I still get a white screenshot in headless "True" mode when the script runs on the server. It never even reaches the print(f"{no_page}-{i} - {name} - {available}") line. With headless "False" I get this error:

Traceback (most recent call last):
  File "/root/venv/bin/parse/_Parse_Vseinstrumenti_test.py", line 10, in <module>
    driver = uc.Chrome(options=options)
  File "/usr/local/lib/python3.8/site-packages/undetected_chromedriver/__init__.py", line 401, in __init__
    super(Chrome, self).__init__(
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/chrome/webdriver.py", line 70, in __init__
    super(WebDriver, self).__init__(DesiredCapabilities.CHROME['browserName'], "goog",
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/chromium/webdriver.py", line 93, in __init__
    RemoteWebDriver.__init__(
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 269, in __init__
    self.start_session(capabilities, browser_profile)
  File "/usr/local/lib/python3.8/site-packages/undetected_chromedriver/__init__.py", line 589, in start_session
    super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 360, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 425, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:38899
from chrome not reachable
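
One possible cause of "chrome not reachable" in non-headless mode, not confirmed anywhere in this thread, is that the server has no graphical display for Chrome to open a window on. The sketch below checks the standard `DISPLAY` and `WAYLAND_DISPLAY` environment variables before choosing a mode. The helper names `has_display` and `choose_headless` are hypothetical.

```python
import os
from typing import Mapping, Optional


def has_display(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if the environment names an X11 or Wayland display."""
    env = os.environ if env is None else env
    return bool(env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))


def choose_headless(env: Optional[Mapping[str, str]] = None) -> bool:
    """Fall back to headless mode when no display is available."""
    return not has_display(env)


if __name__ == "__main__":
    print("display available:", has_display())
    print("run headless:", choose_headless())
```

If this reports no display, a headed Chrome would need a virtual display such as Xvfb to start on that machine.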

sebdelsol commented 2 years ago

Your Chrome is not reachable; this has nothing to do with this script. You should update your Chrome... Google bumped Chrome's major version to 103.
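
Since the major version matters here, a quick way to rule out a mismatch is to compare what the two binaries report. This is a minimal sketch that assumes both binaries are on `PATH` under the names `google-chrome` and `chromedriver`. The helper names are hypothetical.

```python
import re
import subprocess
from typing import Optional


def major_version(version_output: str) -> Optional[int]:
    """Extract the major number from text such as 'Google Chrome 103.0.5060.53'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    return int(match.group(1)) if match else None


def binary_version(command: str) -> str:
    """Run '<command> --version' and return its output (binary assumed on PATH)."""
    result = subprocess.run(
        [command, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def same_major(chrome_output: str, driver_output: str) -> bool:
    """Return True when both version strings parse to the same major version."""
    chrome = major_version(chrome_output)
    driver = major_version(driver_output)
    return chrome is not None and chrome == driver


if __name__ == "__main__":
    chrome = binary_version("google-chrome")   # assumed binary name
    driver = binary_version("chromedriver")    # assumed binary name
    print(chrome, "|", driver, "| same major:", same_major(chrome, driver))
```

If the majors differ, `uc.Chrome` also accepts a `version_main` argument to request a driver for a specific Chrome major version.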

realdronos commented 2 years ago

As I wrote, the versions on the server are:

undetected-chromedriver: 3.1.5.post4
selenium: 4.1.3
Google Chrome: 103.0.5060.53
chromedriver: https://chromedriver.storage.googleapis.com/index.html?path=103.0.5060.53/

Another website works fine.

sebdelsol commented 2 years ago

Your env is broken... Are you sure you used the exact same script as provided? This site seems barely protected: I just scraped those 1000 items again and I'm still not detected. And by the way, I edited my code to use more efficient explicit waits.

EDIT: So I know this is not an issue with UC... Please close this issue; it has nothing to do with UC.
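
For readers unfamiliar with the explicit waits mentioned above: the idea is to poll a condition until it holds or a deadline passes, instead of relying on a fixed implicit delay. The sketch below shows that idea in plain Python. It is a conceptual illustration, not Selenium's `WebDriverWait` implementation, and `wait_until` and `WaitTimeout` are hypothetical names.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class WaitTimeout(Exception):
    """Raised when the condition never became truthy before the deadline."""


def wait_until(
    condition: Callable[[], Optional[T]], timeout: float = 5.0, poll: float = 0.5
) -> T:
    """Call condition repeatedly and return its first truthy result."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # pause between polls instead of busy-waiting


if __name__ == "__main__":
    attempts = []

    def element_ready() -> Optional[str]:
        # Stand-in for an element lookup that only succeeds on the third try.
        attempts.append(1)
        return "element" if len(attempts) >= 3 else None

    print(wait_until(element_ready, timeout=2.0, poll=0.1))
```

The wait returns as soon as the condition holds, so a fast page costs little time while a slow one still gets the full timeout.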