UC got detected - Githubissues

realdronos commented 2 years ago

Hello. I have an Ubuntu server running Python, UC and Google Chrome. Until the latest browser update, the script on the server worked fine. After the update, the script started raising this error:

selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"/html/body/div[6]/div[3]/div/div/div/div/main/div/div[2]/div[3]/div/div/div/ul/li[3]"} (Session info: headless chrome=103.0.5060.53)

I set headless to False and took a screenshot: a white screen. I suspect that the site has started detecting UC. As a workaround I want to downgrade Chrome to a previous version, so my question is: where can I find previous versions of UC, and how do I install the correct one?

Current versions:
undetected-chromedriver - 3.1.5.post4
selenium - 4.1.3
Google Chrome - 103.0.5060.53
chromedriver - https://chromedriver.storage.googleapis.com/index.html?path=103.0.5060.53/

P.S. On my own computer the script starts and runs without any problems.

Code for test:

import pandas as pd
import numpy as np
import re
import time
import random
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.service import Service
from selenium.common.exceptions import InvalidSelectorException
import undetected_chromedriver as uc
from selenium.webdriver.support.ui import Select
from datetime import datetime

# useragent = UserAgent()
user_agent_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36'
]

# options
options = webdriver.ChromeOptions()

# user-agent
options.add_argument(f"user-agent={random.choice(user_agent_list)}")

# disable webdriver
options.add_argument("--disable-blink-features=AutomationControlled")

# headless mode
options.add_argument("--headless")

# added modes
# options.add_argument("start-maximized")
options.add_argument("--no-sandbox")
options.add_argument("--disable-infobars")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--window-size=1920x1080")
# options.add_argument("--disable-browser-side-navigation")
options.add_argument("--disable-notifications")
options.add_argument("--disable-gpu")
options.add_argument('--verbose')

if __name__ == '__main__':
    driver = uc.Chrome(executable_path= '/root/venv/bin/parse/chromedriver', options=options)

    url = 'https://spb.vseinstrumenti.ru/instrument/shurupoverty/akkumulyatornye-dreli/'
    page = 1
    goods_len = 1
    df_vi1 = pd.DataFrame(columns=['id', 'good', 'url', 'price'])

    driver.get(url=url)
    driver.implicitly_wait(1)

    catalog_wrapper = driver.find_elements(By.XPATH,
                                           '/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[6]/div[2]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[6]/div[3]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[7]/div[2]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[4]/div[*]/div[6]/div[2]/div[2]/a|/html/body/div[6]/div[2]/div/div/div/div/main/div/div[6]/div[*]/div[6]/div[2]/div[2]/a|/html/body/div[6]/div[2]/div/div/div/div/main/div/div[6]/div[*]/div[7]/div[3]/div[2]/a|/html/body/div[6]/div[2]/div/div/div/div/main/div/div[6]/div[*]/div[7]/div[2]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[7]/div[3]/div[2]/a')
    goods_len = len(catalog_wrapper)
    print(f'URL: {url}, Goods_available_1: {goods_len} \n')

    number_count = driver.find_elements(By.XPATH, '/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[2]/div[4]/p[1]')
    number = len(number_count)
    for number in range(number):
        button_plitka = driver.find_element(By.XPATH, '/html/body/div[6]/div[3]/div/div/div/div/main/div/div[2]/div[3]/a')
        driver.execute_script("arguments[0].click();", button_plitka)
        if number == 0:
            break

    driver.implicitly_wait(2)
    goods_count = driver.find_element(By.XPATH,
                                      '/html/body/div[6]/div[3]/div/div/div/div/main/div/div[2]/div[3]/div/div/div/ul/li[3]')
    driver.execute_script("arguments[0].click();", goods_count)

    driver.implicitly_wait(2)
    catalog_wrapper = driver.find_elements(By.XPATH,
                                           '/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[6]/div[2]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[6]/div[3]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[7]/div[2]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[4]/div[*]/div[6]/div[2]/div[2]/a|/html/body/div[6]/div[2]/div/div/div/div/main/div/div[6]/div[*]/div[6]/div[2]/div[2]/a|/html/body/div[6]/div[2]/div/div/div/div/main/div/div[6]/div[*]/div[7]/div[3]/div[2]/a|/html/body/div[6]/div[2]/div/div/div/div/main/div/div[6]/div[*]/div[7]/div[2]/div[2]/a|/html/body/div[6]/div[3]/div/div/div/div/main/div/div[3]/div[*]/div[7]/div[3]/div[2]/a')
    goods_len = len(catalog_wrapper)
    print(f'URL: {url}, Goods_available_2: {goods_len} \n')

    driver.close()
    driver.quit()

    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    print('Done at:', current_time)

sebdelsol commented 2 years ago

No, it's not:

Headless is often detected; there's a disclaimer in the README about that. You're lucky it works in your case... for now.
Don't spoof your user agent. Spoofing is very easily detected, especially through OS and browser discrepancies.
Refrain from using all those unneeded options: you don't want a broken and/or detectable config.
If you're not running headless, you should keep --start-maximized, though, because a maximized viewport looks less suspicious... unfortunately, you misspelled it.
Please learn how to use XPath instead of copying unreliable absolute paths from Chrome DevTools.
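To illustrate the last point, here is a minimal sketch of why an absolute, position-based path breaks as soon as the page layout shifts, while an attribute-based relative path keeps working. The two page snippets and the class name `product-tile` are hypothetical, and the standard library's `xml.etree.ElementTree` is used only as a stand-in that supports a subset of XPath; in Selenium the same idea applies with full XPath expressions.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page. The second one adds a banner
# <div> before the listing, which shifts every position-based index.
PAGE_V1 = """
<html><body>
  <div><div class="product-tile"><a href="/drill-1">Drill 1</a></div></div>
</body></html>
"""
PAGE_V2 = """
<html><body>
  <div class="banner">Sale</div>
  <div><div class="product-tile"><a href="/drill-1">Drill 1</a></div></div>
</body></html>
"""

# Position-based path, similar in spirit to what DevTools "Copy full XPath" gives.
ABSOLUTE = "./body/div[1]/div/a"
# Attribute-based path that does not care where the tile sits in the tree.
RELATIVE = ".//div[@class='product-tile']/a"


def hrefs(page, path):
    """Return the href of every <a> element matched by path."""
    root = ET.fromstring(page.strip())
    return [a.get("href") for a in root.findall(path)]


if __name__ == "__main__":
    for label, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(label, "absolute:", hrefs(page, ABSOLUTE))
        print(label, "relative:", hrefs(page, RELATIVE))
```

The absolute path finds the link only in the first version of the page, whereas the relative path finds it in both.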

EDIT:

Please use Selenium explicit waits; they're more efficient.

My test code scraped those 1000 items without a hitch:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

import undetected_chromedriver as uc

if __name__ == "__main__":
    options = uc.ChromeOptions()
    options.headless = True  # you're lucky headless works for this site... for now
    driver = uc.Chrome(options=options)
    wait = WebDriverWait(driver, 5)

    url = "https://spb.vseinstrumenti.ru/instrument/shurupoverty/akkumulyatornye-dreli/"
    x_items = '//div[@class="listing-grid"][1]//div[contains(@class, "product-tile grid-item")]'
    x_item_infos = '//div[@class="column-right"]'
    x_item_available = './/ul[contains(@class, "product-delivery")]/li[1]/span/span[1]'
    x_item_href = './/div[@class="image"]/a'
    x_item_name = './/div[@class="title"]'
    no_page = 1
    page = ""

    while True:
        driver.get(f"{url}{page}")
        try:
            items = wait.until(EC.presence_of_all_elements_located((By.XPATH, x_items)))
        except TimeoutException:
            break

        for i, (href, name) in enumerate(
            [
                (
                    item.find_element(By.XPATH, x_item_href).get_attribute("href"),
                    item.find_element(By.XPATH, x_item_name).text,
                )
                for item in items
            ]
        ):
            driver.get(href)
            infos = wait.until(EC.presence_of_element_located((By.XPATH, x_item_infos)))
            available = infos.find_elements(By.XPATH, x_item_available)
            available = available[0].text if available else "not available"
            print(f"{no_page}-{i} - {name} - {available}")

        no_page += 1
        page = f"page{no_page}/"

    driver.quit()
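
For readers unfamiliar with explicit waits, the following is a rough sketch of the polling idea behind them, written with the standard library only. It is not Selenium's implementation; `wait_until` and `FakePage` are hypothetical names made up for this illustration. The point is that the wait returns as soon as the condition holds, instead of always sleeping for a fixed time.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # return immediately once the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


class FakePage:
    """Hypothetical stand-in for a page whose items appear after a few polls."""

    def __init__(self, ready_after):
        self.calls = 0
        self.ready_after = ready_after

    def find_items(self):
        self.calls += 1
        return ["item"] if self.calls >= self.ready_after else []


if __name__ == "__main__":
    page = FakePage(ready_after=3)
    items = wait_until(page.find_items, timeout=2.0, poll=0.1)
    print(items, "found after", page.calls, "polls")
```

In the script above, `wait.until(EC.presence_of_all_elements_located(...))` plays the role of `wait_until`, and the `TimeoutException` branch is what ends the pagination loop.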

realdronos commented 2 years ago

Thanks for the XPath hints, but I still get a white screenshot in headless "True" mode when the script runs on the server. It doesn't even reach the print(f"{no_page}-{i} - {name} - {available}") calls. With headless "False" I get this error:

Traceback (most recent call last):
  File "/root/venv/bin/parse/_Parse_Vseinstrumenti_test.py", line 10, in <module>
    driver = uc.Chrome(options=options)
  File "/usr/local/lib/python3.8/site-packages/undetected_chromedriver/__init__.py", line 401, in __init__
    super(Chrome, self).__init__(
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/chrome/webdriver.py", line 70, in __init__
    super(WebDriver, self).__init__(DesiredCapabilities.CHROME['browserName'], "goog",
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/chromium/webdriver.py", line 93, in __init__
    RemoteWebDriver.__init__(
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 269, in __init__
    self.start_session(capabilities, browser_profile)
  File "/usr/local/lib/python3.8/site-packages/undetected_chromedriver/__init__.py", line 589, in start_session
    super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 360, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 425, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:38899
from chrome not reachable

sebdelsol commented 2 years ago

Your Chrome is not reachable; this has nothing to do with this script. You should update your Chrome... Google bumped Chrome's major version to 103.
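
A quick way to rule out a version mismatch is to compare the major versions of Chrome and chromedriver, since the two generally need to agree. The sketch below only parses version strings; `major` and `same_major` are hypothetical helper names, and the second version string in the usage example is made up for illustration.

```python
def major(version: str) -> int:
    """Return the major component of a dotted version such as '103.0.5060.53'."""
    return int(version.strip().split(".")[0])


def same_major(chrome_version: str, driver_version: str) -> bool:
    """True when Chrome and chromedriver share the same major version."""
    return major(chrome_version) == major(driver_version)


if __name__ == "__main__":
    # Versions quoted in this thread.
    print(same_major("103.0.5060.53", "103.0.5060.53"))
    # Hypothetical mismatch: an older driver against Chrome 103.
    print(same_major("103.0.5060.53", "102.0.0.0"))
```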

realdronos commented 2 years ago

As I wrote, the versions on the server are:

undetected-chromedriver - 3.1.5.post4
selenium - 4.1.3
Google Chrome - 103.0.5060.53
chromedriver - https://chromedriver.storage.googleapis.com/index.html?path=103.0.5060.53/

Another website works fine.

sebdelsol commented 2 years ago

Your env is broken... Are you sure you used the exact same script as provided? This site seems barely protected: I just scraped those 1000 items again and I'm still not detected. And by the way, I edited my code to use more efficient explicit waits.

EDIT: So I know this is not an issue with UC... please close this issue, it has nothing to do with UC.
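
For anyone hitting the same "chrome not reachable" error on a server, here is a small environment check that may help narrow things down. It rests on an assumption this thread does not confirm: that a non-headless Chrome cannot start on a Linux machine with no display available. The helper names `display_available` and `find_chrome`, and the list of candidate binary names, are hypothetical.

```python
import os
import shutil


def display_available(env=None):
    """True if the environment advertises an X11 or Wayland display."""
    env = os.environ if env is None else env
    return bool(env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))


def find_chrome(candidates=("google-chrome", "google-chrome-stable",
                            "chromium", "chromium-browser")):
    """Return the path of the first candidate binary found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print("display available:", display_available())
    print("chrome binary:", find_chrome())
```

If no display is reported, a non-headless run on that machine would need a virtual display of some kind; that is a guess about this setup, not a confirmed diagnosis.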

ultrafunkamsterdam / undetected-chromedriver

UC got detected #696