seleniumbase / SeleniumBase

📊 Python's all-in-one framework for web crawling, scraping, testing, and reporting. Supports pytest. UC Mode provides stealth. Includes many tools.
https://seleniumbase.io
MIT License

the script is detected as bot #3059

Closed zqxyus closed 2 weeks ago

zqxyus commented 2 weeks ago

I first used code similar to the snippet below to open a website that I need to crawl, but access was blocked. I then used the following code to visit https://antoinevastel.com/bots/. The results show that the script is detected as a bot.

from seleniumbase import SB

with SB(uc=True, incognito=True, test=True) as sb:
    url = "https://antoinevastel.com/bots/"
    server = "f.proxys5.net:6200"
    username = "00007-zone-custom-region-DE-sessid-NkivelA2-sessTime-15"  # scrapeops
    password = "tHx19d0nTan"
    sb.set_wire_proxy(f"{username}:{password}@{server}")
    driver = sb.driver.uc_open_with_reconnect(url, 21)
    sb.sleep(93)

The results are listed below. The scanner reports each test as one of:

Consistent: The scanner did not detect any anomaly.
Unsure: The scanner considers that the attributes tested could indicate the presence of a bot, but there is still a chance that it is a human.
Inconsistent: The scanner considers that the attributes tested indicate the presence of a bot.

Test | Result | Data

PHANTOM_UA | Consistent | {"userAgent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"}

PHANTOM_PROPERTIES | Consistent | {"attributesFound":[false,false,false]}

PHANTOM_ETSL | Consistent | {"etsl":33}

PHANTOM_LANGUAGE | Consistent | {"languages":["en-US"]}

PHANTOM_WEBSOCKET | Consistent | {}

MQ_SCREEN | Consistent | {}

PHANTOM_OVERFLOW | Consistent | {"depth":9649,"errorMessage":"Maximum call stack size exceeded","errorName":"RangeError","errorStacklength":711}

PHANTOM_WINDOW_HEIGHT | Consistent | {"wInnerHeight":709,"wOuterHeight":840,"wOuterWidth":1280,"wInnerWidth":1236,"wScreenX":80,"wPageXOffset":0,"wPageYOffset":0,"cWidth":1221,"cHeight":812,"sWidth":1920,"sHeight":1080,"sAvailWidth":1850,"sAvailHeight":1053,"sColorDepth":24,"sPixelDepth":24,"wDevicePixelRatio":1}

HEADCHR_UA | Consistent | {"userAgent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"}

WEBDRIVER | Inconsistent | {}

HEADCHR_CHROME_OBJ | Consistent | {}

HEADCHR_PERMISSIONS | Consistent | {}

HEADCHR_PLUGINS | Consistent | {"plugins":["PDF Viewer::Portable Document Format::internal-pdf-viewer::application/pdf~pdf~Portable Document Format,text/pdf~pdf~Portable Document Format","Chrome PDF Viewer::Portable Document Format::internal-pdf-viewer::application/pdf~pdf~Portable Document Format,text/pdf~pdf~Portable Document Format","Chromium PDF Viewer::Portable Document Format::internal-pdf-viewer::application/pdf~pdf~Portable Document Format,text/pdf~pdf~Portable Document Format","Microsoft Edge PDF Viewer::Portable Document Format::internal-pdf-viewer::application/pdf~pdf~Portable Document Format,text/pdf~pdf~Portable Document Format","WebKit built-in PDF::Portable Document Format::internal-pdf-viewer::application/pdf~pdf~Portable Document Format,text/pdf~pdf~Portable Document Format"]}

HEADCHR_IFRAME | Consistent | {}

CHR_DEBUG_TOOLS | Consistent | {}

SELENIUM_DRIVER | Consistent | {"attributesFound":[false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false]}

CHR_BATTERY | Consistent | {}

CHR_MEMORY | Consistent | {}

TRANSPARENT_PIXEL | Consistent | {"0":0,"1":0,"2":0,"3":0}

SEQUENTUM | Consistent | {}

VIDEO_CODECS | Consistent | {"h264":"probably"}

How can I get around this? Thanks!

mdmintz commented 2 weeks ago

Duplicate of https://github.com/seleniumbase/SeleniumBase/issues/1912.

When I run the following script, I get the same result as when using a regular Chrome browser, so the Inconsistent value there isn't accurate.

from seleniumbase import SB

with SB(uc=True, incognito=True, test=True) as sb:
    url = "https://antoinevastel.com/bots/"
    sb.uc_open_with_reconnect(url, 8)

The https://pixelscan.net/ website is a better test for bots. SeleniumBase UC Mode goes undetected.

from seleniumbase import SB

with SB(uc=True, incognito=True, test=True) as sb:
    url = "https://pixelscan.net/"
    sb.uc_open_with_reconnect(url, 10)
    sb.remove_elements("jdiv")  # Remove chat widgets
    sb.assert_text("No automation framework detected", "pxlscn-bot-detection")
    not_masking = "You are not masking your fingerprint"
    sb.assert_text(not_masking, "pxlscn-fingerprint-masking")
    sb.highlight("span.text-success", loops=8)
    sb.sleep(1)
    sb.highlight("pxlscn-fingerprint-masking div", loops=9, scroll=False)
    sb.sleep(1)
    sb.highlight("div.bot-detection-context", loops=10, scroll=False)
    sb.sleep(2)

zqxyus commented 2 weeks ago

I used the following code, but access is blocked.

from seleniumbase import SB

with SB(uc=True, incognito=True, test=True) as sb:
    url = "https://rendezvousparis.hermes.com/client/welcome"
    sb.uc_open_with_reconnect(url, 10)
    sb.sleep(3)

mdmintz commented 2 weeks ago

That page blocked me in my regular Chrome browser (no Selenium). Also, that's not a Cloudflare page. Right now, UC Mode is specifically designed for bypassing Cloudflare and some other anti-bot services.

zqxyus commented 2 weeks ago

@mdmintz How can I get past it? Could you give me any ideas or guidelines? Thanks!

mdmintz commented 2 weeks ago

You can try changing your proxy settings, but otherwise there's not much that can be done if it blocks regular Chrome browsers.

zqxyus commented 2 weeks ago

@mdmintz Thank you! I have another question: I found little information about proxy server settings in the SeleniumBase documentation. The following is demo code for setting a proxy server with plain Selenium. How do I set the proxy server if I use SeleniumBase?

from selenium import webdriver

# ScrapeOps Proxy setup (module level so that main() can also see target_url)
proxy_url = "proxy.scrapeops.io:8000"
api_key = "YOUR_API_KEY"  # Replace this with your ScrapeOps API key
target_url = "http://mywebsite.com/"
bypass_level = "generic_level_1"  # Choose the appropriate bypass level


def setup_driver():
    # Set up Selenium with the ScrapeOps Proxy
    proxy = f"http://{api_key}:{proxy_url}/?target_url={target_url}&bypass={bypass_level}"
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument(f"--proxy-server={proxy}")

    # Initialize the WebDriver
    driver = webdriver.Chrome(options=chrome_options)
    return driver


def main():
    driver = setup_driver()
    driver.get(target_url)
    # E.g. let's extract the title of the webpage
    print("Page title:", driver.title)
    driver.quit()


if __name__ == "__main__":
    main()

mdmintz commented 2 weeks ago

Set the proxy arg: https://github.com/seleniumbase/SeleniumBase/blob/119ec4bcf38b45d78e77680816aed6d1c24b9b52/seleniumbase/plugins/sb_manager.py#L41
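
For illustration, here is a minimal sketch of passing the proxy arg to SB(). It assumes the proxy string takes the form "SERVER:PORT" or "USER:PASS@SERVER:PORT". The helper names build_proxy_string and open_with_proxy are hypothetical (not part of SeleniumBase), and the server and credentials in the usage note are placeholders.

```python
def build_proxy_string(server, username=None, password=None):
    """Return "SERVER:PORT", or "USER:PASS@SERVER:PORT" if credentials are given.

    Hypothetical helper for illustration; not part of SeleniumBase.
    """
    if username and password:
        return f"{username}:{password}@{server}"
    return server


def open_with_proxy(url, proxy_string, reconnect_time=8):
    """Open a page in UC Mode through the given proxy and print its title."""
    # Imported here so build_proxy_string() works without SeleniumBase installed.
    from seleniumbase import SB

    # Assumption: the proxy arg accepts the string format described above.
    with SB(uc=True, proxy=proxy_string) as sb:
        sb.uc_open_with_reconnect(url, reconnect_time)
        print("Page title:", sb.get_title())
```

With placeholder values, this could be called as open_with_proxy("https://pixelscan.net/", build_proxy_string("proxy.example.com:8000", "USERNAME", "PASSWORD")). Whether a given site still blocks the request depends on the proxy itself, as noted earlier in the thread.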