seleniumbase / SeleniumBase

📊 Python's all-in-one framework for web crawling, scraping, testing, and reporting. Supports pytest. UC Mode provides stealth. Includes many tools.
https://seleniumbase.io
MIT License
4.46k stars 909 forks

Suddenly unable to bypass CloudFlare challenge (Ubuntu Server) #2842

Closed Jobine23 closed 1 week ago

Jobine23 commented 3 weeks ago

Hello, overnight my instances of SeleniumBase became unable to bypass the CloudFlare challenge (which uses CloudFlare Turnstile).

I was using an older version of SB, so I updated to the latest (4.27.4), and it is still not passing the challenge.

[Image: cloudflare_chal]

I am using your demo code for clicking on the CloudFlare turnstile captcha:

from seleniumbase import SB

def open_the_turnstile_page(sb):
    url = "https://wildbet.gg/"
    sb.driver.uc_open_with_reconnect(url, reconnect_time=5)

def click_turnstile_and_verify(sb):
    sb.switch_to_frame("iframe")
    sb.driver.uc_click("span")
    sb.assert_element("img#captcha-success", timeout=3)

with SB(uc=True, test=True) as sb:
    open_the_turnstile_page(sb)
    try:
        click_turnstile_and_verify(sb)
    except Exception:
        open_the_turnstile_page(sb)
        click_turnstile_and_verify(sb)
    sb.set_messenger_theme(location="top_left")
    sb.post_message("SeleniumBase wasn't detected", duration=3)

If I instead use sb.driver.uc_open_with_reconnect(url, reconnect_time=9999) and click manually, it works. Does this mean they are detecting something?

I also tried adding reconnect_time=5 on uc_click and it did not help.

I'm a big fan of your project and I've been using it for some time :)

JimKarvo commented 3 weeks ago

from seleniumbase import SB

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
with SB(uc=True, test=True, disable_features="UserAgentClientHint", agent=ua) as sb:
    print("getting req catcher") 
    url = "https://jimkarvo.requestcatcher.com/test"
    sb.driver.uc_open_with_reconnect(url, 1)
    breakpoint()

@JimKarvo your reconnect duration seems too small; make it bigger, like 7 or 8. In my case I use 20 and it works just fine.

The above code is just for getting the user agent and all the data that the browser sends to a server when requesting a page.
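For readers who would rather not send their headers to a third-party service, the same inspection can be done locally. The following is a minimal sketch using only the Python standard library; start_header_echo_server is a hypothetical helper name introduced here, not part of SeleniumBase.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def start_header_echo_server(host="127.0.0.1", port=0):
    """Start a background HTTP server that records and echoes request headers.

    Returns (server, captured). `captured` is a list that receives one dict
    of lower-cased header names to values for every GET request.
    Port 0 lets the operating system pick a free port.
    """
    captured = []

    class EchoHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Lower-case the names so lookups do not depend on client casing.
            headers = {k.lower(): v for k, v in self.headers.items()}
            captured.append(headers)
            body = json.dumps(headers, indent=2).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # Keep the console quiet.

    server = HTTPServer((host, port), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, captured


# Possible usage with the script above (not run here):
#   server, captured = start_header_echo_server()
#   url = f"http://127.0.0.1:{server.server_port}/test"
#   ... open `url` in the browser, then inspect captured[-1] ...
#   server.shutdown()
```

The page body shows the headers as JSON, and captured[-1] holds the same data for the most recent request, including user-agent and any sec-ch-ua client hints the browser sent.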

mdmintz commented 3 weeks ago

@JimKarvo This site is a good one for seeing all the headers: https://browserleaks.com/client-hints

@jens4626 My Windows machine had the same UA, and bypassed without issue.

@sakarimov Seeing similar. Changing the User Agent makes a difference.

So what have we learned? Cloudflare made changes. Previously, they only blocked you if they detected Selenium, but now they are blocking you for other things, such as User Agent.

Three types of User Agents now (in combination with UC Mode):

You may have to change your User Agent on Linux to be "Good".

For the "Not that good", you'll need to use pyautogui to click. (They are currently detecting any type of JS used to click Turnstile checkboxes, even with uc_click.)

This should work every time as long as the machine has a GUI:

import pyautogui
from seleniumbase import SB

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"

with SB(uc=True, test=True, agent=ua, disable_features="UserAgentClientHint", incognito=True) as sb:
    url = "https://www.virtualmanager.com/da/login"
    sb.driver.uc_open_with_reconnect(url, 8)
    if sb.is_element_visible("iframe"):
        sb.switch_to_frame("iframe")
        sb.execute_script('document.querySelector("input").focus()')
        sb.disconnect()
        pyautogui.press(" ")
        sb.driver.reconnect(4)
    breakpoint()

jens4626 commented 3 weeks ago

Thanks @mdmintz! The workaround using pyautogui does indeed work. I do hope to see a uc_click fix soon!

jens4626 commented 3 weeks ago

Sadly, the pyautogui workaround does not always seem to bypass it. Hope you're working on a uc_click fix :)

mdmintz commented 3 weeks ago

@jens4626 Make sure all your pyautogui actions happen after the sb.disconnect(). Then when done, call sb.connect() / sb.reconnect() to use Selenium actions again.
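The ordering described above (disconnect, GUI actions, then reconnect) can be made harder to get wrong with a small context manager. This is only a sketch: selenium_disconnected is a hypothetical helper name, and it assumes nothing beyond the sb.disconnect() and sb.connect() calls mentioned in this comment.

```python
from contextlib import contextmanager


@contextmanager
def selenium_disconnected(sb):
    """Disconnect Selenium for the duration of the block, then reconnect.

    The reconnect happens in `finally`, so it runs even if a GUI action
    inside the block raises an exception.
    """
    sb.disconnect()
    try:
        yield sb
    finally:
        sb.connect()


# Possible usage inside a `with SB(uc=True) as sb:` block (not run here):
#   with selenium_disconnected(sb):
#       pyautogui.press(" ")
```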

jens4626 commented 3 weeks ago

@mdmintz It does work, but there is only a 50% chance that it works.

I currently have 3 situations:

  1. The space bar press doesn't get recognized, so it won't bypass. I tried adding a few more presses hoping that would solve it, but it didn't.
  2. The space bar press is recognized as a click, but Cloudflare detects it, so the challenge loops again.
  3. The space bar press is sent and it bypasses Cloudflare.

I did use that code you provided and it works - but just not always.

So not sure what to do tbh.

        sb.switch_to_frame("iframe")
        print("Switched to iframe")

        # Waiting to ensure the iframe is loaded
        time.sleep(2)

        # Focus on the input element
        sb.execute_script('document.querySelector("input").focus()')
        time.sleep(2)

        # Disconnecting the SeleniumBase driver
        print("Disconnecting SB")
        sb.disconnect()
        time.sleep(2)

        # Press the space bar with a short delay in between
        pyautogui.press(" ")
        time.sleep(1)
        pyautogui.press(" ")
        time.sleep(1)
        pyautogui.press(" ")
        time.sleep(1)
        pyautogui.press(" ")
        print("Pressed space four times")

        # Waiting for the actions to complete
        time.sleep(2)

        # Reconnecting the SeleniumBase driver
        print("Reconnecting SB")
        sb.driver.reconnect(4)

mdmintz commented 3 weeks ago

@jens4626 The spacebar from pyautogui might not get recognized if your Selenium window is not the active window on top. Try that (making sure the window is on top and active) while I'm still working on improvements...
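Since the key press is reported in this thread as succeeding only some of the time, a bounded retry around the whole attempt may be more useful than pressing the key several times in a row. The sketch below is generic Python; retry_until_passed and the two callables in the usage comment are hypothetical names, and whether a retry helps in any given case is an assumption, not something confirmed in this thread.

```python
import time


def retry_until_passed(attempt, passed, max_tries=3, pause=2.0):
    """Call attempt() until passed() returns True or max_tries is used up.

    Returns the 1-based number of the attempt after which passed() was
    True, or 0 if every attempt failed.
    """
    for n in range(1, max_tries + 1):
        attempt()
        if passed():
            return n
        if n < max_tries:
            time.sleep(pause)  # Wait before trying again.
    return 0


# Possible usage (not run here). Both callables are your own functions:
#   retry_until_passed(
#       attempt=lambda: press_space_on_checkbox(sb),
#       passed=lambda: challenge_is_gone(sb),
#   )
```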

ismayilibrahimov commented 2 weeks ago

This was also not working for me after disconnecting from a remote Windows 10 machine (Azure VM). So, as @mdmintz mentioned, we have to keep the Chrome window active.

Updated code:

import pyautogui
from seleniumbase import SB

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"

with SB(uc=True, test=True, agent=ua, disable_features="UserAgentClientHint", incognito=True) as sb:
    sb.driver.maximize_window()
    url = "https://www.virtualmanager.com/da/login"
    sb.driver.uc_open_with_reconnect(url, 8)
    if sb.is_element_visible("iframe"):
        sb.switch_to_frame("iframe")
        sb.execute_script('document.querySelector("input").focus()')
        sb.disconnect()
        pyautogui.press(" ")
        sb.driver.reconnect(4)
    breakpoint()

As a note, when I disconnect from Remote Desktop, the Windows GUI is disabled. So, in order to keep your current session active, you can use this instruction.

SaberTawfiq commented 2 weeks ago

I tested https://github.com/sarperavci/CloudflareBypassForScraping. The script uses DrissionPage. When you open the browser normally, it automatically succeeds without clicking to confirm that you are a human. When "Verify you are human" appears, it clicks on the box and succeeds. Can you merge this, or modify uc_click to work on the same principle in SeleniumBase?

EnmeiRyuuDev commented 2 weeks ago

I confirm, the DrissionPage solution bypasses the cloudflare click under Linux.

EnmeiRyuuDev commented 2 weeks ago

The pyautogui.press(" ") solution works consistently under Linux as well. You can make it work in headless mode, and in a multi-process environment, by attaching it to a virtual display. This code worked for me under Ubuntu/Debian (note that headed=True is set, but Selenium will still run headless):

import os
from seleniumbase import SB
import pyautogui
from pyvirtualdisplay.display import Display
disp = Display(visible=True, size=(1366, 768), backend="xvfb", use_xauth=True)
disp.start()

import Xlib.display
pyautogui._pyautogui_x11._display = Xlib.display.Display(os.environ['DISPLAY'])

with SB(uc=True, headed=True) as sb:
    ...

mdmintz commented 2 weeks ago

@EnmeiRyuuDev SeleniumBase uses the built-in sbvirtualdisplay like this:

self._xvfb_display = Display(visible=0, size=(width, height))
self._xvfb_display.start()

Will that work with the code you added?

import Xlib.display
pyautogui._pyautogui_x11._display = Xlib.display.Display(os.environ['DISPLAY'])

I assume you installed this: python-xlib? Does pyautogui need that to succeed on Linux?
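The snippets in this thread read os.environ['DISPLAY'] directly, which raises a bare KeyError when no X display is available. A minimal sketch of a friendlier check follows; require_display is a hypothetical helper name, not part of SeleniumBase, pyautogui, or pyvirtualdisplay.

```python
import os


def require_display(env=None):
    """Return the X display name, or raise a clear error if none is set.

    `env` defaults to os.environ; passing a plain dict makes the function
    easy to test.
    """
    env = os.environ if env is None else env
    display = env.get("DISPLAY", "").strip()
    if not display:
        raise RuntimeError(
            "DISPLAY is not set. Start a virtual display (for example with "
            "pyvirtualdisplay) before using pyautogui."
        )
    return display


# Possible usage after disp.start() (not run here):
#   pyautogui._pyautogui_x11._display = Xlib.display.Display(require_display())
```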

EnmeiRyuuDev commented 2 weeks ago

@mdmintz in my tests, this piece of code was required:

from pyvirtualdisplay.display import Display
disp = Display(visible=True, size=(1366, 768), backend="xvfb", use_xauth=True)
disp.start()

Otherwise, Selenium will not run headless.

Also, headed=True was required, but SB still runs headless, which is perfect; otherwise pyautogui will not work. The code will not work under Windows; it is only valid under Linux. As far as I remember, I only had to install pyvirtualdisplay. In the Linux environment, some packages are also necessary:

sudo apt-get install python3.10-tk python3-dev tk-dev

And rebuilding Python afterwards:

sudo ./configure --enable-optimizations
sudo make -j 2
sudo make altinstall

This is my complete test code:

import os
from seleniumbase import SB
import time
import sys
import random
import math
import pyautogui

from pyvirtualdisplay.display import Display
disp = Display(visible=True, size=(1366, 768), backend="xvfb", use_xauth=True)
disp.start()

import Xlib.display
pyautogui._pyautogui_x11._display = Xlib.display.Display(os.environ['DISPLAY'])

with SB(uc=True, headed=True, proxy=None) as sb:
    print('Started..')
    url = "https://gitlab.com/users/sign_in"
    sb.driver.uc_open_with_reconnect(url, 10)
    if sb.is_element_visible("iframe"):
        sb.switch_to_frame("iframe")
        sb.execute_script('document.querySelector("input").focus()')
        sb.disconnect()
        print('Click..')
        pyautogui.press(" ")
        sb.driver.reconnect(10)
    random_number = random.randint(1000, 9999)
    filename = f"screenshot_{random_number}.png"
    sb.save_screenshot(filename)
    print('End.')

What was interesting is that when running multiple headless instances (20+ Chrome driver instances), they all click independently, without the window-overlapping issue.

jens4626 commented 2 weeks ago

Thanks for the input @mdmintz and @ismayilibrahimov.

I think the issue with not clicking was due to me. But I still face issues with it not being able to bypass as you can see: https://github.com/seleniumbase/SeleniumBase/assets/45258332/ee31069b-8d09-4210-9d70-cac90e5a4b18

It might be caused by a bad IP score combined with now using pyautogui. It was never an issue with uc_click.

I was already using what you mentioned, @ismayilibrahimov, so that's not the problem either.

mdmintz commented 2 weeks ago

More details:

Now, if your User-Agent looks untrustworthy, CloudFlare makes you click the CAPTCHA (which has been improved). If they detect either Selenium in the browser or JavaScript involvement in clicking the CAPTCHA, they don't let the click through. That's why pyautogui is now required for clicking the CAPTCHA if the User-Agent isn't trustworthy enough. The default user-agent set on macOS and Windows by SeleniumBase is generally good enough. On Linux, the default User-Agent might not be good enough: You may need to specify a better one to avoid needing to click the CAPTCHA in that scenario. (Or just use the pyautogui workaround for clicking it... Scroll up to see some examples that use it.)

I'm working on an update that can optionally utilize the pyautogui workaround if needed. That will likely need an update to examples because the existing uc_click might not be good enough if the User-Agent isn't trustworthy enough.

This probably means a new UC Mode Video Tutorial (Part 3) will happen soon to explain the changes.

ismayilibrahimov commented 2 weeks ago

I am using Windows 10 (without headless mode) on Azure, and CloudFlare still requires a click. I think the User-Agent is not the only issue.

mdmintz commented 2 weeks ago

@ismayilibrahimov Azure has a known IP-range (just like AWS or GCP). That's why residential proxies have become so popular lately for web-scraping.

enricodvn commented 2 weeks ago

Were you guys able to use this alternative with proxy?

For me, CF started showing the challenge around the same time, and it only happens when I am using a proxy (on servers).

Local without proxy it works fine. But when I use proxy, even on local env, bam there is the captcha. They somehow are detecting the proxy.

If I try to use the alternative with pyautogui, it works without proxy, but if I use proxy this is what happens:

[Image: asd1]

mdmintz commented 2 weeks ago

@enricodvn Which alternative are you using? The one with Xlib.display? As for proxies, I haven't seen any local issues with using them, although maybe it works better when the time zone of the proxy is in the same time zone as your browser.

enricodvn commented 2 weeks ago

Yes, the last one from https://github.com/seleniumbase/SeleniumBase/issues/2842#issuecomment-2168829685.

Hmm, this time zone setting is interesting. Is there any way I can set it through the driver?

I will try to tweak with it.

mdmintz commented 2 weeks ago

@enricodvn The time zone can be set via execute_cdp_cmd, but CDP changes go away when the driver is disconnected. Would need a way to configure it before the browser is launched. (Also possible that it's unrelated to the time zone difference.)
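As a rough sketch of the execute_cdp_cmd approach mentioned above: the Chrome DevTools Protocol has an Emulation.setTimezoneOverride command that takes an IANA time zone ID. The two function names below are hypothetical helpers, and, as noted in this comment, the override should be expected to disappear once the driver disconnects.

```python
def set_timezone_override(driver, timezone_id):
    """Ask the browser (via CDP) to report `timezone_id` to pages.

    `driver` is assumed to be a Chromium-based Selenium driver that
    provides execute_cdp_cmd(); `timezone_id` is an IANA name such as
    "Europe/Copenhagen".
    """
    return driver.execute_cdp_cmd(
        "Emulation.setTimezoneOverride", {"timezoneId": timezone_id}
    )


def get_page_timezone(driver):
    """Return the time zone that JavaScript in the current page sees."""
    return driver.execute_script(
        "return Intl.DateTimeFormat().resolvedOptions().timeZone"
    )


# Possible usage (not run here):
#   set_timezone_override(sb.driver, "Europe/Copenhagen")
#   print(get_page_timezone(sb.driver))
```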

ankushkumarpatiyal commented 1 week ago

import os
import random
from time import sleep

import pyautogui

# `settings` (with MEDIA_ROOT) comes from the poster's own project and is not shown here.

def open_the_turnstile_page(sb, url):
    sb.driver.uc_open_with_reconnect(url, 15)
    print('getting website')
    screen = 'new.jpg'
    sb.save_screenshot(os.path.join(settings.MEDIA_ROOT, screen))
    if sb.is_element_visible("iframe"):
        print('inside if')
        sb.switch_to_frame("iframe")
        sleep(1)
        print('iframe found')
        sb.execute_script('document.querySelector("input").focus()')
        sb.disconnect()
        pyautogui.press(" ")
        print('pressed')
        file = 'new_screenshot.png'
        print('file saved')
        sb.driver.reconnect(3)
        sb.save_screenshot(os.path.join(settings.MEDIA_ROOT, file))
    random_number = random.randint(1000, 9999)
    filename = f"screenshot_{random_number}.png"
    sb.save_screenshot(os.path.join(settings.MEDIA_ROOT, filename))
    print('saved screenshot')
    page_source = sb.get_page_source()
    print(page_source)

Most of the time, after pyautogui presses the key and I see the 'pressed' print statement in my console, my code stops there. It doesn't fail, but it gets stuck. Why? Can anyone help me out here? I am on an Ubuntu server; on my local machine everything works just fine.

mdmintz commented 1 week ago

This was resolved in 4.28.0 - https://github.com/seleniumbase/SeleniumBase/releases/tag/v4.28.0

Read https://github.com/seleniumbase/SeleniumBase/issues/2865 for all the details. You may need to use the new UC Mode methods in 4.28.0, such as driver.uc_gui_handle_cf(), in order to successfully click through CF CAPTCHA checkboxes on Linux.
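Before relying on the newer UC Mode methods, it may help to confirm that the installed package is at least 4.28.0. The sketch below uses only the standard library; parse_version and seleniumbase_is_at_least are hypothetical helper names, and the simple numeric comparison assumes plain dotted version strings with the same number of parts.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '4.28.0' into (4, 28, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def seleniumbase_is_at_least(minimum="4.28.0", installed=None):
    """Return True if the installed seleniumbase version is >= `minimum`.

    `installed` can be passed explicitly for testing; otherwise the
    version is read from the installed package metadata.
    """
    if installed is None:
        try:
            installed = version("seleniumbase")
        except PackageNotFoundError:
            return False
    return parse_version(installed) >= parse_version(minimum)
```

With the version from the original report, seleniumbase_is_at_least(installed="4.27.4") returns False, which signals that an upgrade is needed before the 4.28.0 methods are available.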