seleniumbase / SeleniumBase

📊 Blazing fast Python framework for web crawling, scraping, testing, and reporting. Supports pytest. Stealth abilities: UC Mode and CDP Mode.
https://seleniumbase.io
MIT License

The CF CAPTCHAs changed again (on Linux) #3111

Closed (mdmintz closed this 2 months ago)

mdmintz commented 2 months ago

CI started failing:

[Screenshot: 2024-09-09 at 10:43:29 AM]

This is how it normally looks when passing:

(PyAutoGUI clicks the CAPTCHA successfully, and then takes you to the real page.)

[Screenshot: 2024-09-09 at 10:45:10 AM]

I'm looking into what changed. Changes come frequently, as you may have seen in UC Mode Video 3: https://www.youtube.com/watch?v=-EpZlhGWo9k, where I talked about "The Great CAPTCHA Duel".

If you figure out what changed before I do, let me know.

mdmintz commented 2 months ago

For discussion, come join us on Discord: https://discord.com/invite/HDk5wYvzEZ.

mdmintz commented 2 months ago

Still trying to figure it out. I even tried nodriver, but that didn't bypass the CAPTCHA on Linux either:

import asyncio
import nodriver
from sbvirtualdisplay.display import Display

async def main():
    browser = await nodriver.start()
    page = await browser.get("https://gitlab.com/users/sign_in")
    # Wait for the challenge page to settle without blocking the event loop.
    await asyncio.sleep(4)
    print(await page.evaluate("document.title"))
    await page.save_screenshot("screenshot.png")

if __name__ == "__main__":
    # Start an Xvfb virtual display so the browser can run headed on Linux.
    disp = Display(
        visible=True, size=(1366, 768), backend="xvfb", use_xauth=True
    )
    disp.start()
    try:
        nodriver.loop().run_until_complete(main())
    finally:
        # Release the display even if main() raises.
        disp.stop()

Similar to SeleniumBase, it also bypasses the CAPTCHA on macOS / Windows.

Maybe CF blocked all Linux access? Or they figured out how to do fingerprinting well (and can now determine the difference between a Desktop Linux machine with a GUI versus a GUI-less Linux Server). Will probably sleep on it. Ideas are welcome. At least automation can still bypass CAPTCHAs on macOS / Windows, meaning that web-scraping servers will need to run there now if the situation isn't handled.

mdmintz commented 2 months ago

Looks like we're just dealing with good old-fashioned IP-Address-blocking. GitHub Actions IP Addresses are now known by CF, and their Turnstiles won't let you through if they spot browsers coming from those IPs (or other known server IPs).

The solution is to change proxy settings to a "safe" IP Address via the proxy arg. Maybe that means using residential proxies, or "special" server IP Addresses that aren't on some block list.
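A minimal sketch of setting the proxy, assuming the `proxy` arg of `SB()` takes a `SERVER:PORT` or `USER:PASS@SERVER:PORT` string. The helper names `build_proxy_string` and `print_title_via_proxy` are made up for illustration, and you'd need to supply your own proxy address:

```python
def build_proxy_string(host, port, username=None, password=None):
    """Return a proxy string in SERVER:PORT or USER:PASS@SERVER:PORT form."""
    server = f"{host}:{port}"
    if username and password:
        return f"{username}:{password}@{server}"
    return server


def print_title_via_proxy(url, proxy):
    """Open `url` in UC Mode through `proxy` and print the page title."""
    # Imported lazily so build_proxy_string() can be used without a browser.
    from seleniumbase import SB

    with SB(uc=True, test=True, proxy=proxy) as sb:
        sb.uc_open_with_reconnect(url, 4)
        print(sb.get_page_title())
```

For example, `print_title_via_proxy("https://gitlab.com/users/sign_in", build_proxy_string("203.0.113.10", 8080))` would route the browser through that address. (203.0.113.10 is a placeholder from a documentation-only range, not a working proxy.)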

mdmintz commented 2 months ago

As it turns out, CF isn't fully blocking on IP Addresses. They're just making you do more work to click the CAPTCHAs.

This worked in GitHub Actions: (Coordinates will be different depending on the site and the environment.)

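from seleniumbase import SB

with SB(uc=True, test=True) as sb:
    import pyautogui
    url = "https://www.virtualmanager.com/en/login"
    # Open the page and stay disconnected from chromedriver.
    sb.uc_open_with_disconnect(url)
    sb.sleep(6)
    # Move to the CAPTCHA checkbox (these coordinates are site-specific).
    pyautogui.moveTo(228, 387, 1.05, pyautogui.easeOutQuad)
    sb.sleep(0.056)
    pyautogui.click()
    sb.sleep(3)
    # Reconnect chromedriver after the click.
    sb.reconnect()
    print(sb.get_page_title())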

[Screenshot: 2024-09-10 at 11:48:23 AM]

Which means you need to either:

  1. Have a good IP Address. (Set proxy to change it.)
  2. OR: Be fully disconnected and click the CAPTCHA.

mdmintz commented 2 months ago

Here's a way to do it without knowing the coordinates in advance:

from seleniumbase import SB
from seleniumbase import config as sb_config

with SB(uc=True, test=True) as sb:
    import pyautogui
    url = "https://www.virtualmanager.com/en/login"
    sb.uc_open_with_reconnect(url, 6)
    print(sb.get_page_title())
    # First attempt: let SeleniumBase find and click the CAPTCHA.
    sb.uc_gui_click_captcha()
    print(sb.get_page_title())
    # Still on the challenge page, and the click coordinates were saved?
    if (
        "Just a moment" in sb.get_page_title()
        and hasattr(sb_config, "_saved_cf_x_y")
    ):
        # Second attempt: reopen while disconnected, click the saved spot.
        sb.uc_open_with_disconnect(url)
        sb.sleep(4)
        pyautogui.click(sb_config._saved_cf_x_y)
        sb.sleep(3)
        sb.reconnect()
    print(sb.get_page_title())

Just swap the URL for the one you need, e.g. https://gitlab.com/users/sign_in:

from seleniumbase import SB
from seleniumbase import config as sb_config

with SB(uc=True, test=True) as sb:
    import pyautogui
    url = "https://gitlab.com/users/sign_in"
    sb.uc_open_with_reconnect(url, 6)
    print(sb.get_page_title())
    sb.uc_gui_click_captcha()
    print(sb.get_page_title())
    if (
        "Just a moment" in sb.get_page_title()
        and hasattr(sb_config, "_saved_cf_x_y")
    ):
        sb.uc_open_with_disconnect(url)
        sb.sleep(4)
        pyautogui.click(sb_config._saved_cf_x_y)
        sb.sleep(3)
        sb.reconnect()
    print(sb.get_page_title())

Limitation: this won't work in multithreaded scripts where more than one window is automated at the same time, because PyAutoGUI controls a single mouse pointer.

Otherwise, Linux users can use this on the current version of SeleniumBase. (Quite possibly, this will need to be used soon on macOS and Windows too.)
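Since the two scripts above differ only in the URL, the fallback could be wrapped in a helper. This is just a sketch: `open_past_challenge` is a made-up name, `config` is expected to be `sb_config`, and `click` is expected to be `pyautogui.click` (passed in so that the import can stay inside the `with SB(...)` block, as in the scripts above):

```python
CHALLENGE_TITLE = "Just a moment"


def open_past_challenge(sb, url, config, click, reconnect_time=6):
    """Open `url`, click the CAPTCHA, and retry once while disconnected.

    Returns True if the page title no longer looks like the challenge page.
    """
    sb.uc_open_with_reconnect(url, reconnect_time)
    # First attempt: let SeleniumBase find and click the CAPTCHA.
    sb.uc_gui_click_captcha()
    saved_xy = getattr(config, "_saved_cf_x_y", None)
    if CHALLENGE_TITLE in sb.get_page_title() and saved_xy is not None:
        # Second attempt: reopen while disconnected, click the saved spot.
        sb.uc_open_with_disconnect(url)
        sb.sleep(4)
        click(saved_xy)
        sb.sleep(3)
        sb.reconnect()
    return CHALLENGE_TITLE not in sb.get_page_title()
```

Inside the `with` block, after `import pyautogui`, the call would look like `open_past_challenge(sb, url, sb_config, pyautogui.click)`. The same limitation applies: one automated window at a time.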

mdmintz commented 2 months ago

This script is all you need to bypass CF on GitHub Actions:

from seleniumbase import SB

with SB(uc=True, test=True) as sb:
    url = "https://gitlab.com/users/sign_in"
    sb.uc_open_with_reconnect(url, 4)
    sb.uc_gui_click_captcha()
    print(sb.get_page_title())

Swap the URL above for the one you need.
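If you want CI to fail loudly when the bypass doesn't work, one option is to poll the page title until the "Just a moment" challenge title goes away. A rough sketch (the helper name `wait_until_past_challenge` and its defaults are made up, not part of SeleniumBase):

```python
import time


def wait_until_past_challenge(
    get_title, timeout=10.0, interval=0.5, marker="Just a moment"
):
    """Poll get_title() until `marker` is gone or `timeout` seconds pass.

    Returns True once the title no longer contains `marker`, else False.
    """
    deadline = time.monotonic() + timeout
    while True:
        if marker not in get_title():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Inside the `with` block, `assert wait_until_past_challenge(sb.get_page_title)` after `sb.uc_gui_click_captcha()` would stop the run if the CAPTCHA page is still showing.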


Also, SeleniumBase 4.30.4 is here. (You'll see some Linux improvements.)