omkarcloud / botasaurus

The All in One Framework to build Awesome Scrapers.
https://www.omkar.cloud/botasaurus/
MIT License
1.14k stars, 103 forks

Botasaurus can't pass CF #147

Open JimKarvo opened 4 days ago

JimKarvo commented 4 days ago

Cloudflare (CF) seems to be able to detect Botasaurus. My IP is not banned and the problem is not OS-related: I see the same behavior on Windows 11 and on Ubuntu Server.

If I omit the "wait" parameter, I get a different error (such as the "id" not being found).

The script:

from botasaurus.browser import browser, Driver

@browser(add_arguments=['--no-sandbox'])
def scrape_heading_task(driver: Driver, data):
    # Visit the GitLab sign-in page via Google, with the Cloudflare bypass enabled
    driver.google_get("https://gitlab.com/users/sign_in", bypass_cloudflare=True, wait=10)

    # Retrieve the heading element's text
    heading = driver.get_text("h1")

    # Save the data as a JSON file in output/scrape_heading_task.json
    return {
        "heading": heading
    }

# Initiate the web scraping task
scrape_heading_task()

the log:

Traceback (most recent call last):
  File "/root/.venv/lib/python3.12/site-packages/botasaurus/browser_decorator.py", line 176, in run_task
    result = func(driver, data)
             ^^^^^^^^^^^^^^^^^^
  File "/root/pricecheckgrbots/delete.py", line 6, in scrape_heading_task
    driver.google_get("https://gitlab.com/users/sign_in", bypass_cloudflare=True, wait=10)
  File "/root/.venv/lib/python3.12/site-packages/botasaurus_driver/driver.py", line 536, in google_get
    self.get_via(link, "https://www.google.com/", bypass_cloudflare=bypass_cloudflare, wait=wait)
  File "/root/.venv/lib/python3.12/site-packages/botasaurus_driver/driver.py", line 522, in get_via
    self.detect_and_bypass_cloudflare()
  File "/root/.venv/lib/python3.12/site-packages/botasaurus_driver/driver.py", line 878, in detect_and_bypass_cloudflare
    bypass_if_detected(self)
  File "/root/.venv/lib/python3.12/site-packages/botasaurus_driver/solve_cloudflare_captcha.py", line 122, in bypass_if_detected
    wait_till_cloudflare_leaves(driver, previous_ray_id, raise_exception)
  File "/root/.venv/lib/python3.12/site-packages/botasaurus_driver/solve_cloudflare_captcha.py", line 64, in wait_till_cloudflare_leaves
    raise CloudflareDetectionException()
botasaurus_driver.exceptions.CloudflareDetectionException: Cloudflare has detected us.
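
The traceback shows the failure surfacing from `google_get` as `CloudflareDetectionException`. One way to experiment is to retry the call a few times with a pause in between. Below is a minimal sketch in plain Python, not part of Botasaurus: `retry_on` and its parameters are hypothetical names. Whether a retry actually gets past the challenge is not something this thread confirms.

```python
import time
from typing import Callable, Tuple, Type, TypeVar, Union

T = TypeVar("T")


def retry_on(
    exc_types: Union[Type[BaseException], Tuple[Type[BaseException], ...]],
    func: Callable[[], T],
    attempts: int = 3,
    delay: float = 5.0,
) -> T:
    """Call func() up to `attempts` times, retrying only on `exc_types`.

    Hypothetical helper: sleeps `delay` seconds between attempts and
    re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exc_types:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay)
    raise AssertionError("unreachable")
```

In the script above this could wrap the navigation step, for example `retry_on(CloudflareDetectionException, lambda: driver.google_get(url, bypass_cloudflare=True))`, importing the exception from `botasaurus_driver.exceptions` as the traceback indicates.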

kreethandsouza commented 2 days ago

I tried your code on my Ubuntu system and it works fine for me. If you still have no luck, try this:

from botasaurus.browser import browser, Driver
import time

@browser(add_arguments=['--no-sandbox'])
def scrape_heading_task(driver: Driver, data):
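    # Open the sign-in page via Google, without the built-in Cloudflare bypass
    driver.google_get("https://gitlab.com/users/sign_in")
    # Give the Turnstile widget a moment to load
    time.sleep(2)
    # Look inside the Turnstile iframe for the checkbox label and click it if present
    iframe = driver.select_iframe("#turnstile-wrapper iframe")
    checkbox = iframe.select('label', None)
    if checkbox:
        checkbox.click()
    # Pause so the browser state can be inspected, then capture a screenshot
    driver.prompt()
    driver.save_screenshot()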

    heading = driver.get_text("h1")
    return heading

# Initiate the web scraping task
scrape_heading_task()

If necessary, you might have to use proxies to access the site.

JimKarvo commented 2 days ago

Still not working on Ubuntu Server (no GUI).

The server has the same IP as my Windows machine. On Windows the script works without any problems.

On Linux I tried this:

from botasaurus.browser import browser, Driver
import time

@browser(add_arguments=['--no-sandbox'])
def scrape_heading_task(driver: Driver, data):
    driver.google_get("https://gitlab.com/users/sign_in")
    time.sleep(10)
    iframe = driver.select_iframe("#turnstile-wrapper iframe")
    driver.save_screenshot()
    checkbox = iframe.select('label', None)
    if checkbox:
        print("detected checkbox")
        checkbox.click()
    time.sleep(1)
    driver.save_screenshot()
    driver.prompt()
    driver.save_screenshot()

    heading = driver.get_text("h1")
    return heading

# Initiate the web scraping task
scrape_heading_task()

It seems that the checkbox isn't clicked (see the second screenshot). If I increase the sleep from 10 to 30 seconds, the Turnstile widget disappears!
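
Since a fixed sleep of 10 seconds behaved differently from 30 seconds here, one option is to poll for a condition instead of guessing a duration. Below is a minimal sketch in plain Python that assumes nothing about the Botasaurus API: `wait_until` and its parameters are hypothetical names, and the predicate passed in is where a driver-side check would go.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
) -> bool:
    """Poll predicate() until it returns True or `timeout` seconds pass.

    Hypothetical helper: returns True if the predicate succeeded in time,
    False if the deadline was reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline
        time.sleep(min(interval, remaining))
```

Inside the task this could replace `time.sleep(10)`, for example `wait_until(lambda: challenge_is_gone(driver), timeout=30)`, where `challenge_is_gone` is a hypothetical function you would write using whatever selector check works on the page.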