seleniumbase / SeleniumBase

📊 Blazing fast Python framework for web crawling, scraping, testing, and reporting. Supports pytest. Stealth abilities: UC Mode and CDP Mode.
https://seleniumbase.io
MIT License
5.44k stars, 983 forks

[Help] Correct way to re-use/clear data #3245

Closed. ProtocolNebula closed this issue 3 weeks ago.

ProtocolNebula commented 3 weeks ago

Hi,

I'm working with UC Mode. Starting about two weeks ago, the website I crawl (possibly because of Cloudflare) began detecting that my requests keep coming from the same machine, even though I use a rotating proxy.

I worked around it by closing and reopening SeleniumBase on every iteration, but that is very slow because it deletes a lot of things, and I'm not sure the whole process is really necessary.

My code

(I removed some of the unnecessary code for this example.)

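from seleniumbase import SB


def getSBInstance(proxy_string=""):
    # Build a new SB context manager configured for UC Mode
    sbInstance = SB(
        uc=True,
        # test=True,
        locale_code="en",
        # uc_cdp_events=True,
        # undetectable=True,
        # undetected=True,
        proxy=proxy_string,
        extension_dir="custom_extensions/block_assets",
    )

    return sbInstance


def main():
    while True:
        # A brand-new browser is started on every iteration
        with getSBInstance() as sb:
            # logic here
            # Explicitly close the window and quit the driver before looping
            sb.driver.close()
            sb.driver.quit()


if __name__ == '__main__':
    main()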

Logs generated

I added some reference comments and log messages to show what the code is actually doing.

[2024-11-05 03:25:05,804] [DEBUG] [libs.selenium_runner] Getting SeleniumBase instance
[2024-11-05 03:25:05,804] [DEBUG] [libs.selenium_runner] Creating a new SeleniumBase instance
[2024-11-05 03:25:05,972] [DEBUG] [selenium.webdriver.common.driver_finder] Skipping Selenium Manager; path to chrome driver specified in Service class: /home/dev/repos/myProject/.venv/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver
[2024-11-05 03:25:05,973] [DEBUG] [selenium.webdriver.common.service] Started executable: `/home/dev/repos/myProject/.venv/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 4140493 using 0 to output -1
[2024-11-05 03:25:06,382] [INFO] [libs.selenium_runner] (DOING)
[2024-11-05 03:25:06,382] [DEBUG] [libs.selenium_runner] Opening URL: https://www.targetUrl.com
[2024-11-05 03:25:13,595] [DEBUG] [selenium.webdriver.common.service] Started executable: `/home/dev/repos/myProject/.venv/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 4141390 using 0 to output -1
[2024-11-05 03:25:13,842] [DEBUG] [selenium.webdriver.common.service] Started executable: `/home/dev/repos/myProject/.venv/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 4141410 using 0 to output -1
[2024-11-05 03:25:13,907] [INFO] [libs.selenium_runner] Solving cloudflare...
[2024-11-05 03:25:13,908] [INFO] [libs.selenium_runner] No cloudflare coords, trying to get them...
[2024-11-05 03:25:20,914] [DEBUG] [selenium.webdriver.common.service] Started executable: `/home/dev/repos/myProject/.venv/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 4141558 using 0 to output -1
[2024-11-05 03:25:28,300] [DEBUG] [selenium.webdriver.common.service] Started executable: `/home/dev/repos/myProject/.venv/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 4142303 using 0 to output -1
[2024-11-05 03:25:28,358] [INFO] [libs.selenium_runner] Cloudflare coords: 238.0, 431.0
[2024-11-05 03:25:28,358] [ERROR] [libs.selenium_runner] ERROR reading cloudflare coords: Cloudflare not resolved. Coords read to do it manually in the next iteration
[2024-11-05 03:25:28,358] [ERROR] [libs.selenium_runner] ERROR: Cloudflare challenge error
[2024-11-05 03:25:31,848] [DEBUG] [uc] Terminating the UC browser
[2024-11-05 03:25:31,894] [DEBUG] [uc] Stopping webdriver service
[2024-11-05 03:25:31,907] [DEBUG] [uc] Successfully removed /tmp/tmpx6s469fy
[2024-11-05 03:25:31,910] [DEBUG] [selenium.webdriver.common.service] Started executable: `/home/dev/repos/myProject/.venv/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 4142394 using 0 to output -1
[2024-11-05 03:26:31,967] [DEBUG] [uc] Terminating the UC browser
[2024-11-05 03:26:32,018] [DEBUG] [uc] Stopping webdriver service

# NEXT LOOP

[2024-11-05 03:28:16,529] [DEBUG] [libs.selenium_runner] Creating a new SeleniumBase instance
[2024-11-05 03:28:16,715] [DEBUG] [selenium.webdriver.common.driver_finder] Skipping Selenium Manager; path to chrome driver specified in Service class: /home/dev/repos/myProject/.venv/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver
[2024-11-05 03:28:16,716] [DEBUG] [selenium.webdriver.common.service] Started executable: `/home/dev/repos/myProject/.venv/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 4150101 using 0 to output -1
[2024-11-05 03:28:17,055] [DEBUG] [libs.selenium_runner] Opening URL: https://www.targetUrl.com
[2024-11-05 03:28:24,306] [DEBUG] [selenium.webdriver.common.service] Started executable: `/home/dev/repos/myProject/.venv/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 4150385 using 0 to output -1
[2024-11-05 03:28:24,544] [DEBUG] [selenium.webdriver.common.service] Started executable: `/home/dev/repos/myProject/.venv/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 4150391 using 0 to output -1
[2024-11-05 03:28:24,605] [INFO] [libs.selenium_runner] Solving cloudflare...
[2024-11-05 03:28:50,186] [DEBUG] [selenium.webdriver.common.service] Started executable: `/home/dev/repos/myProject/.venv/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 4151939 using 0 to output -1
[2024-11-05 03:28:50,242] [DEBUG] [libs.selenium_runner] (DOING STUFF AFTER CLOUDFLARE SOLVED)
[2024-11-05 03:30:17,883] [DEBUG] [selenium.webdriver.common.service] Started executable: `/home/dev/repos/myProject/.venv/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 4155631 using 0 to output -1
[2024-11-05 03:30:21,611] [INFO] [libs.selenium_runner] (STUFF DONE BUT NO CLOSE YET)
[2024-11-05 03:30:28,349] [DEBUG] [selenium.webdriver.common.service] Started executable: `/home/dev/repos/myProject/.venv/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 4156441 using 0 to output -1
[2024-11-05 03:30:21,611] [INFO] [libs.selenium_runner] (MORE LOGS  BUT NO CLOSE YET)
[2024-11-05 03:30:53,043] [DEBUG] [uc] Terminating the UC browser
[2024-11-05 03:30:53,065] [DEBUG] [uc] Stopping webdriver service
[2024-11-05 03:30:53,080] [DEBUG] [uc] Successfully removed /tmp/tmpfowzd28k
[2024-11-05 03:30:54,690] [DEBUG] [selenium.webdriver.common.service] Started executable: `/home/dev/repos/myProject/.venv/lib/python3.11/site-packages/seleniumbase/drivers/uc_driver` in a child process with pid: 4157395 using 0 to output -1
[2024-11-05 03:31:54,778] [DEBUG] [uc] Terminating the UC browser
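
One way to see where the time goes is to measure the pauses between consecutive log timestamps. The following is a minimal sketch, assuming the `[YYYY-MM-DD HH:MM:SS,mmm]` prefix used in the logs above; `largest_gaps` is a hypothetical helper name, and the sample is an excerpt of the first run with the long "Started executable" line shortened.

```python
import re
from datetime import datetime

# Matches the leading "[YYYY-MM-DD HH:MM:SS,mmm]" timestamp of a log line.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")


def largest_gaps(log_text, top=3):
    """Return the `top` largest pauses as (seconds, line_before, line_after)."""
    stamped = []
    for line in log_text.splitlines():
        match = TIMESTAMP.match(line)
        if match:  # lines without a timestamp (e.g. "# NEXT LOOP") are skipped
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            stamped.append((when, line))
    gaps = [
        ((later[0] - earlier[0]).total_seconds(), earlier[1], later[1])
        for earlier, later in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda gap: gap[0], reverse=True)[:top]


# Excerpt of the log above; the executable path is shortened for readability.
sample = """\
[2024-11-05 03:25:31,907] [DEBUG] [uc] Successfully removed /tmp/tmpx6s469fy
[2024-11-05 03:25:31,910] [DEBUG] [selenium.webdriver.common.service] Started executable: (path shortened)
[2024-11-05 03:26:31,967] [DEBUG] [uc] Terminating the UC browser
[2024-11-05 03:26:32,018] [DEBUG] [uc] Stopping webdriver service
"""

for seconds, before, after in largest_gaps(sample, top=1):
    print(f"{seconds:.3f}s pause")
    print(f"  after:  {before}")
    print(f"  before: {after}")
```

On this excerpt the largest pause is the roughly 60-second one between a new `uc_driver` process starting and the next "Terminating the UC browser" message. The same pattern appears at the end of the second run.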

Summary

I want to avoid the 1 or 2 minutes, plus the RAM consumption, required to refresh the whole driver. The docs I found do not work for this. (The proxy is not the issue; it is already rotating.)

Any tips or tricks?

Stack Overflow and the docs were not useful for this case, or I didn't find the right ones.

mdmintz commented 3 weeks ago

It looks like you aren't using the special methods like uc_gui_click_captcha() to handle CF CAPTCHAs. Also, there's a new CDP Mode, which is more advanced than regular UC Mode.

CDP Mode is activated by calling sb.activate_cdp_mode(url) from UC Mode. For example:

from seleniumbase import SB

with SB(uc=True, test=True, locale_code="en") as sb:
    url = "https://gitlab.com/users/sign_in"
    sb.activate_cdp_mode(url)
    sb.uc_gui_click_captcha()
    sb.assert_text("Username", '[for="user_login"]', timeout=3)
    sb.assert_element('label[for="user_login"]')
    sb.highlight('button:contains("Sign in")')
    sb.highlight('h1:contains("GitLab.com")')
    sb.post_message("SeleniumBase wasn't detected", duration=4)
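
Applied to the loop from the question, the suggestion above might look like the sketch below. It is an illustration under assumptions, not a confirmed fix: `crawl_once` and `run` are hypothetical helper names, the proxy strings are placeholders, and the thread does not say whether a single browser can safely be reused across iterations, so this version still starts a fresh browser per proxy and leaves teardown to the `with` block.

```python
def crawl_once(sb, url):
    """Open `url` in CDP Mode, then try the CAPTCHA click from the reply above."""
    sb.activate_cdp_mode(url)
    sb.uc_gui_click_captcha()
    # Page-specific scraping logic would go here.


def run(url, proxies):
    """Start one UC Mode browser per proxy string and crawl `url` with it."""
    # Imported here so crawl_once can be exercised without SeleniumBase installed.
    from seleniumbase import SB

    for proxy in proxies:
        # Leaving the `with` block ends this browser session.
        with SB(uc=True, test=True, locale_code="en", proxy=proxy) as sb:
            crawl_once(sb, url)
```

A call such as `run("https://www.targetUrl.com", ["host1:port", "host2:port"])` would then visit the site once per proxy. The host and port values are placeholders to be replaced with real proxy strings.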