seleniumbase / SeleniumBase

📊 Python's all-in-one framework for web crawling, scraping, testing, and reporting. Supports pytest. UC Mode provides stealth. Includes many tools.
https://seleniumbase.io
MIT License

Unable to run on AWS Lambda #2459

Closed andinua closed 9 months ago

andinua commented 9 months ago

Hello and thank you for building and maintaining this project.

I was previously using undetected-chromedriver but decided to give SeleniumBase a try seeing how it's actively maintained.

I am running a project that uses SeleniumBase on AWS Lambda. It does not go undetected: it triggers the website's Cloudflare Turnstile challenge. In my local Dockerized environment, I have managed to solve the challenge by switching into the iframe and clicking.

My setup

The code I use to solve the challenge, borrowed from examples seen on this repo/Issue board:

import logging

from bs4 import BeautifulSoup
from seleniumbase import SB

logger = logging.getLogger(__name__)


def load_data(url, sb):
    # Assumed structure: the steps below are taken to be the body of load_data().
    sb.driver.uc_open_with_reconnect(url, 10)
    sb.sleep(1)
    if not sb.is_element_visible('iframe[src*="challenge"]'):
        # No challenge iframe yet: retry with a fresh undetectable driver.
        logger.info("Haven't found the challenge yet...")
        sb.get_new_driver(undetectable=True)
        sb.driver.get(url)
        sb.sleep(1)
    if sb.is_element_visible('iframe[src*="challenge"]'):
        with sb.frame_switch('iframe[src*="challenge"]'):
            logger.info("Found challenge, going in")
            sb.wait_for_element("span.mark")
            if not sb.is_element_visible("span.mark"):
                print('Could not find mark')
                sb.sleep(10)
            sb.click("span.mark")
            # Take one screenshot per second to watch the challenge resolve.
            for i in range(2, 11):
                sb.sleep(1)
                sb.save_screenshot(f'screen{i}.png')
    return BeautifulSoup(sb.driver.page_source, "html.parser")


with SB(uc=True, headless=True) as sb:
    data = load_data(url, sb)

This works nicely in a local Docker container. However, when deployed on AWS Lambda, it seems to crash Chrome.


The only setup that seems to work on AWS Lambda is adding

chrome_options.add_argument("--single-process")

However, with this flag the challenge no longer gets solved, and the Turnstile ends up in a loop after the click action.

I have tried both headless and headed modes (with Xvfb enabled), same outcome.

Do you have any idea what I could be doing differently, or whether I'm approaching this correctly? Any hints are much appreciated. Thank you.

mdmintz commented 9 months ago

> The only setup that seems to work on AWS lambda is adding chrome_options.add_argument("--single-process")
> However, this no longer seems to be able to solve the challenge, and the turnstile ends up in a loop after the click action.

How are you adding it? There's chromium_arg for that. E.g.:

with SB(uc=True, chromium_arg="--single-process") as sb:

If you need more args, it takes a comma-separated string, no spaces.
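As a minimal sketch of that comma-separated format, the helper below joins a list of flags into one string suitable for chromium_arg. Note that join_chromium_args and open_page are hypothetical names, not part of SeleniumBase, and --no-sandbox in the usage line is only an illustrative second flag, not something this thread confirms is needed on Lambda.

```python
def join_chromium_args(args):
    """Join Chromium flags into one comma-separated string with no spaces."""
    # Strip stray whitespace and skip empty entries so the result stays compact.
    cleaned = [arg.strip() for arg in args if arg and arg.strip()]
    return ",".join(cleaned)


def open_page(url, args):
    """Hypothetical wrapper: open a page in UC Mode with the given flags."""
    # Imported lazily so join_chromium_args works without SeleniumBase installed.
    from seleniumbase import SB

    with SB(uc=True, chromium_arg=join_chromium_args(args)) as sb:
        sb.open(url)
        return sb.get_page_source()
```

For example, join_chromium_args(["--single-process", "--no-sandbox"]) produces the string "--single-process,--no-sandbox".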

As for evading detection on AWS Lambda, you need to use a proxy server because the AWS IP ranges are blacklisted by Cloudflare. You would hit the same issue on GitHub Actions and other well-known cloud services whose IP ranges Cloudflare has already identified.

To change proxy settings for open and authenticated proxy servers, use the proxy arg with SB():

proxy="SERVER:PORT"

proxy="USERNAME:PASSWORD@SERVER:PORT"

You'll need to provide your own proxy servers or find them online.
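To illustrate the two proxy string formats above, here is a small sketch that builds either one from its parts. The build_proxy helper and the proxy.example.com host are hypothetical and exist only for illustration.

```python
def build_proxy(server, port, username=None, password=None):
    """Build a proxy string in SERVER:PORT or USERNAME:PASSWORD@SERVER:PORT form."""
    endpoint = f"{server}:{port}"
    if username is None and password is None:
        # Open proxy: no credentials needed.
        return endpoint
    if not username or not password:
        raise ValueError("username and password must be given together")
    # Authenticated proxy: credentials go before the "@".
    return f"{username}:{password}@{endpoint}"
```

The result can then be passed along the lines of SB(uc=True, proxy=build_proxy("proxy.example.com", 8080, "user", "pass")).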

andinua commented 9 months ago

Thank you. I didn't know you could pass args that way. I have tried it now, but to no avail. My original experiment was to hack browser_launcher.py and add them there, as a proof of concept.

I will try my luck with proxies, thank you for the suggestion.

mdmintz commented 8 months ago

@andinua Also try again with the latest version of SeleniumBase. https://github.com/seleniumbase/SeleniumBase/issues/2523 was found and fixed. That may have been the root issue. Let me know if that fixes headless UC Mode.

andinua commented 8 months ago

Thanks @mdmintz, I've finally settled on using proxies and it works. But I'll try headless mode without --single-process as well.

teddy-the-steady commented 8 months ago

@andinua I'm using the same approach as you. But on AWS Lambda I get an error saying the filesystem is read-only, so chromedriver cannot be downloaded into the Lambda instance. I made a Dockerfile that uses a predefined image with the Serverless Framework so that chromedriver is already downloaded. But I couldn't figure out how to make this package use the downloaded driver instead of downloading a new one. How did you solve this?

teddy-the-steady commented 8 months ago

@andinua I spent a full day researching this. I found many people talking about the downloaded_files folder, which seleniumbase frequently tries to write to but Lambda doesn't allow. I wanted to use a pre-downloaded driver, but nothing seems to work. If you can share any workaround for this, it would be much appreciated. I assume you are a Lambda expert, so please share your knowledge with us 🙇

xtream1101 commented 4 months ago

I am using a selenium-grid cluster and running seleniumbase in an AWS Lambda. Because I am using proxies, seleniumbase creates a proxy.zip file, which caused issues with the read-only filesystem. I was able to do a hacky workaround just for this use case by running the following code before any other seleniumbase imports:

import time
import seleniumbase.core.proxy_helper

# Hacky workaround: Lambdas have a read-only filesystem, and seleniumbase wants to use the current dir by default.
# /tmp could persist across Lambda runs, so use a timestamped folder to avoid conflicting with another currently running Lambda.
# Not sure if proxy.zip is for a specific proxy or just generic files.
BASE_FOLDER = f"/tmp/{time.time()}"

seleniumbase.core.proxy_helper.DOWNLOADS_DIR = BASE_FOLDER
seleniumbase.core.proxy_helper.PROXY_ZIP_PATH = f"{BASE_FOLDER}/proxy.zip"

I was not able to find a way to override the constants globally, but found this worked to deal with the proxy.zip file that gets created.
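A variation on the workaround above, sketched under a few assumptions: it uses tempfile.mkdtemp so the folder is created up front and is unique per invocation, rather than relying on a timestamp. The DOWNLOADS_DIR and PROXY_ZIP_PATH attribute names are taken from the snippet above and have not been checked against other SeleniumBase versions; make_scratch_paths and patch_proxy_helper are hypothetical names.

```python
import os
import tempfile


def make_scratch_paths(root="/tmp"):
    """Create a unique writable folder under root and return (folder, proxy zip path)."""
    # mkdtemp creates the directory and guarantees a name no other caller has.
    base = tempfile.mkdtemp(prefix="sb_", dir=root)
    return base, os.path.join(base, "proxy.zip")


def patch_proxy_helper(root="/tmp"):
    """Point the proxy helper at a writable folder; call before other seleniumbase imports."""
    # Imported lazily so make_scratch_paths works without SeleniumBase installed.
    import seleniumbase.core.proxy_helper as proxy_helper

    base, zip_path = make_scratch_paths(root)
    # Attribute names assumed from the workaround shown in this thread.
    proxy_helper.DOWNLOADS_DIR = base
    proxy_helper.PROXY_ZIP_PATH = zip_path
    return base
```

Like the original, this only redirects the proxy.zip location; it does not address other paths that seleniumbase may try to write to.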