unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Apache License 2.0

Could it be possible to take control of the user's own browser instead? #214

Closed: BZBY closed this issue 2 weeks ago

BZBY commented 3 weeks ago

Some websites show CAPTCHAs repeatedly. Playwright already has a feature to take control of the user's own browser, for example by launching Chrome from the CMD terminal with remote debugging enabled, as follows:

cd "C:\Program Files\Google\Chrome\Application" && chrome.exe ^
--remote-debugging-port=9222 ^
--user-data-dir="C:\playwright\user_data"

Can crawl4ai support this? This would significantly reduce the need to write code to handle CAPTCHAs, and it would be great if this functionality could be implemented!
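
For context, here is a minimal sketch of the Playwright side of this, assuming Chrome is already running with --remote-debugging-port=9222 as in the command above. connect_over_cdp is Playwright's own API; the helper names cdp_endpoint and attach_and_get_title and the localhost address are illustrative assumptions, not anything from crawl4ai.

```python
def cdp_endpoint(port: int = 9222, host: str = "localhost") -> str:
    """Build the endpoint URL of a browser started with --remote-debugging-port."""
    return f"http://{host}:{port}"


async def attach_and_get_title(url: str, port: int = 9222) -> str:
    """Attach to an already-running browser over CDP and return a page title."""
    # Imported lazily so cdp_endpoint() works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        # Attach to the running browser instead of launching a new one.
        browser = await p.chromium.connect_over_cdp(cdp_endpoint(port))
        # Reuse the browser's existing context (profile, cookies) if it has one.
        context = browser.contexts[0] if browser.contexts else await browser.new_context()
        page = await context.new_page()
        await page.goto(url)
        title = await page.title()
        await page.close()
        return title

# Usage (needs a browser listening on the port):
#   import asyncio
#   print(asyncio.run(attach_and_get_title("https://crawl4ai.com")))
```

Because the page opens inside the user's existing profile, any login or CAPTCHA already solved by hand in that browser should carry over, which is the point of the request.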

unclecode commented 2 weeks ago

Hi @BZBY, thank you so much for the suggestion; I really appreciate it. I've implemented the changes; see the code below. There's a new flag, use_managed_browser, that lets the crawler take control of the user's own Chromium browser, similar to your suggestion. It also accepts a specific user data directory, which opens up a lot of possibilities. For example, you can copy part of your current user data into this folder and then pass it. I've tested this on a few websites that were previously challenging to crawl, and they are now easier to handle.
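
The step of copying part of your current user data could be scripted roughly as follows. This is a sketch, not crawl4ai code: seed_user_data_dir is a hypothetical helper, and the default entries to copy (the Default profile folder and the Local State file, which Chromium-based browsers typically keep in their user data directory) are an assumption to adjust for your setup. Close the browser before copying so that no files are locked.

```python
import shutil
from pathlib import Path


def seed_user_data_dir(source: Path, target: Path,
                       entries: tuple = ("Default", "Local State")) -> list:
    """Copy selected entries from an existing user data dir into another one.

    Returns the names that were actually copied; missing entries are skipped.
    """
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in entries:
        src, dst = source / name, target / name
        if src.is_dir():
            # dirs_exist_ok lets the helper be re-run on an existing target.
            shutil.copytree(src, dst, dirs_exist_ok=True)
        elif src.is_file():
            shutil.copy2(src, dst)
        else:
            continue  # entry not present in the source profile
        copied.append(name)
    return copied
```

The target folder can then be used as the dedicated user data directory, for example via Chrome's --user-data-dir option.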

This makes sense, as it's the user's own browser. Additionally, you can set the browser type, and the system automatically detects whether you're running on Mac, Windows, or Linux. Since I'm on a Mac, I haven't tested on Windows yet; maybe you can try that.
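
As an illustration of that kind of platform detection (not crawl4ai's actual implementation), here is a sketch using the standard library's platform.system(). The executable paths are typical default install locations and are assumptions; chrome_launch_command is a hypothetical helper.

```python
import platform
from typing import List, Optional

# Typical default Chrome locations; assumptions, adjust for your machine.
DEFAULT_CHROME = {
    "Darwin": "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    "Windows": r"C:\Program Files\Google\Chrome\Application\chrome.exe",
    "Linux": "google-chrome",
}


def chrome_launch_command(user_data_dir: str, port: int = 9222,
                          system: Optional[str] = None) -> List[str]:
    """Return the argument list that starts Chrome with remote debugging enabled."""
    # platform.system() reports "Darwin" on macOS, "Windows", or "Linux".
    system = system or platform.system()
    if system not in DEFAULT_CHROME:
        raise RuntimeError(f"Unsupported platform: {system}")
    return [
        DEFAULT_CHROME[system],
        f"--remote-debugging-port={port}",
        f"--user-data-dir={user_data_dir}",
    ]
```

The returned list can be handed to subprocess.Popen, which avoids the shell quoting issues that paths with spaces cause on Windows.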

These changes are in the 0.3.73 branch, and I'll push them to the new version soon. Thanks again for the suggestion. I wouldn't have thought of it. If you'd like, I'd love to invite you to our Discord to work with us and suggest more improvements for the engine.

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
        headless=True, # Set to False to see what is happening
        use_managed_browser=True,
        browser_type="chromium",
    ) as crawler:
        result = await crawler.arun(
            url="https://crawl4ai.com",
            bypass_cache=True,
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

BZBY commented 2 weeks ago

@unclecode Thanks for the update! I will try out the 0.3.73 branch and provide feedback. I'm also interested in joining the Discord!

unclecode commented 2 weeks ago

@BZBY You are most welcome. Please share your email address and I will send you the invitation link. Perhaps you can start helping by testing these new features.

BZBY commented 2 weeks ago

@unclecode OK, I’m already testing the new features. My email: bzbyAi@protonmail.com

unclecode commented 2 weeks ago

@BZBY already sent, welcome :)