unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper
Apache License 2.0

Support for Direct HTML Parsing in crawl4ai #253

Open crelocks opened 3 days ago

crelocks commented 3 days ago

I have a specific use case where Cloudflare blocks often prevent successful crawling and are difficult to bypass with crawl4ai. To work around this, we tried using flare-bypasser to retrieve the raw HTML content, including scripts and other assets.

Feature Request

Rather than providing a URL to crawl4ai, I’d like to directly feed in the raw HTML content retrieved through flare-bypasser. The idea is for crawl4ai to apply the usual schema, strategy, and processing as it would for a URL-based input.

Existing Solution Exploration

I came across a method named aprocess_html in the source code: async_webcrawler.py#L175. However, this method isn't documented, and I'm unsure if it would support this functionality or if it’s meant for a different purpose.

Could you let me know if there's a way to directly pass raw HTML content to crawl4ai? Additionally, if aprocess_html is relevant here, could you provide documentation or guidance on how to use it for this purpose?

Thank you!

crelocks commented 2 days ago

Couldn't wait for a response, so I tried it myself and it worked. Posting it here for anyone else looking for a way to bypass Cloudflare: use the logic above to get the HTML, then the function call below to parse it.

from crawl4ai.chunking_strategy import RegexChunking

# aprocess_html is a coroutine, so await it from async code
result = await crawler.aprocess_html(
    url=url,
    html=html_data,
    css_selector=css_selector,
    extracted_content=None,
    word_count_threshold=0,
    chunking_strategy=RegexChunking(),
    screenshot=False,
    verbose=True,
    is_cached=False,
    extraction_strategy=extraction_strategy,
)
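
For anyone wiring this up end to end, here is a minimal sketch of the full flow. The endpoint and response shape used in fetch_html_via_bypasser are assumptions (a FlareSolverr-style API on localhost; check the flare-bypasser docs for the actual interface), and css_selector / extraction_strategy are expected to be defined to match your schema; only the aprocess_html call itself comes from the snippet above.

import asyncio
import requests

from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import RegexChunking


def fetch_html_via_bypasser(url: str) -> str:
    # Assumption: flare-bypasser runs locally and accepts a FlareSolverr-style
    # payload; adjust the endpoint and response parsing to the service you run.
    resp = requests.post(
        "http://localhost:8080/v1",
        json={"cmd": "request.get", "url": url, "maxTimeout": 60000},
        timeout=90,
    )
    resp.raise_for_status()
    return resp.json()["solution"]["response"]


async def parse_prefetched_html(url: str, css_selector, extraction_strategy):
    html_data = fetch_html_via_bypasser(url)
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Same call as above: crawl4ai processes the pre-fetched HTML
        # instead of fetching the URL itself
        return await crawler.aprocess_html(
            url=url,
            html=html_data,
            css_selector=css_selector,
            extracted_content=None,
            word_count_threshold=0,
            chunking_strategy=RegexChunking(),
            screenshot=False,
            verbose=True,
            is_cached=False,
            extraction_strategy=extraction_strategy,
        )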

Closing this for now

unclecode commented 2 days ago

@crelocks Thanks for using our library. I'm glad you requested offline crawling; other users have suggested it as well. The idea is to pass a local file by using its path as the URL, or to pass the HTML as a direct string, and we'll implement this feature soon. In the meantime, could you share an example link that caused issues for you and required a flare bypasser? We've made recent changes, including a managed-browser capability that lets you crawl with a custom browser using your own Chrome or browser profile. If you provide an example, I'll demonstrate a few ways to do it, and we'll also work on implementing your requested feature.
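
As a rough idea of what that could look like once implemented, here is a sketch; the file:// path usage and the raw: prefix below are assumptions about the planned API, not current behavior:

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_offline_sources():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Assumption: a local file passed via its path as the URL
        local_result = await crawler.arun(url="file:///path/to/saved_page.html")

        # Assumption: raw HTML passed as a direct string behind a "raw:" prefix
        raw_html = "<html><body><h1>Hello</h1></body></html>"
        raw_result = await crawler.arun(url="raw:" + raw_html)

        print(len(local_result.markdown), len(raw_result.markdown))

asyncio.run(crawl_offline_sources())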

crelocks commented 2 days ago

https://icvcm.org/news-insights/ This is one such example.

unclecode commented 1 day ago

@crelocks In version 0.3.74, it crawled beautifully in my test. Stay tuned for the release within a day or two, then test it and let me know.

import os, sys
from pathlib import Path
import asyncio, time
from crawl4ai import AsyncWebCrawler

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    os.makedirs(user_data_dir, exist_ok=True)

    async with AsyncWebCrawler(
        verbose=True,
        headless=True,
        # Optional: You can switch to persistent mode for much stronger anti-bot protection
        # user_data_dir=user_data_dir,
        # use_persistent_context=True,
        # Optional: You can set custom headers
        # headers={
        #     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        #     "Accept-Language": "en-US,en;q=0.5",
        #     "Accept-Encoding": "gzip, deflate, br",
        #     "DNT": "1",
        #     "Connection": "keep-alive",
        #     "Upgrade-Insecure-Requests": "1",
        #     "Sec-Fetch-Dest": "document",
        #     "Sec-Fetch-Mode": "navigate",
        #     "Sec-Fetch-Site": "none",
        #     "Sec-Fetch-User": "?1",
        #     "Cache-Control": "max-age=0",
        # }
    ) as crawler:
        url = "https://icvcm.org/news-insights/"
        result = await crawler.arun(
            url,
            bypass_cache=True,
            magic=True, # This is important
        )

        assert result.success, f"Failed to crawl {url}: {result.error_message}"
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")