Open crelocks opened 3 days ago
Couldn't wait for your response, so I tried it myself and it worked. Posting it here for anyone else looking for a way to bypass Cloudflare: use the logic above to get the HTML, then the function call below to parse it.
from crawl4ai.chunking_strategy import RegexChunking

# aprocess_html is a coroutine, so it must be awaited
result = await crawler.aprocess_html(
    url=url,
    html=html_data,
    css_selector=css_selector,
    extracted_content=None,
    word_count_threshold=0,
    chunking_strategy=RegexChunking(),
    screenshot=False,
    verbose=True,
    is_cached=False,
    extraction_strategy=extraction_strategy,
)
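For anyone wiring this up end to end, here is a minimal sketch of the whole flow. It assumes flare-bypasser exposes a FlareSolverr-compatible HTTP endpoint (the localhost URL, command name, and solution/response fields are assumptions about that API; adjust to your deployment), and css_selector / extraction_strategy are placeholders you would supply yourself:

import asyncio

import requests

from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import RegexChunking


async def parse_bypassed_html(url: str):
    # Assumed: flare-bypasser running locally with a FlareSolverr-compatible API;
    # the port, command name, and response fields below may differ in your setup.
    resp = requests.post(
        "http://localhost:8191/v1",
        json={"cmd": "request.get", "url": url},
        timeout=120,
    )
    html_data = resp.json()["solution"]["response"]  # assumed response shape

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.aprocess_html(
            url=url,
            html=html_data,
            css_selector=None,          # placeholder: supply your selector
            extracted_content=None,
            word_count_threshold=0,
            chunking_strategy=RegexChunking(),
            screenshot=False,
            verbose=True,
            is_cached=False,
            extraction_strategy=None,   # placeholder: supply your strategy
        )
    return result


# asyncio.run(parse_bypassed_html("https://icvcm.org/news-insights/"))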
Closing this for now
@crelocks Thanks for using our library. I'm glad you requested offline crawling; it's been suggested by other users as well. The idea is to pass a local file by using its path as the URL, or to pass the HTML as a direct string. We'll implement this feature soon. In the meantime, could you share an example link that caused issues for you and required a flare bypasser? We've made recent changes, including the managed-browser ability, which allows crawling with a custom browser using your own Chrome or browser profile. If you provide an example, I'll demonstrate a few ways to do it, and we'll also work on implementing your requested feature.
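To make the proposal concrete, here is a sketch of what that interface could look like; the file:// scheme and the raw: prefix are illustrative assumptions about the planned feature, not a documented API at the time of this comment:

import asyncio

from crawl4ai import AsyncWebCrawler


async def offline_examples():
    async with AsyncWebCrawler() as crawler:
        # Assumed: a local file crawled by using its path as the URL
        result = await crawler.arun(url="file:///path/to/saved_page.html")
        print(len(result.markdown))

        # Assumed: raw HTML passed as a direct string via a URL prefix
        raw_html = "<html><body><h1>Hello</h1></body></html>"
        result = await crawler.arun(url="raw:" + raw_html)
        print(result.markdown)


asyncio.run(offline_examples())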
This is one such example: https://icvcm.org/news-insights/
@crelocks In version 0.3.74, it crawled beautifully in my test. Stay tuned for the release within a day or two, then test it and let me know.
import os
import asyncio
from pathlib import Path

from crawl4ai import AsyncWebCrawler


async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    os.makedirs(user_data_dir, exist_ok=True)

    async with AsyncWebCrawler(
        verbose=True,
        headless=True,
        # Optional: switch to persistent mode for much stronger anti-bot resistance
        # user_data_dir=user_data_dir,
        # use_persistent_context=True,
        # Optional: set custom headers
        # headers={
        #     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        #     "Accept-Language": "en-US,en;q=0.5",
        #     "Accept-Encoding": "gzip, deflate, br",
        #     "DNT": "1",
        #     "Connection": "keep-alive",
        #     "Upgrade-Insecure-Requests": "1",
        #     "Sec-Fetch-Dest": "document",
        #     "Sec-Fetch-Mode": "navigate",
        #     "Sec-Fetch-Site": "none",
        #     "Sec-Fetch-User": "?1",
        #     "Cache-Control": "max-age=0",
        # },
    ) as crawler:
        url = "https://icvcm.org/news-insights/"
        result = await crawler.arun(
            url,
            bypass_cache=True,
            magic=True,  # This is important
        )
        assert result.success, f"Failed to crawl {url}: {result.error_message}"
        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")


asyncio.run(test_news_crawl())
I have a specific use case where Cloudflare blocks often prevent successful crawling, making it challenging to bypass with crawl4ai. To handle this, we tried using flare-bypasser to retrieve the raw HTML content, including scripts and other assets.

Feature Request

Rather than providing a URL to crawl4ai, I'd like to directly feed in the raw HTML content retrieved through flare-bypasser. The idea is for crawl4ai to apply the usual schema, strategy, and processing as it would for a URL-based input.

Existing Solution Exploration

I came across a method named aprocess_html in the source code: async_webcrawler.py#L175. However, this method isn't documented, and I'm unsure if it would support this functionality or if it's meant for a different purpose.

Could you let me know if there's a way to directly pass raw HTML content to crawl4ai? Additionally, if aprocess_html is relevant here, could you provide documentation or guidance on how to use it for this purpose?

Thank you!