unclecode / crawl4ai

🔥🕷️ Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
https://crawl4ai.com
Apache License 2.0

Version 0.3.74 - Output of scraped website to markdown returns an error #287

Open kevintanhongann opened 5 days ago

kevintanhongann commented 5 days ago
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            strategy="markdown"  # Use html2text strategy instead of default markdown
        )
        with open("micronaut_docs.md", "w", encoding="utf-8") as f:
            f.write(result.markdown)  # write the generated markdown to a file

if __name__ == "__main__":
    asyncio.run(main())

I was scraping this documentation site and it returns this error:

Error using new markdown generation strategy: cannot access local variable 'filtered_html' where it is not associated with a value

Is there a workaround for this? Thanks.

b-sai commented 5 days ago

+1 facing same issue

leonson commented 5 days ago

I encountered this too; it's an easy fix, so I made a pull request. To work around it locally, I believe you can clone the repository and make a one-line change in markdown_generation_strategy.py at line 104, changing

fit_html=filtered_html

to be

fit_html=filtered_html or None

Then in the cloned repository folder, do

pip install -e .

This will update your local crawl4ai with the fix (but once crawl4ai publishes an updated package, you will need to reinstall the official release).
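For context, the error is Python's UnboundLocalError: a local variable that is only assigned inside a conditional branch. A minimal, self-contained illustration of that pattern (hypothetical names, not the actual crawl4ai source):

# Hypothetical minimal reproduction of the error class, not crawl4ai code:
def generate(html, content_filter=None):
    if content_filter is not None:
        filtered_html = content_filter(html)  # `filtered_html` is bound only on this branch
    return {"fit_html": filtered_html}        # fails when no content filter was supplied

try:
    generate("<html></html>")
except UnboundLocalError as exc:
    print(exc)  # "cannot access local variable 'filtered_html' where it is not associated with a value" (Python 3.11+ wording)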

chanmathew commented 5 days ago

Getting this error as well.

adam-pb commented 3 days ago

Facing this error too +1

unclecode commented 2 days ago

@kevintanhongann @chanmathew @adam-pb @leonson @b-sai Hello everybody, I made some changes and the code now runs without any issues. Please hold on; I will release the new version tonight: 0.3.743. After that, use the code below; please pay attention to it, as some of the code you shared is not correct.

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
    ) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            cache_mode=CacheMode.BYPASS,
        )
        print(len(result.markdown_v2.raw_markdown))

        # For compatibility with previous versions, still you can have it like below:
        # print(len(result.markdown))

if __name__ == "__main__":
    asyncio.run(main())

As you can see, you do not need to pass anything. By the way, I suggest checking result.markdown_v2.markdown_with_citations and result.markdown_v2.references_markdown.
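For example, a quick addition inside the main() from the snippet above (assuming result is the object returned by arun there) lets you inspect those fields:

        md = result.markdown_v2
        print(md.markdown_with_citations[:300])  # markdown body with citation markers
        print(md.references_markdown[:300])      # collected reference list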

To set the markdown generation strategy, you can follow this code:

result = await crawler.arun(
    url="https://docs.micronaut.io/4.7.6/guide/",
    cache_mode=CacheMode.BYPASS,
    markdown_generator=DefaultMarkdownGenerator()
)

One more thing: if you want to try the experimental feature we are working on, called Fit Markdown, it produces a subset of the main Markdown with less noise, trying to remove whatever is not relevant to the main purpose of the page. To activate it, follow the code below, but remember, this is experimental.

    async with AsyncWebCrawler(
        headless=True,  
        verbose=True,
    ) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
            ),
        )
        print(len(result.markdown_v2.fit_markdown))

By the way, such a long document 😅: the extracted markdown is 1,166,105 characters, and the scraping took around 20 seconds, which is pretty fast for a document of that size. Anyway, let me know if you run into any issues.
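If you want to check those numbers on your own machine, here is a small timing sketch that uses only the API shown above:

import asyncio
import time
from crawl4ai import AsyncWebCrawler, CacheMode

async def main():
    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        start = time.perf_counter()
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            cache_mode=CacheMode.BYPASS,
        )
        elapsed = time.perf_counter() - start
        # Report markdown length and wall-clock crawl time
        print(f"{len(result.markdown_v2.raw_markdown)} characters in {elapsed:.1f}s")

if __name__ == "__main__":
    asyncio.run(main())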