Open kevintanhongann opened 5 days ago
+1 facing same issue
I encountered this too. It's an easy fix, so I made a pull request. As a local workaround, I believe you can clone the repository and make a one-line change in markdown_generation_strategy.py at line 104. Change
fit_html=filtered_html
to be
fit_html=filtered_html or None
Then in the cloned repository folder, do
pip install -e .
This updates your local crawl4ai install with the fix. (Once crawl4ai releases an updated package, you will need to reinstall the official package.)
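For anyone wondering where this kind of error comes from: the message reported in this thread is the text Python 3.11+ uses for an UnboundLocalError. Below is a minimal sketch of how that error class can arise and how binding a default avoids it. The function names and the "content filter" flag are hypothetical stand-ins for illustration, not crawl4ai's actual code.

```python
from typing import Optional


def generate_buggy(content_filter_enabled: bool) -> Optional[str]:
    # Hypothetical stand-in, NOT crawl4ai's real function.
    if content_filter_enabled:
        filtered_html = "<p>kept</p>"
    # When the branch above is skipped, filtered_html was never bound,
    # so reading it here raises UnboundLocalError.
    return filtered_html


def generate_fixed(content_filter_enabled: bool) -> Optional[str]:
    # Binding a default before the branch guarantees the name exists.
    filtered_html: Optional[str] = None
    if content_filter_enabled:
        filtered_html = "<p>kept</p>"
    return filtered_html


def demo() -> None:
    try:
        generate_buggy(False)
    except UnboundLocalError as exc:
        print(f"buggy: {exc}")
    print(f"fixed: {generate_fixed(False)}")


if __name__ == "__main__":
    demo()
```

The general pattern is that any local variable assigned only inside a conditional branch needs a default binding if it is read after the branch.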
Also getting this error as well.
Facing this error too +1
@kevintanhongann @chanmathew @adam-pb @leonson @b-sai Hello everybody, I made some changes, and the code now runs without any issues. Please wait for the new version, 0.3.743, which I am releasing tonight. When you use it, please pay attention to the code below, because some of the code you shared is not correct.
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
    ) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            cache_mode=CacheMode.BYPASS,
        )
        print(len(result.markdown_v2.raw_markdown))
        # For compatibility with previous versions, still you can have it like below:
        # print(len(result.markdown))

if __name__ == "__main__":
    asyncio.run(main())
As you can see, you do not need to pass anything. By the way, I suggest checking result.markdown_v2.markdown_with_citations and result.markdown_v2.references_markdown.
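Since the comment above mentions both the new result.markdown_v2 fields and the older result.markdown, here is a small sketch of a helper that reads whichever is available. The helper name pick_markdown and the prefer_fit flag are hypothetical. The field names are taken from this thread, and the SimpleNamespace objects are stand-ins for a real crawl result, so this is an illustration rather than part of the crawl4ai API.

```python
from types import SimpleNamespace
from typing import Any


def pick_markdown(result: Any, prefer_fit: bool = False) -> str:
    """Return markdown text from a result-like object, newest field first."""
    v2 = getattr(result, "markdown_v2", None)
    if v2 is not None:
        # Optionally try the experimental fit_markdown before raw_markdown.
        fields = ("fit_markdown", "raw_markdown") if prefer_fit else ("raw_markdown",)
        for field in fields:
            text = getattr(v2, field, None)
            if text:
                return text
    # Fall back to the legacy attribute, or an empty string if nothing is set.
    return getattr(result, "markdown", None) or ""


if __name__ == "__main__":
    # Stand-in objects that only mimic the attribute layout described above.
    new_style = SimpleNamespace(
        markdown="legacy text",
        markdown_v2=SimpleNamespace(raw_markdown="raw text", fit_markdown=""),
    )
    old_style = SimpleNamespace(markdown="legacy text")
    print(pick_markdown(new_style))                   # uses raw_markdown
    print(pick_markdown(new_style, prefer_fit=True))  # empty fit, falls back to raw
    print(pick_markdown(old_style))                   # uses legacy markdown
```

Using getattr with a default keeps the same calling code working whether or not markdown_v2 is present on the object.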
To set the markdown generator strategy, you can use this code:
result = await crawler.arun(
    url="https://docs.micronaut.io/4.7.6/guide/",
    cache_mode=CacheMode.BYPASS,
    markdown_generator=DefaultMarkdownGenerator()
)
One more thing: we're working on an experimental feature called Fit Markdown. It produces a subset of the main Markdown with less noise by trying to remove whatever is not relevant to the main purpose of the page. To activate it, follow the code below, but remember that it is experimental.
async with AsyncWebCrawler(
    headless=True,
    verbose=True,
) as crawler:
    result = await crawler.arun(
        url="https://docs.micronaut.io/4.7.6/guide/",
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
        ),
    )
    print(len(result.markdown_v2.fit_markdown))
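To give a feel for what a BM25 threshold does in general, here is a toy, pure-Python sketch that scores text chunks against a query with the standard Okapi BM25 formula and keeps the chunks scoring at or above a threshold. This is an assumption-laden illustration, not crawl4ai's implementation: the function names, the tokenizer, the k1 and b defaults, and the way the threshold is compared are all choices made for this example, and how the library derives a query when user_query=None is not covered here.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Simplistic tokenizer for the example: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, chunks: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every chunk against the query with Okapi BM25."""
    docs = [tokenize(chunk) for chunk in chunks]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs if n_docs else 0.0

    # Document frequency: number of chunks containing each term.
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    query_terms = set(tokenize(query))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


def filter_chunks(query: str, chunks: List[str], threshold: float = 1.0) -> List[str]:
    # Keep only the chunks whose score reaches the threshold.
    return [c for c, s in zip(chunks, bm25_scores(query, chunks)) if s >= threshold]


if __name__ == "__main__":
    page_chunks = [
        "Micronaut dependency injection guide for beans",
        "Cookie banner accept all cookies",
        "Footer links and copyright notice",
    ]
    print(filter_chunks("dependency injection", page_chunks))
```

In this sketch a higher threshold keeps fewer chunks, which is the general trade-off between less noise and the risk of dropping relevant content.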
By the way, this is quite a long document 😅: the extracted markdown is 1,166,105 characters long, and scraping took around 20 seconds, which is pretty fast for a document of that size. Anyway, let me know if you run into any issues.
I was scraping this documentation site and it returns this error:
Error using new markdown generation strategy: cannot access local variable 'filtered_html' where it is not associated with a value
Is there a workaround for this? Thanks.