unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Apache License 2.0

Extract only the cleaned markdown #186

Closed QuangTQV closed 1 month ago

QuangTQV commented 1 month ago

How can I extract only the cleaned markdown? Right now it contains headers, footers, advertisements, etc.

unclecode commented 1 month ago

@QuangTQV There are several flags and parameters you can use to control how aggressively the content is cleaned. Please share your URL and I'll give you a tailored example. In the meantime, here is one example that I think will answer your question:

import asyncio
import os

from crawl4ai import AsyncWebCrawler

# Output directory. `__data` was not defined in the original snippet, so this path is an assumption.
__data = os.path.join(os.getcwd(), "data")

async def main():
    os.makedirs(__data, exist_ok=True)
    async with AsyncWebCrawler(headless=True, sleep_on_close=True) as crawler:
        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
        result = await crawler.arun(
            url=url,
            # bypass_cache=True,
            word_count_threshold=10,
            excluded_tags=['form'],  # Optional - Default is None, this adds more control over the content extraction for markdown
            exclude_external_links=False,  # Default is True
            exclude_social_media_links=True,  # Default is True
            exclude_external_images=True,  # Default is False
            # social_media_domains = ["facebook.com", "twitter.com", "instagram.com", ...] Here you can add more domains, default supported domains are in config.py

            html2text={
                "escape_dot": False,
                # Add more options here
            }
        )
        # Save markdown to file
        with open(os.path.join(__data, "mexico_places.md"), "w", encoding="utf-8") as f:
            f.write(result.markdown)

    print("Done")

asyncio.run(main())

The example above shows some of the flags and parameters that let you focus on the content relevant to what you're looking for. The target is a website about places to visit in Mexico. With excluded_tags you can pass a list of HTML tags to drop (here, form). There are parameters to exclude or keep external links, and likewise for social media links such as Facebook, Pinterest, and so on. You can also pass extra domains whose links you want excluded, and the same goes for images. Our markdown generator uses html2text, and you can override some of its standard options through the html2text parameter for more advanced control. That said, please share your links and I'll provide customized code for your specific case.
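To make the link flags a bit more concrete, here is a rough standalone sketch of the general idea behind excluding external and social media links. It is only an illustration using the Python standard library, not Crawl4AI's actual implementation: the function name filter_links, its parameters, and the domain list are all hypothetical (the real defaults live in config.py).

```python
from urllib.parse import urlparse

# Hypothetical list for illustration only; Crawl4AI keeps its own defaults in config.py.
SOCIAL_MEDIA_DOMAINS = {"facebook.com", "twitter.com", "instagram.com", "pinterest.com"}


def _host(url):
    """Return the lowercased hostname without a leading 'www.' ('' for relative URLs)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def _matches(host, domain):
    """True if host equals domain or is a subdomain of it."""
    return host == domain or host.endswith("." + domain)


def filter_links(page_url, links, exclude_external=False, exclude_social_media=True,
                 social_media_domains=SOCIAL_MEDIA_DOMAINS):
    """Keep only the links allowed by the two (hypothetical) exclusion flags."""
    page_host = _host(page_url)
    kept = []
    for link in links:
        host = _host(link)
        # Relative links have no host and are treated as internal.
        is_external = bool(host) and not _matches(host, page_host)
        is_social = any(_matches(host, d) for d in social_media_domains)
        if exclude_social_media and is_social:
            continue
        if exclude_external and is_external:
            continue
        kept.append(link)
    return kept


if __name__ == "__main__":
    page = "https://janineintheworld.com/places-to-visit-in-central-mexico"
    links = [
        "https://janineintheworld.com/about",
        "/contact",
        "https://www.facebook.com/somepage",
        "https://example.org/guide",
    ]
    # Social media links are dropped; other external links are kept.
    print(filter_links(page, links))
```

The sketch checks the social media list separately from the external check, so a social link can be dropped even when other external links are kept, which mirrors how the two flags are set independently in the example above.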

unclecode commented 1 month ago

I'm closing this issue; please follow the discussion at https://github.com/unclecode/crawl4ai/issues/181