unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM-Friendly Web Crawler & Scraper
Apache License 2.0

Remove Headers, Footers, External Links and their related data #181

Open syed-al opened 1 month ago

syed-al commented 1 month ago

Hi, thanks for this great work.

I have been playing around with this to crawl web pages and get their content in markdown format, which can then be given to LLMs for grounding. But when I use it on news articles, I get lots of unnecessary content: headers, footers, nav options, and links to other articles. More than 50% of the output is not the actual article.

Yes, I can use LLMExtraction, but that would increase the bill tremendously, since the input is 10-20 web articles, each around 5,000-7,000 tokens. I saw that one option is to provide specific elements, but I want the crawler to be generic so it works on any website, so I don't have fixed elements I can rely on to remove the header, nav, and footer. Is there any way the Playwright extractor can get just the actual content of the page? I understand this may not be perfect, but I want to reduce the extra content as much as possible while staying generic across all websites.

P.S.: I would love to contribute to this project. I am not very experienced on the JS/TS side, but I am pretty confident on the Python side of things. Is there a Discord/Slack/Telegram group I can join to discuss how to contribute?

mentaLwz commented 1 month ago

I have the same need. Perhaps you can try something like this to get the main content; I am using it for my own needs now:

def clean_content(content):
    # Find the start of the content (first # title)
    start_index = content.find('#')
    if start_index == -1:
        return ""  # No title found

    # Find the end of the content (next ## title)
    end_index = content.find('##', start_index + 1)
    if end_index == -1:
        # If no ## title found, return until the end
        return content[start_index:]
    else:
        return content[start_index:end_index].strip()

# crawl_url is presumably the commenter's own async helper that returns the page markdown
article['markdown_content'] = asyncio.run(crawl_url(article['source_url']))
article['markdown_content'] = clean_content(article['markdown_content'])

syed-al commented 1 month ago

I too started with this, but the problem is that many times the title is not h2 (##); sometimes it is h1 (#) or h3 (###), etc. Also, ads, signup banners, and other links can contain ## as well, which pulls all of that content in too. I think it's difficult to extract the content from generic pages deterministically without LLMs.

mentaLwz commented 1 month ago

Agreed.

unclecode commented 1 month ago

@syed-al Thanks for using Crawl4AI! Could you please share the URL you're trying to get the best markdown out of? There are a bunch of things I can share to help you achieve that; producing as much AI-friendly data as possible without using an LLM has been one of our goals. Give me the link and I'll give it a try. I'd also be more than happy to invite you to become one of our collaborators. Would you kindly share your email address? I can then send you an invitation to our Discord channel. Perhaps this could be one of the areas where you can help, and we can work together.

QuangTQV commented 1 month ago


I also need to remove unnecessary content. It would help to have additional parameters, such as HTML selectors for excluding specific tags like header, footer, and similar elements.

QuangTQV commented 1 month ago


Clean the HTML first and then convert the HTML to markdown.
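
A minimal sketch of that approach, assuming only the beautifulsoup4 and html2text packages (generic Python, not part of the Crawl4AI API): strip the obvious chrome tags from the raw HTML, then convert what is left to markdown.

from bs4 import BeautifulSoup
import html2text

def html_to_clean_markdown(raw_html: str) -> str:
    # Hedged sketch: drop structural chrome that rarely carries article content.
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup.find_all(["header", "footer", "nav", "aside", "form", "script", "style"]):
        tag.decompose()
    # Convert the remaining HTML to markdown.
    converter = html2text.HTML2Text()
    converter.ignore_links = True   # keep anchor text, drop link targets
    converter.ignore_images = True
    return converter.handle(str(soup))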

unclecode commented 1 month ago

@syed-al @mentaLwz @QuangTQV There are multiple flags and parameters you can use to control the level of data cleaning you need. Please share your URL and I'll give you an example. In the meantime, here is one example that I think answers your question (this will be available in 0.3.72):

import asyncio
import os

from crawl4ai import AsyncWebCrawler

__data = "output"  # assumed output directory; defined elsewhere in the author's setup

async def main():
    async with AsyncWebCrawler(headless=True, sleep_on_close=True) as crawler:
        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
        result = await crawler.arun(
            url=url,
            # bypass_cache=True,
            word_count_threshold=10,
            excluded_tags=['form'],           # Optional - default is None; adds more control over the content extracted for markdown
            exclude_external_links=False,     # Default is True
            exclude_social_media_links=True,  # Default is True
            exclude_external_images=True,     # Default is False
            # social_media_domains=["facebook.com", "twitter.com", "instagram.com", ...]  # Add more domains; default supported domains are in config.py
            html2text={
                "escape_dot": False,
                # Add more options here
            },
        )
        # Save markdown to file
        with open(os.path.join(__data, "mexico_places.md"), "w") as f:
            f.write(result.markdown)

    print("Done")

# asyncio.run(main())

Here you can see some flags and parameters that let you focus on the content relevant to what you're searching for. The example is a website about places to visit in Mexico, and you can see the excluded_tags list you can pass. There are also parameters to exclude or include external links, and the same goes for social media links such as Facebook, Pinterest, and so on. You can pass extra domains that you want excluded from the links, and the same goes for images.

Another thing you can see here is the word count threshold. By setting this to a number like 10, you exclude any HTML block whose text contains fewer than 10 words. This is a very useful way of removing unnecessary text, and you can tune it depending on your needs.

Finally, our markdown generator uses html2text, and you can override some of its standard options for more advanced use.
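
For the header/footer case this issue is about, the same excluded_tags list can presumably be extended with more structural tags. A sketch under that assumption, reusing only the parameter names shown in the example above:

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_clean(url: str) -> str:
    # Hedged variation of the example above: pass more structural tags to
    # excluded_tags so headers, footers, and nav bars are dropped before the
    # markdown is generated. Parameter names are taken from the snippet above.
    async with AsyncWebCrawler(headless=True) as crawler:
        result = await crawler.arun(
            url=url,
            word_count_threshold=10,
            excluded_tags=['header', 'footer', 'nav', 'aside', 'form'],
            exclude_external_links=True,
            exclude_social_media_links=True,
            exclude_external_images=True,
        )
        return result.markdown

# markdown = asyncio.run(crawl_clean("https://example.com/some-article"))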

@QuangTQV I already shared the same answer on your other issue; perhaps I'll close that issue and we can continue here. Thanks.

syed-al commented 1 month ago

Here is a sample URL to get the content from: https://www.hindustantimes.com/world-news/us-news/cnn-reporter-chokes-on-laughter-covering-trumps-x-rated-arnold-palmer-remark-live-watch-101729484003535.html

It is a news article with lots of links, extra information, other article summaries, etc., which I want to ignore, keeping only the main content of the article. The main article is around 30 lines, but the markdown I get is more than 600-700 lines.

# Create an instance of AsyncWebCrawler (basic usage, no extra flags yet)
async with AsyncWebCrawler(verbose=False) as crawler:
    # Run the crawler on a list of URLs
    results = await crawler.arun_many(
        urls=urls,
    )

For now I am using the basic usage without any extra flags. I will try the flags you mentioned.

Also, my email ID for the Discord invitation: abdksyed@gmail.com

QuangTQV commented 1 month ago


Could you provide an example with this URL? https://tiki.vn/search?q=gi%C3%A0y%20adidas

unclecode commented 1 month ago

@QuangTQV For pages with repetitive patterns, I suggest using the JSON CSS extraction strategy. Let's look at the following code, which crawls one page and turns it into a list of JSON objects. Your case does not require markdown extraction at all.

import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

__data__ = "output"  # assumed output directory; defined elsewhere in the author's setup

async def main():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    # Schema describing one repeated product card and the fields to pull out of it
    schema = {
        "name": "Tiki Shoes",
        "baseSelector": ".CatalogProducts__Wrapper-sc-1r8ct7c-0 > div",
        "fields": [
            {
                "name": "image",
                "selector": "picture > img",
                "type": "attribute",
                "attribute": "src",
            },
            {
                "name": "price",
                "selector": ".price-discount__price",
                "type": "text",
            },
            {
                "name": "brand",
                "selector": ".above-product-name-info",
                "type": "text",
            },
            {
                "name": "description",
                "selector": "h3",
                "type": "text",
            },
        ],
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        result = await crawler.arun(
            url="https://tiki.vn/search?q=gi%C3%A0y%20adidas",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
            # delay_before_return_html=1,
            wait_for="css:.CatalogProducts__Wrapper-sc-1r8ct7c-0 picture > img",  # Important: makes sure the dynamic data is loaded and available
            magic=True,
        )

        items = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(items)} items")
        print(json.dumps(items[0], indent=2))

        with open(os.path.join(__data__, "tiki_shoes.json"), "w") as f:
            f.write(result.extracted_content)

# asyncio.run(main())

Output:

[{
  "price": "971.000\u20ab",
  "brand": "BITI'S",
  "description": "Gi\u00e0y Th\u1ec3 Thao Nam - N\u1eef Biti's Hunter X - 2K22 - Midnight III DSUH00502DEN (\u0110en)"
}, ...]

QuangTQV commented 1 month ago


I want to extract from all URLs; this is just an example URL. How can I remove redundant data?

unclecode commented 1 month ago

@syed-al I just added a new heuristic function that can produce much better markdown, which I call "fit markdown." The way to use it is shown in the code below. I'm also sharing the markdown output with you: it's much cleaner, and I really like it. It works well for pages like the example you shared, and it will be released soon in the new version.

import asyncio
import os

from crawl4ai import AsyncWebCrawler

__data__ = "output"  # assumed output directory; defined elsewhere in the author's setup

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            word_count_threshold=10,
        )
        # Save the heuristic "fit markdown" to file
        with open(os.path.join(__data__, "mexico_places.md"), "w") as f:
            f.write(result.fit_markdown)

    print("Done")

# asyncio.run(main())

Everything remains the same; you simply read the fit_markdown property of the crawl result. I have also attached the markdown here. The result is very clean and contains exactly the main part. By the way, it's still under test; perhaps this can be the first task you help with after joining the Discord.

mexico_places.md

QuangTQV commented 1 month ago


@unclecode Can you help me?

syed-al commented 1 month ago

Hi @unclecode

When will the new version be released on pip?

Also, I'm still waiting for the Discord invite: abdksyed@gmail.com

unclecode commented 1 month ago

@syed-al Tomorrow, Friday. Sorry for the delay; I will send the invitation before the weekend, and you are most welcome.

unclecode commented 1 month ago

@QuangTQV No worries. Let me first understand what exactly you're looking for. You said you want to do this with all URLs; what do you mean by "all URLs"? The JsonCssExtractionStrategy is a mechanism that works with one specific layout on a page, so unless all the pages you're crawling share a similar HTML structure, it won't apply directly. Are you referring to any page of any form? I don't think I understood your question properly. Please explain your task in full rather than just giving the example, and then I will put together some code to show you how it works.

QuangTQV commented 1 month ago


I want to create a chatbot for any website, so I need to crawl the website's content and then use RAG. What I need is to save costs: the more junk content I can eliminate while crawling, the better. And of course, each website has a different layout, so I can't use a fixed regex or selector.
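
A rough sketch of how the pieces discussed in this thread could be combined for that use case: arun_many with the content-filter flags shown earlier, falling back to the regular markdown when fit_markdown is not yet available. Passing these parameters through arun_many is an assumption here, not confirmed API.

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_for_rag(urls: list[str]) -> list[str]:
    # Hedged sketch: crawl many pages with the junk-removal flags from this
    # thread and keep the cleanest markdown available for RAG ingestion.
    async with AsyncWebCrawler(headless=True) as crawler:
        results = await crawler.arun_many(
            urls=urls,
            word_count_threshold=10,
            excluded_tags=['header', 'footer', 'nav', 'aside', 'form'],
            exclude_external_links=True,
            exclude_social_media_links=True,
        )
        # Prefer fit_markdown (mentioned by @unclecode above) when present.
        return [getattr(r, "fit_markdown", None) or r.markdown for r in results]

# docs = asyncio.run(crawl_for_rag(["https://example.com/a", "https://example.com/b"]))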

unclecode commented 3 weeks ago

@syed-al Hi, sorry for the delay; the last two weeks have been very hectic, but all is good now. I just sent the invitation link to you. Looking forward to seeing you on the other side and working on this smart fit markdown ;)