unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper
Apache License 2.0

Example on whole-blog crawling? #110

Closed BradKML closed 3 weeks ago

BradKML commented 1 month ago

Thanks for creating alternatives to FireCrawl for LLMs! Here is a bit of a question: are there examples or shortcuts for crawling a whole blog (which may or may not have things like Cloudflare)?

  1. How would the speed of crawling be managed such that the crawler won't be blocked?
  2. Could the out-links to other articles also be captured (just in case, for context)?
  3. How can individual articles be separated from paginated web indexes?
  4. Are there ways to hack infinite scrolling for blogs without a proper sitemap?
aravindkarnam commented 1 month ago

@BradKML We are currently working on a scraper that can crawl an entire website. It's configurable: you can set the depth level up to which you want to crawl, add filters to exclude certain URLs from crawling, etc. The features you requested are also on the roadmap. This is most likely to be released in the next version.

Agaba-Ed commented 1 month ago

Hello @aravindkarnam. I'm new to open source and excited to contribute to this project. Would it be alright if I started working on one of these features? I'm open to any of them, but would appreciate it if you assigned me a specific task.

aravindkarnam commented 1 month ago

That's great to hear @Agaba-Ed! We are currently collaborating on a Discord server. Share your email address with me, I'll invite you there, and we can look at the tasks at hand so you can pick one you like.

Agaba-Ed commented 1 month ago

Thank you @aravindkarnam ! My email is agabsed@gmail.com.

OdinMB commented 1 month ago

Idea for an advanced feature that I found valuable in some projects: Allow users to specify an LLM and a prompt describing what content they're interested in. Based on these parameters, the app can identify the most relevant pages (and PDFs) instead of just going through the full queue of URLs that the crawler found.

aravindkarnam commented 1 month ago

> Thank you @aravindkarnam ! My email is agabsed@gmail.com.

@Agaba-Ed Check your email inbox. I've sent you an invite. See you on Discord.

aravindkarnam commented 1 month ago

> Idea for an advanced feature that I found valuable in some projects: Allow users to specify an LLM and a prompt describing what content they're interested in. Based on these parameters, the app can identify the most relevant pages (and PDFs) instead of just going through the full queue of URLs that the crawler found.

That's a great idea @OdinMB. In upcoming versions we are already planning a filter for pages to crawl based on keywords in the page metadata. However, we were only planning simple string/regex-based pattern matching. We can consider LLM-and-prompt filtering as well once basic pattern-based filtering is ready.
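For reference, a minimal sketch of what that kind of string/regex pattern matching over a URL queue could look like (the function and patterns below are illustrative, not part of crawl4ai's API):

```python
import re

# Illustrative sketch of simple pattern-based URL filtering of the kind
# described above; none of these names are part of crawl4ai's API.
def filter_urls(urls, include_patterns=None, exclude_patterns=None):
    """Keep URLs that match at least one include pattern and no exclude pattern."""
    kept = []
    for url in urls:
        if exclude_patterns and any(re.search(p, url) for p in exclude_patterns):
            continue
        if include_patterns and not any(re.search(p, url) for p in include_patterns):
            continue
        kept.append(url)
    return kept

# Example: keep dated blog posts, skip tag/archive index pages.
urls = [
    "https://example.com/blog/2023/my-post",
    "https://example.com/blog/tag/python",
    "https://example.com/about",
]
print(filter_urls(urls, include_patterns=[r"/blog/\d{4}/"], exclude_patterns=[r"/tag/"]))
# -> ['https://example.com/blog/2023/my-post']
```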

BradKML commented 1 month ago

@OdinMB @aravindkarnam are you sure that's a task for an LLM, or would it be more of a job for embeddings? Also cross-referencing this, since rerankers are similar in function to embeddings: https://github.com/ollama/ollama/issues/3368

OdinMB commented 1 month ago

> @OdinMB @aravindkarnam are you sure that's a task for an LLM, or would it be more of a job for embeddings? Also cross-referencing this, since rerankers are similar in function to embeddings: ollama/ollama#3368

In many cases, regex, cosine similarity or a reranking of URLs will be sufficient. I had some use cases where a quick request to a small LLM was needed. For an analysis of NGOs, I was interested in their problem statement, solution, robustness of empirical evidence, and some other information. An embedding of a string that combines these categories would not have worked well.

BradKML commented 1 month ago

@OdinMB in this case, how would you mix cosine similarity (or another distance metric) and SLMs together? That is assuming their documents are all information-dense and sometimes even require multi-hopping between different documents or segments of documents (e.g. blogs, dialogue transcripts).

OdinMB commented 1 month ago

The idea is to prioritize URLs before you know their content, just based on the URL itself. Let's say you're looking for empirical evidence of the social impact of an organization and the queue has the following URLs:

The feature would rerank this queue so that the 2023-yearly-report is crawled first, or even discard the other URLs entirely.
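A rough sketch of that kind of URL-level reranking, scoring each URL string against a task description with an embedding model before anything is crawled (the model choice and the example URLs other than the yearly report are placeholders):

```python
# Score candidate URLs against a task description and reorder the crawl queue.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

task = "empirical evidence of the social impact of the organization"
urls = [
    "https://example.org/2023-yearly-report",
    "https://example.org/careers",
    "https://example.org/blog/team-retreat-photos",
]

# Embed the task once, then compare it to each URL string.
task_vec = model.encode(task, convert_to_tensor=True)
scores = {u: util.cos_sim(task_vec, model.encode(u, convert_to_tensor=True)).item() for u in urls}

# Crawl the highest-scoring URLs first (or drop those below a cut-off).
queue = sorted(urls, key=scores.get, reverse=True)
print(queue)
```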

BradKML commented 1 month ago

@OdinMB assuming you do know that the scraper only targets the most important sites or sections, and that the page title may or may not be in the URL path (e.g. they use page IDs instead of name-based paths), how would that be handled? Task-based URL filter => page embedding for topic => SLM/LLM for individual information? Sorry, this leans slightly more towards blogs, online educational resources, and educational forums rather than auditing organizations, but it might be related.

OdinMB commented 1 month ago

> @OdinMB assuming you do know that the scraper only targets the most important sites or sections, and that the page title may or may not be in the URL path (e.g. they use page IDs instead of name-based paths), how would that be handled? Task-based URL filter => page embedding for topic => SLM/LLM for individual information? Sorry, this leans slightly more towards blogs, online educational resources, and educational forums rather than auditing organizations, but it might be related.

I only use the feature conservatively. It should order URLs from top to bottom:

That way, informative URLs with relevant content are prioritized without discarding anything with unclear content.

unclecode commented 1 month ago

> Thanks for creating alternatives to FireCrawl for LLMs! Here is a bit of a question: are there examples or shortcuts for crawling a whole blog (which may or may not have things like Cloudflare)?
>
>   1. How would the speed of crawling be managed such that the crawler won't be blocked?
>   2. Could the out-links to other articles also be captured (just in case, for context)?
>   3. How can individual articles be separated from paginated web indexes?
>   4. Are there ways to hack infinite scrolling for blogs without a proper sitemap?

@BradKML Thank you for your questions about our web crawling capabilities. I'll address each of your points:

> 1. How would the speed of crawling be managed such that the crawler won't be blocked?

Our current library focuses on single URL crawling, not full-scale scraping. However, we're developing a scraping engine with configurable parameters. One key feature is implementing random delays between requests, following a specific statistical distribution. We're also incorporating at least five different techniques to respect website policies and avoid blocking. Our goal is to provide a robust solution that balances efficiency with responsible crawling practices.
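As a stopgap until that engine lands, here is a minimal sketch of the random-delay idea, with the delay bounds as illustrative values and crawl_one standing in for any single-URL crawl call:

```python
import asyncio
import random

# Minimal sketch of polite, rate-limited crawling: sleep a random interval
# between requests. crawl_one is any coroutine that fetches a single URL.
async def crawl_politely(crawl_one, urls, min_delay=2.0, max_delay=6.0):
    """Crawl URLs one by one, sleeping a random interval between requests."""
    results = []
    for url in urls:
        results.append(await crawl_one(url))
        # Uniform jitter; a production engine might sample from another
        # distribution and back off further on HTTP 429 responses.
        await asyncio.sleep(random.uniform(min_delay, max_delay))
    return results
```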

> 2. Could the out-links to other articles also be captured (just in case, for context)?

Absolutely. Both external and internal links are available in the crawling results. This means you can build your own scraper using this information without waiting for our full scraper engine. Our upcoming scraper engine will essentially implement a graph search algorithm, allowing for more comprehensive link exploration and context gathering.
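A rough breadth-first sketch built on those per-result links, assuming they are exposed as internal/external lists of dicts with an href field (attribute names may differ between versions; the starting URL and depth limit are placeholders):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

# Breadth-first crawl that follows the internal links returned with each
# result, up to a fixed depth.
async def crawl_blog(start_url, max_depth=2):
    seen, queue, pages = {start_url}, [(start_url, 0)], []
    async with AsyncWebCrawler() as crawler:
        while queue:
            url, depth = queue.pop(0)
            result = await crawler.arun(url=url)
            pages.append(result)
            if depth < max_depth:
                for link in result.links.get("internal", []):
                    href = link.get("href")
                    if href and href not in seen:
                        seen.add(href)
                        queue.append((href, depth + 1))
    return pages

pages = asyncio.run(crawl_blog("https://example.com/blog"))
```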

> 3. How can individual articles be separated from paginated web indexes?

I'd appreciate some clarification on this question. Are you referring to extracting specific content from pages with multiple articles? Or perhaps distinguishing between full articles and summary snippets on index pages? If you could provide more context or a specific example, I'd be happy to offer a more targeted solution or explain how our upcoming features might address this challenge.

> 4. Are there ways to hack infinite scrolling for blogs without a proper sitemap?

Our crawler actually provides a powerful solution for this. Users can pass custom JavaScript code to our crawler, which we execute before proceeding with the crawling. This gives you full control over the page without becoming dependent on our library.

For infinite scrolling scenarios, you have several options:

  1. Continue scrolling until you reach certain criteria. For example, you could use a language model to assess data quality at various stages of scrolling, stopping when quality thresholds are met. This is especially effective if you use a fine-tuned small language model as a classifier to decide whether you need more data or already have enough.
  2. Define specific stopping criteria, such as reaching a certain number of tokens or finding specific keywords.
  3. Continue until the very end!

Or implement creative control mechanisms based on your unique requirements. Remember, you can execute JS on the page and react based on the result.

This approach allows for flexible handling of infinite scrolling sites, even without a proper sitemap. You can tailor the crawling process to your specific needs and data quality standards.
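A sketch of that approach for an infinite-scroll page, using the js_code hook to scroll a few times before extraction (the scroll count, delay, URL, and stopping criterion are illustrative; check the parameter names against your installed version):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

# Scroll a fixed number of times, pausing between scrolls, then let the
# crawler extract whatever content has loaded.
SCROLL_SCRIPT = """
(async () => {
    for (let i = 0; i < 5; i++) {                        // simple stopping criterion
        window.scrollTo(0, document.body.scrollHeight);  // jump to the bottom
        await new Promise(r => setTimeout(r, 1500));     // give new posts time to load
    }
})();
"""

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",  # placeholder URL
            js_code=SCROLL_SCRIPT,
        )
        print(len(result.markdown))  # check how much content was captured

asyncio.run(main())
```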

Based on my understanding of your fourth question, I've provided this explanation. However, please let me know if I've missed anything or if you need further clarification on any point. I'm here to provide any additional guidance you might need.

unclecode commented 1 month ago

@OdinMB

> Idea for an advanced feature that I found valuable in some projects: Allow users to specify an LLM and a prompt describing what content they're interested in. Based on these parameters, the app can identify the most relevant pages (and PDFs) instead of just going through the full queue of URLs that the crawler found.

Yes, this is in our backlog. I like to call it "agentic crawling" or "smart crawling"; please check my reply in this issue: https://github.com/unclecode/crawl4ai/issues/131#issuecomment-2408917025

BradKML commented 1 month ago

@unclecode for question 3, I am referring to how blog navigation has paginated display pages that contain clippings of blog entries. Those are not as useful as the blog entry pages themselves. Here are a few more problems:

  1. Assuming the blog data will be aggregated (multiple blogs and sources), how can tags be generated for blogs such that they can be used in a Zettelkasten or BASB system (e.g. LogSeq)?
  2. Are there similarities between document embedding vectors for AI processing and human-readable document summaries for human browsing?
  3. How would it handle paragraph-level data and document-level data?
yogeshchandrasekharuni commented 4 weeks ago

Would love to contribute and help accelerate the "smart crawler" - pls pull me into the discord - yogeshchandrasekharuni@gmail.com

unclecode commented 3 weeks ago

@yogeshchandrasekharuni You are very welcome, I will send an invitation link very soon.

unclecode commented 3 weeks ago

@BradKML Following up on your questions:

3. Handling Paginated Blog Layouts

For blog layouts that show article snippets/previews, we recommend a two-phase crawling approach:

  1. First, crawl the index pages to collect all article URLs
  2. Then perform deep nested crawling to fetch the full content of each article

We also recommend creating an embedding layer and storing these in a database, as this enables semantic querying capabilities later on. This approach ensures you capture both the structure and full content while maintaining searchability.
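A rough sketch of that two-phase flow, with an in-memory list standing in for a real vector database; the "/blog/" article marker, embedding model, and links structure are assumptions:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

async def crawl_blog_archive(index_url, article_marker="/blog/"):
    store = []  # swap for a real vector database in practice
    async with AsyncWebCrawler() as crawler:
        # Phase 1: collect article URLs from the index page.
        index = await crawler.arun(url=index_url)
        article_urls = {
            link.get("href")
            for link in index.links.get("internal", [])
            if article_marker in link.get("href", "")
        }
        # Phase 2: fetch each article and embed its full content.
        for url in article_urls:
            page = await crawler.arun(url=url)
            store.append({
                "url": url,
                "markdown": page.markdown,
                "embedding": model.encode(page.markdown),
            })
    return store

store = asyncio.run(crawl_blog_archive("https://example.com/blog"))
```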

5. Tags for Zettelkasten/BASB Systems

There are several approaches available in our library for generating tags:

6. Document Embeddings vs Human-Readable Summaries

We implement a hybrid approach combining statistical methods and LLMs:

  1. LLM-based approach:

    • Extract content
    • Pass to language model
    • Generate human-readable summaries
  2. Semantic clustering approach:

    • Use cosine similarity to identify semantic chunks
    • Summarize each segment independently
    • Combine into a cohesive summary

This hybrid approach leverages both statistical methods and LLMs, with the natural language instructions bringing in human-centric understanding.
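A minimal sketch of that hybrid flow, with summarize() as a placeholder for whatever LLM call you prefer and the embedding model and similarity threshold as illustrative choices:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_segments(paragraphs, threshold=0.5):
    """Group consecutive paragraphs whose embeddings stay similar."""
    vectors = model.encode(paragraphs, convert_to_tensor=True)
    segments, current = [], [paragraphs[0]]
    for i in range(1, len(paragraphs)):
        # Start a new segment where similarity between neighbours drops.
        if util.cos_sim(vectors[i - 1], vectors[i]).item() < threshold:
            segments.append(current)
            current = []
        current.append(paragraphs[i])
    segments.append(current)
    return segments

def summarize(text):
    return text[:200]  # placeholder: call your LLM of choice here

def hybrid_summary(document):
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    if not paragraphs:
        return ""
    # Summarize each semantic segment independently, then combine.
    return "\n\n".join(summarize(" ".join(seg)) for seg in semantic_segments(paragraphs))
```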

7. Paragraph vs Document-Level Processing

Our library offers flexible chunking strategies:

Default approach:

Customization options:

The cosine similarity approach helps cluster documents into semantically relevant chunks, ensuring coherent content organization regardless of the chosen chunking strategy.
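A hedged sketch of picking a chunking granularity with the library's strategy objects; RegexChunking and CosineStrategy ship with crawl4ai, but the exact arguments below (and the URL) should be treated as illustrative and checked against the docs for your version:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import RegexChunking
from crawl4ai.extraction_strategy import CosineStrategy

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/blog/some-post",             # placeholder
            chunking_strategy=RegexChunking(patterns=[r"\n\n"]),  # paragraph-level chunks
            extraction_strategy=CosineStrategy(
                semantic_filter="machine learning",  # topic to cluster around
                word_count_threshold=20,             # drop very short chunks
            ),
        )
        print(result.extracted_content)

asyncio.run(main())
```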