[Feat] Scrape markdown response should ideally be split into separate chunks / parts

ChrisMeye commented 3 months ago

Problem Description: Right now the scrape endpoint returns one huge string containing the complete markdown of a URL. Storing the whole page in a vector store as a single embedding is not ideal. It's better to break the page into sections.

Proposed Feature: I would like an option, e.g. a URL parameter such as &chunks=true, that returns an array of strings representing the structure of the page. Ideally the split would be context-based, but at minimum it should follow the HTML structure.

Implementation Suggestions: A simple and fast way might be to split on runs of whitespace or line breaks, or on other characteristics that likely mark a new section. But clearly the better approach would be to pass the markdown through an LLM and get back chunks that are contextually close.
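To make the "simple and fast" option concrete, here is a minimal Python sketch that splits a markdown string wherever several blank lines occur in a row. The function name `split_on_blank_lines` and the `min_blank_lines` threshold are hypothetical, chosen for illustration only; nothing here is part of the existing scrape endpoint.

```python
import re


def split_on_blank_lines(markdown: str, min_blank_lines: int = 2) -> list[str]:
    """Split markdown wherever at least `min_blank_lines` blank lines occur in a row.

    Hypothetical helper: the threshold is an assumption, not a project default.
    """
    # One newline ends the current line; each following "optional spaces + newline"
    # is one blank line. Require at least `min_blank_lines` of them.
    pattern = r"\n(?:[ \t]*\n){%d,}" % min_blank_lines
    parts = re.split(pattern, markdown)
    # Drop empty pieces and trim surrounding whitespace from each chunk.
    return [part.strip() for part in parts if part.strip()]


sample = "Intro text\n\n\nSection one\nmore text\n\n\n\nSection two"
chunks = split_on_blank_lines(sample)
```

With the default threshold, a single blank line (an ordinary paragraph break) does not start a new chunk; only larger gaps do. Whether large gaps really mark sections depends on how the page was converted, so this is at best a cheap fallback.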

Use Case: It is generally good advice for LLMs and vector stores not to take the whole page at once and store one embedding per page, but instead to have smaller chunks / parts to choose from. In my opinion this is therefore a very important part of the project.

mattjoyce commented 3 months ago

@ChrisMeye, the idea has merit, but the problem is that there are a lot of different chunking strategies, and the right one depends on the specific needs. What strategy do you think would be a good default?

mattjoyce commented 3 months ago

Interesting reading: 5_Levels_Of_Text_Splitting, Embedding short and long content

rafaelsideguide commented 3 months ago

Duplicate #241

rafaelsideguide commented 2 months ago

@calebpeffer's:

Problem Description: Users need to detect distinct sections within a webpage and have these sections represented in the markdown output.

Proposed Feature: Implement a feature that identifies distinct sections in a webpage and includes these sections in the markdown output.

Use HTML structure (e.g., headings) to detect sections.
Update the markdown generator to include section markers.

Use Case: It helps users navigate and understand content structure, making it easier to extract relevant information from long articles.

Additional Context: Similar to section detection in content management systems and markdown editors.
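As a rough illustration of heading-based section detection, the Python sketch below splits already-generated markdown at ATX headings (lines starting with one to six `#` characters). It assumes the HTML headings survive conversion as markdown headings; the name `split_by_headings` is hypothetical and this is not how the project's markdown generator works today.

```python
import re

# ATX heading: 1-6 '#' characters at the start of a line, then a space and some text.
HEADING = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)


def split_by_headings(markdown: str) -> list[str]:
    """Return one chunk per heading-delimited section (hypothetical helper)."""
    starts = [match.start() for match in HEADING.finditer(markdown)]
    # Text before the first heading becomes its own leading section.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(markdown)]
    sections = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [section for section in sections if section]


page = "Intro\n\n# A\ntext a\n\n## B\ntext b\n"
sections = split_by_headings(page)
```

One known limitation of this sketch: a line such as `# comment` inside a fenced code block would be treated as a heading, so a real implementation would need to track code fences or work from the HTML structure directly.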

Qualzz commented 2 months ago

This could be combined with onlyIncludeTags.

For example:

<div class="toto">text A</div> <div class="toto">text B</div> <div class="toto">text C</div>

If I use onlyIncludeTags, the result will be Text A Text B Text C, which is not ideal.

An output that looks like

markdown: [
"Text A",
"Text B",
"Text C"
]

would be tremendously better to use.
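The per-element output described above could be sketched in Python with the standard library's `html.parser`. The class `ClassTextCollector` and the function `split_by_class` are hypothetical names for illustration; the sketch collects plain text only and does not reproduce what onlyIncludeTags or the markdown conversion actually do.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect the text of each element carrying a given class as a separate chunk."""

    def __init__(self, class_name: str) -> None:
        super().__init__()
        self.class_name = class_name
        self.chunks: list[str] = []
        self._depth = 0  # > 0 while inside a matching element
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1  # nested element inside a match
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:  # the matching element just closed
            self.chunks.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def split_by_class(html: str, class_name: str) -> list[str]:
    collector = ClassTextCollector(class_name)
    collector.feed(html)
    collector.close()
    return collector.chunks


html = ('<div class="toto">text A</div> <div class="toto">text B</div> '
        '<div class="toto">text C</div>')
result = split_by_class(html, "toto")
```

Each matching element yields its own array entry instead of being concatenated into one string, which is the shape requested in the comment above.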

Qualzz commented 1 month ago

Is there anything new on this? Or is it not planned?

mendableai / firecrawl

[Feat] Scrape markdown response should ideally be split into separate chunks / parts #245