infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0
18.54k stars, 1.88k forks

[Feature Request] Using RAGFlow for Document Preprocessing with custom chunking strategies #568

Open JahnKhan opened 5 months ago

JahnKhan commented 5 months ago

Describe your problem

Hi, I am currently working on a project where the way documents are segmented into chunks is crucial and varies depending on the task at hand. For example, in a dictionary, it is useful to segment the text into word-and-explanation pairs. I am interested in using RAGFlow for the preprocessing phase of my project. Specifically, I would like to know:

  1. Can RAGFlow be configured to perform custom chunking of documents? For instance, can it segment documents based on specific delimiters or structural patterns unique to the content being processed?
  2. Is it possible to use RAGFlow solely for preprocessing data, where I can specify how the documents should be chunked?

I would like a tool that can preprocess my documents and show me visually how the chunks are created, marking them on the document itself so I can see how it is segmented and, if necessary, adjust a chunk by marking a smaller or larger text area.

Thank you very much.

yingfeng commented 5 months ago

Hi,

  1. Currently, RAGFlow cannot apply a customized chunking approach. Chunking by a pattern is not a difficult requirement, though, so perhaps we could provide that later.
  2. We are going to provide an API for that purpose, which will return the chunked results.
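
Until that is available, pattern-based chunking of plain text can be sketched outside RAGFlow. The snippet below is only an illustration using Python's standard `re` module. It does not use any RAGFlow API, and the names `Chunk` and `chunk_by_pattern` are made up for this example. It assumes each chunk starts where a boundary pattern matches (for a dictionary, a headword followed by a colon at the start of a line) and it keeps character offsets so that a viewer could mark each chunk on the original text.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    start: int  # offset of the first character of the chunk in the source text
    end: int    # offset one past the last character of the chunk
    text: str   # always equal to source[start:end]


def chunk_by_pattern(text: str, boundary: str) -> list[Chunk]:
    """Split `text` into chunks, each beginning where `boundary` matches.

    Hypothetical helper, not part of RAGFlow. `boundary` is a regular
    expression; `^` matches at the start of every line (re.MULTILINE).
    """
    starts = [m.start() for m in re.finditer(boundary, text, flags=re.MULTILINE)]
    # Keep any text that precedes the first boundary as its own chunk.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)

    chunks = []
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        piece = text[begin:end]
        stripped = piece.strip()
        if not stripped:
            continue  # skip whitespace-only segments
        # Shift the offsets so they point at the stripped text exactly.
        lead = len(piece) - len(piece.lstrip())
        chunks.append(Chunk(begin + lead, begin + lead + len(stripped), stripped))
    return chunks
```

For a dictionary-style file, calling `chunk_by_pattern(text, r"^\w+:")` would yield one chunk per entry, with continuation lines (which do not start with a headword) staying attached to the entry above them. The `start` and `end` offsets are what a highlighting UI would need to draw or adjust chunk boundaries by hand.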