infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0

[Question]: What is the format of the text chunks? How is the rank of retrieved chunks decided? #1349

Open tao2021 opened 2 months ago

tao2021 commented 2 months ago

Describe your problem

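After a document is uploaded, in what format are the parsed results stored? And after a question is entered, what determines the priority in which the document chunks are presented? Are they sorted by document?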

JinHai-CN commented 2 months ago

We intend to create an international community, so we encourage using English for communication.

  1. Chunks are stored as plain text.
  2. Chunks are ranked using embedding similarity and BM25 similarity.

tao2021 commented 2 months ago

Does this sorting method have anything to do with the documents passed in? Or do all the documents get parsed and stored in a large pool, and then extracted based on their similarity to the question? For example, if document A and document B are passed in, will the difference between the documents affect the calculation of similarity?

JinHai-CN commented 2 months ago

You may want to review how the RAG process works: retrieval is not affected by the order in which documents are ingested. As for the second question, the differences between documents won't affect the similarity calculation itself, but they will affect the rank of the retrieved chunks.
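
To illustrate the point about scores versus rank, here is a minimal sketch, not RAGFlow's actual implementation. It assumes all chunks sit in one shared pool and that each chunk is scored against the question on its own. The toy bag-of-words `embed`, the term-overlap `keyword_score` (a simplified stand-in for BM25), the function names, and the `vector_weight=0.7` value are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (hypothetical stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def keyword_score(query, chunk):
    """Fraction of query terms found in the chunk (simplified stand-in for BM25)."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


def hybrid_score(query, chunk, vector_weight=0.7):
    """Weighted sum of the two similarities; the weight is an assumed value."""
    vec = cosine(embed(query), embed(chunk))
    kw = keyword_score(query, chunk)
    return vector_weight * vec + (1 - vector_weight) * kw


def retrieve(query, pool, top_k=3):
    """Score every (doc_id, text) chunk in the pool and return the best top_k."""
    scored = [(hybrid_score(query, text), doc_id, text) for doc_id, text in pool]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:top_k]


# Chunks from two hypothetical documents, A and B.
doc_a = [
    ("A", "chunks are stored as plain text"),
    ("A", "the server listens on port 80"),
]
doc_b = [("B", "retrieved chunks are ranked by similarity")]
query = "how are chunks ranked"

only_a = retrieve(query, doc_a)          # pool contains document A only
both = retrieve(query, doc_a + doc_b)    # pool contains documents A and B

for score, doc_id, text in both:
    print(f"{score:.3f}  [{doc_id}]  {text}")
```

In this sketch, the score of each chunk from document A is the same whether or not document B is in the pool, but its position in the result list can change once B's chunks compete with it.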