[Request] Include file metadata in RAG

muhanstudio commented 3 days ago

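🥰 Description

Uploading and parsing a single document already works well enough, but there is one obvious gap: the file's metadata, such as its name and size, is not included with the question. With ChatGPT's official document parsing, for example, we can ask the AI what kind of file we uploaded, what its name and extension are, and how large it is, and it knows the answers. The RAG in Lobe does not carry this information, so the model has no knowledge of the file's metadata and is missing part of its reference data. I hope this relatively simple feature can be added so that the AI better understands the uploaded file as a whole and knows what file it is, rather than only splitting the file and adding the chunks to the context.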

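🧐 Proposed solution

We could modify the prompt so that the document's metadata is included with the chunks we send, which would help the AI understand what kind of file we are asking about.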
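As a rough illustration of the idea, the sketch below prefixes each retrieved chunk with a one-line header describing its source file before the chunks are joined into the prompt context. The `FileMeta` and `RetrievedChunk` shapes, the function names, and the header format are all assumptions made for this example; none of them are taken from the Lobe codebase.

```typescript
// Hypothetical shapes for illustration only (not Lobe's actual types).
interface FileMeta {
  name: string; // e.g. "report.pdf"
  sizeBytes: number; // raw file size in bytes
}

interface RetrievedChunk {
  file: FileMeta;
  text: string;
}

// Extract the lowercase extension from a file name ("" when there is none).
function extensionOf(name: string): string {
  const dot = name.lastIndexOf(".");
  return dot > 0 ? name.slice(dot + 1).toLowerCase() : "";
}

// Render a byte count in a human-readable unit.
function formatSize(bytes: number): string {
  if (bytes < 1024) return `${bytes} B`;
  if (bytes < 1024 * 1024) return `${(bytes / 1024).toFixed(1)} KB`;
  return `${(bytes / (1024 * 1024)).toFixed(1)} MB`;
}

// Prefix every chunk with a header line naming its source file, type, and size,
// then join the chunks into a single context string.
function buildContext(chunks: RetrievedChunk[]): string {
  return chunks
    .map((c) => {
      const ext = extensionOf(c.file.name) || "unknown";
      const header = `[file: ${c.file.name} | type: ${ext} | size: ${formatSize(c.file.sizeBytes)}]`;
      return `${header}\n${c.text}`;
    })
    .join("\n\n");
}
```

With headers like these in the context, a question such as "what file did I upload?" can be answered from the prompt itself, at the cost of a few extra tokens per chunk.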

📝 Additional information

No response

lobehubbot commented 3 days ago

👀 @muhanstudio

Thank you for raising an issue. We will look into the matter and get back to you as soon as possible. Please make sure you have given us as much context as possible.

arvinxx commented 3 days ago

This seems easy to add, but how does adding this metadata actually help users? Are there any more solid use cases?

muhanstudio commented 3 days ago

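> This seems easy to add, but how does adding this metadata actually help users? Are there any more solid use cases?

The most basic use case is multi-document Q&A, where the AI can be pointed at a specific document. For example, if several documents are uploaded in one conversation and I want a summary of just one of them, I can say "summarize the xxx document for me." The AI can use the file name to tell the chunks apart and summarize only the chunks from that document, so nothing gets mixed up with the other documents.

Sometimes the same word or sentence is defined differently in different documents. Two documents might contain different network architecture definitions, or be different technical manuals. With file names we can ask about each one separately: what does the network architecture look like in document xxx, and how is a certain technical term written or described in document xxxx?

The extension enables even more. For example, I could tell the AI to comment and refactor all the py programs I uploaded according to the instructions and requirements in the xxx.pdf I uploaded (perhaps the company's coding standard). Distinguishing the roles of different documents lets the AI use one document to operate on another. All of this is easy on the ChatGPT website, but not everyone has Plus or is able to subscribe, and most people can only use the API. I believe this would be a very good feature with strong use cases.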
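The scenarios above could also be supported on the retrieval side once each chunk records its file name. The sketch below is one possible approach under that assumption: it keeps only the chunks whose file name appears in the question, and separately selects chunks by extension. The `NamedChunk` type and both function names are hypothetical, and plain substring matching is a deliberate simplification.

```typescript
// Hypothetical chunk shape: each chunk remembers which file it came from.
interface NamedChunk {
  fileName: string;
  text: string;
}

// Return the chunks whose file name is mentioned in the question.
// When no known file name is mentioned, fall back to all chunks.
function chunksForQuestion(question: string, chunks: NamedChunk[]): NamedChunk[] {
  const q = question.toLowerCase();
  const mentioned = new Set(
    chunks.map((c) => c.fileName).filter((name) => q.includes(name.toLowerCase())),
  );
  return mentioned.size === 0 ? chunks : chunks.filter((c) => mentioned.has(c.fileName));
}

// Return the chunks that come from files with the given extension (e.g. "py").
function chunksWithExtension(ext: string, chunks: NamedChunk[]): NamedChunk[] {
  const suffix = "." + ext.toLowerCase();
  return chunks.filter((c) => c.fileName.toLowerCase().endsWith(suffix));
}
```

For the "refactor my py files according to xxx.pdf" case, the chunks from `chunksForQuestion` would supply the rules and the chunks from `chunksWithExtension("py", ...)` would supply the code to change.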

muhanstudio commented 3 days ago

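> This seems easy to add, but how does adding this metadata actually help users? Are there any more solid use cases?

Beyond that, something I often do on the official site is upload several business statistics documents and have the AI report on each month's performance separately. Because my files are usually named by date or quarter, I can simply ask how the business performed in month x or quarter x and have it generate a report. I can also just give a specific date, and the AI automatically matches it to a file name and summarizes that file.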

muhanstudio commented 20 hours ago

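I found another fairly mature project, https://github.com/danny-avila/rag_api, which provides file-level embedding. While chatting with documents, it can clearly tell different documents apart, giving behavior closer to the document parsing that the official sites now commonly offer, rather than treating the conversation as one large knowledge base full of mixed document fragments. For users who only work through the interface, it is enough for the model to have a good understanding of the specified document. This simple approach effectively improves the accuracy of RAG retrieval and citation and comes close to the experience of chatting with an uploaded document on the official site. I hope you will consider it.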

https://github.com/danny-avila/rag_api/blob/9c65628789e6efe0a44877dfa4e5ec1e8e11dc31/README.md?plain=1#L8
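To make the file-level idea concrete, here is a minimal sketch of similarity search restricted to a single file. It is not taken from rag_api or from Lobe; the `EmbeddedChunk` type, the in-memory array used as a store, and the function names are assumptions chosen for illustration, and a real system would more likely push the file filter down into its vector database.

```typescript
// Hypothetical in-memory record: each embedded chunk is tagged with a file id.
interface EmbeddedChunk {
  fileId: string;
  text: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length (0 if either is all zeros).
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Rank only the chunks that belong to the given file and return the top K.
function searchWithinFile(
  store: EmbeddedChunk[],
  fileId: string,
  queryEmbedding: number[],
  topK: number,
): EmbeddedChunk[] {
  return store
    .filter((c) => c.fileId === fileId)
    .map((c) => ({ chunk: c, score: cosine(c.embedding, queryEmbedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((s) => s.chunk);
}
```

Filtering before ranking means a chunk from another document can never outrank the document the user asked about, which is the mixing problem described above.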


lobehub / lobe-chat