labring / FastGPT

FastGPT is a knowledge-base platform built on LLMs that offers a comprehensive suite of out-of-the-box capabilities such as data processing, RAG retrieval, and visual AI workflow orchestration, letting you easily develop and deploy complex question-answering systems without extensive setup or configuration.
https://fastgpt.in

Return the data IDs after writing data to the knowledge base #800

Open · suwubee opened this issue 7 months ago

suwubee commented 7 months ago

Routine checks

Feature description: dataset/data/pushData demonstrates writing data in batches. Could it return the specific IDs of the written data?

Use case: Writing to the knowledge base makes it possible to record certain customer-feedback data, or adjustments to knowledge-base content, and that requires the specific IDs to be returned. Batch writing could just as well be achieved with multiple repeated requests, so personally I don't think it is worth giving up the returned ID field for the sake of batch writes.

Related examples
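For reference, a minimal sketch of the call this issue is about. The endpoint path comes from the issue itself and the request fields (collectionId, q, a) follow FastGPT's documented push API, but the base URL and environment variables are placeholders, and the insertedIds field in the commented response is the requested behavior, not what the API currently returns.

```ts
// Sketch of the pushData request; `insertedIds` below is the *proposed*
// response field, not the current API.
const res = await fetch(
  `${process.env.FASTGPT_BASE_URL}/api/core/dataset/data/pushData`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.FASTGPT_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      collectionId: "<collection id>", // target knowledge-base collection
      data: [
        { q: "Customer feedback A", a: "resolution notes" },
        { q: "Customer feedback B", a: "resolution notes" },
      ],
    }),
  }
);

console.log(await res.json());
// Today the response reports only a count, e.g. { insertLen: 2 }.
// The feature request is for something like:
//   { insertLen: 2, insertedIds: ["...", "..."] }
```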


c121914yu commented 7 months ago

This is a bit difficult. Inserting data is not synchronous: records are first pushed into a training queue and only inserted after training completes, so there is no way to get the inserted IDs in real time.
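To make that constraint concrete, here is a rough TypeScript sketch (not FastGPT's actual source; all names are illustrative) of the flow being described: the API handler only enqueues, and the database insert that creates the IDs happens later in a background worker.

```ts
type TrainingItem = { q: string; a?: string };

const trainingQueue: TrainingItem[] = [];

// 1. Request time: the handler only enqueues, so all it can report is
//    how many items were accepted. No database ids exist yet.
function pushData(items: TrainingItem[]): { insertLen: number } {
  trainingQueue.push(...items);
  return { insertLen: items.length };
}

// Stubs standing in for the real embedding model and database.
async function embed(text: string): Promise<number[]> {
  return [text.length]; // placeholder vector
}
const insertedRows: object[] = [];

// 2. Some time later, in the background: ids would only come into
//    existence at this insert, long after pushData has returned.
async function trainWorker(): Promise<void> {
  while (trainingQueue.length > 0) {
    const item = trainingQueue.shift()!;
    const vector = await embed(item.q);
    insertedRows.push({ ...item, vector }); // real code: collection.insertOne(...)
  }
}
```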

suwubee commented 7 months ago

> This is a bit difficult. Inserting data is not synchronous: records are first pushed into a training queue and only inserted after training completes, so there is no way to get the inserted IDs in real time.

In my opinion that is all the more reason to pre-allocate placeholder IDs. Otherwise data gets inserted, but the outcome, and even the ordering, is uncontrollable. When hooking up precise data at the API level this easily leads to confusion, and records inserted in a targeted way can no longer be matched up and retrieved.
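A hypothetical variant of the queue sketch above, matching this suggestion: mint an ID per record at enqueue time and return it immediately. `reservedId` and `pushDataWithIds` are made-up names for illustration, not existing FastGPT fields.

```ts
import { randomUUID } from "node:crypto";

type QueuedItem = { reservedId: string; q: string; a?: string };
const trainingQueue: QueuedItem[] = [];

// Reserve an id per record up front, so the caller can correlate every
// submitted record with its stored counterpart later on.
function pushDataWithIds(items: { q: string; a?: string }[]) {
  const queued = items.map((item) => ({ ...item, reservedId: randomUUID() }));
  trainingQueue.push(...queued);
  return {
    insertLen: queued.length,
    insertedIds: queued.map((item) => item.reservedId), // stable before training runs
  };
}
```

Whether the reserved id is a UUID or a database-native id is an implementation detail; the point is that it exists before training runs.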


suwubee commented 7 months ago

I can't speak for other products, but my understanding of vectorizing data at scale is this: if I queue up 100 records with pre-assigned IDs for vectorization, I definitely need the corresponding post-vectorization IDs, so that later maintenance, deletion, and re-vectorization of the text data are all easy to handle. Otherwise the ingested data can't be indexed or located, and then it can't be scaled up or made more precise either.


c121914yu commented 7 months ago

> I can't speak for other products, but my understanding of vectorizing data at scale is this: if I queue up 100 records with pre-assigned IDs for vectorization, I definitely need the corresponding post-vectorization IDs, so that later maintenance, deletion, and re-vectorization of the text data are all easy to handle. Otherwise the ingested data can't be indexed or located, and then it can't be scaled up or made more precise either.

No, currently the IDs are all generated automatically by the database; they are never determined externally.

suwubee commented 7 months ago

> I can't speak for other products, but my understanding of vectorizing data at scale is this: if I queue up 100 records with pre-assigned IDs for vectorization, I definitely need the corresponding post-vectorization IDs, so that later maintenance, deletion, and re-vectorization of the text data are all easy to handle. Otherwise the ingested data can't be indexed or located, and then it can't be scaled up or made more precise either.

> No, currently the IDs are all generated automatically by the database; they are never determined externally.

Please still return a reserved ID. Right now, once data is thrown in, there is no way to find it again.


c121914yu commented 7 months ago

> I can't speak for other products, but my understanding of vectorizing data at scale is this: if I queue up 100 records with pre-assigned IDs for vectorization, I definitely need the corresponding post-vectorization IDs, so that later maintenance, deletion, and re-vectorization of the text data are all easy to handle. Otherwise the ingested data can't be indexed or located, and then it can't be scaled up or made more precise either.

> No, currently the IDs are all generated automatically by the database; they are never determined externally.

> Please still return a reserved ID. Right now, once data is thrown in, there is no way to find it again.

Pre-reserving IDs isn't possible, because it would run into distributed-deployment problems. We'll have to think about another approach.

suwubee commented 7 months ago

> I can't speak for other products, but my understanding of vectorizing data at scale is this: if I queue up 100 records with pre-assigned IDs for vectorization, I definitely need the corresponding post-vectorization IDs, so that later maintenance, deletion, and re-vectorization of the text data are all easy to handle. Otherwise the ingested data can't be indexed or located, and then it can't be scaled up or made more precise either.

> No, currently the IDs are all generated automatically by the database; they are never determined externally.

> Please still return a reserved ID. Right now, once data is thrown in, there is no way to find it again.

> Pre-reserving IDs isn't possible, because it would run into distributed-deployment problems. We'll have to think about another approach.

Set an extra reserved pid field in MongoDB, and fill in the other fields' values once vectorization finishes; that is just an update to the same record. Logically the auto-generated id has already been replicated from primary to secondaries by the time it is reserved, unless the call returns before the replicas have synced the vector, but in practice it won't be that fast.
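A minimal sketch of the two-phase write proposed here, using the official mongodb Node.js driver. The driver mints `_id` values client-side, so insertOne can report them immediately even though vectorization finishes later. The collection and field names (dataset_datas, status, vector) are assumptions for illustration, not FastGPT's actual schema.

```ts
import { MongoClient, ObjectId } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
await client.connect();
const col = client.db("fastgpt").collection("dataset_datas");

// Phase 1: reserve the record at push time; the caller gets the id at once.
async function reserve(q: string, a?: string): Promise<ObjectId> {
  const { insertedId } = await col.insertOne({ q, a, status: "pending" });
  return insertedId; // the driver generated this _id client-side
}

// Phase 2: once the training queue has produced the embedding, fill it in.
async function complete(id: ObjectId, vector: number[]): Promise<void> {
  await col.updateOne({ _id: id }, { $set: { vector, status: "ready" } });
}
```

Because ObjectIds are minted locally (timestamp plus random bytes plus a counter), two app instances can reserve ids concurrently without coordination, which is one possible answer to the distributed-deployment concern raised above.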
