labring / FastGPT

FastGPT is a knowledge-base platform built on LLMs that offers a comprehensive suite of out-of-the-box capabilities such as data processing, RAG retrieval, and visual AI workflow orchestration, letting you easily develop and deploy complex question-answering systems without extensive setup or configuration.
https://fastgpt.in

Return the data IDs after writing data to the knowledge base #800

Open · suwubee opened this issue 7 months ago

suwubee commented 7 months ago

Routine checks

Feature description: dataset/data/pushData demonstrates writing data in batches. Could it return the specific IDs of the written data?

Use case: Writing to the knowledge base makes it possible to record certain customer-feedback data, or adjustments to knowledge-base content, and that requires the specific IDs to be returned. Batch writing could just as well be achieved with multiple repeated requests, so personally I don't think it is worth giving up the returned ID field for the sake of batch writes.

Related examples
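For reference, a minimal sketch of the call this issue is about. The endpoint path comes from the issue itself and the request fields (collectionId, q, a) follow FastGPT's documented push API, but the base URL and environment variables are placeholders, and the insertedIds field in the commented response is the requested behavior, not what the API currently returns.

```ts
// Sketch of the pushData request; `insertedIds` below is the *proposed*
// response field, not the current API.
const res = await fetch(
  `${process.env.FASTGPT_BASE_URL}/api/core/dataset/data/pushData`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.FASTGPT_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      collectionId: "<collection id>", // target knowledge-base collection
      data: [
        { q: "Customer feedback A", a: "resolution notes" },
        { q: "Customer feedback B", a: "resolution notes" },
      ],
    }),
  }
);

console.log(await res.json());
// Today the response reports only a count, e.g. { insertLen: 2 }.
// The feature request is for something like:
//   { insertLen: 2, insertedIds: ["...", "..."] }
```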


c121914yu commented 7 months ago

This is a bit difficult. Inserting data is not synchronous: records are first pushed into a training queue and only inserted after training completes, so there is no way to get the inserted IDs in real time.
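To make that constraint concrete, here is a rough TypeScript sketch (not FastGPT's actual source; all names are illustrative) of the flow being described: the API handler only enqueues, and the database insert that creates the IDs happens later in a background worker.

```ts
type TrainingItem = { q: string; a?: string };

const trainingQueue: TrainingItem[] = [];

// 1. Request time: the handler only enqueues, so all it can report is
//    how many items were accepted. No database ids exist yet.
function pushData(items: TrainingItem[]): { insertLen: number } {
  trainingQueue.push(...items);
  return { insertLen: items.length };
}

// Stubs standing in for the real embedding model and database.
async function embed(text: string): Promise<number[]> {
  return [text.length]; // placeholder vector
}
const insertedRows: object[] = [];

// 2. Some time later, in the background: ids would only come into
//    existence at this insert, long after pushData has returned.
async function trainWorker(): Promise<void> {
  while (trainingQueue.length > 0) {
    const item = trainingQueue.shift()!;
    const vector = await embed(item.q);
    insertedRows.push({ ...item, vector }); // real code: collection.insertOne(...)
  }
}
```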

suwubee commented 7 months ago

> This is a bit difficult. Inserting data is not synchronous: records are first pushed into a training queue and only inserted after training completes, so there is no way to get the inserted IDs in real time.

In my opinion that is all the more reason to pre-allocate placeholder IDs. Otherwise data gets inserted, but the outcome, and even the ordering, is uncontrollable. When hooking up precise data at the API level this easily leads to confusion, and records inserted in a targeted way can no longer be matched up and retrieved.
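A hypothetical variant of the queue sketch above, matching this suggestion: mint an ID per record at enqueue time and return it immediately. `reservedId` and `pushDataWithIds` are made-up names for illustration, not existing FastGPT fields.

```ts
import { randomUUID } from "node:crypto";

type QueuedItem = { reservedId: string; q: string; a?: string };
const trainingQueue: QueuedItem[] = [];

// Reserve an id per record up front, so the caller can correlate every
// submitted record with its stored counterpart later on.
function pushDataWithIds(items: { q: string; a?: string }[]) {
  const queued = items.map((item) => ({ ...item, reservedId: randomUUID() }));
  trainingQueue.push(...queued);
  return {
    insertLen: queued.length,
    insertedIds: queued.map((item) => item.reservedId), // stable before training runs
  };
}
```

Whether the reserved id is a UUID or a database-native id is an implementation detail; the point is that it exists before training runs.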


suwubee commented 7 months ago

I can't speak for other products, but my understanding of vectorizing data at scale is this: if I queue up 100 records with pre-assigned IDs for vectorization, I definitely need the corresponding post-vectorization IDs, so that later maintenance, deletion, and re-vectorization of the text data are all easy to handle. Otherwise the ingested data can't be indexed or located, and then it can't be scaled up or made more precise either.


c121914yu commented 7 months ago

> I can't speak for other products, but my understanding of vectorizing data at scale is this: if I queue up 100 records with pre-assigned IDs for vectorization, I definitely need the corresponding post-vectorization IDs, so that later maintenance, deletion, and re-vectorization of the text data are all easy to handle. Otherwise the ingested data can't be indexed or located, and then it can't be scaled up or made more precise either.

No, currently the IDs are all generated automatically by the database; they are never determined externally.

suwubee commented 7 months ago

> I can't speak for other products, but my understanding of vectorizing data at scale is this: if I queue up 100 records with pre-assigned IDs for vectorization, I definitely need the corresponding post-vectorization IDs, so that later maintenance, deletion, and re-vectorization of the text data are all easy to handle. Otherwise the ingested data can't be indexed or located, and then it can't be scaled up or made more precise either.

> No, currently the IDs are all generated automatically by the database; they are never determined externally.

Please still return a reserved ID. Right now, once data is thrown in, there is no way to find it again.


c121914yu commented 7 months ago

> I can't speak for other products, but my understanding of vectorizing data at scale is this: if I queue up 100 records with pre-assigned IDs for vectorization, I definitely need the corresponding post-vectorization IDs, so that later maintenance, deletion, and re-vectorization of the text data are all easy to handle. Otherwise the ingested data can't be indexed or located, and then it can't be scaled up or made more precise either.

> No, currently the IDs are all generated automatically by the database; they are never determined externally.

> Please still return a reserved ID. Right now, once data is thrown in, there is no way to find it again.

Pre-reserving IDs isn't possible, because it would run into distributed-deployment problems. We'll have to think about another approach.

suwubee commented 7 months ago

> I can't speak for other products, but my understanding of vectorizing data at scale is this: if I queue up 100 records with pre-assigned IDs for vectorization, I definitely need the corresponding post-vectorization IDs, so that later maintenance, deletion, and re-vectorization of the text data are all easy to handle. Otherwise the ingested data can't be indexed or located, and then it can't be scaled up or made more precise either.

> No, currently the IDs are all generated automatically by the database; they are never determined externally.

> Please still return a reserved ID. Right now, once data is thrown in, there is no way to find it again.

> Pre-reserving IDs isn't possible, because it would run into distributed-deployment problems. We'll have to think about another approach.

Set an extra reserved pid field in MongoDB, and fill in the other fields' values once vectorization finishes; that is just an update to the same record. Logically the auto-generated id has already been replicated from primary to secondaries by the time it is reserved, unless the call returns before the replicas have synced the vector, but in practice it won't be that fast.
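A minimal sketch of the two-phase write proposed here, using the official mongodb Node.js driver. The driver mints `_id` values client-side, so insertOne can report them immediately even though vectorization finishes later. The collection and field names (dataset_datas, status, vector) are assumptions for illustration, not FastGPT's actual schema.

```ts
import { MongoClient, ObjectId } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
await client.connect();
const col = client.db("fastgpt").collection("dataset_datas");

// Phase 1: reserve the record at push time; the caller gets the id at once.
async function reserve(q: string, a?: string): Promise<ObjectId> {
  const { insertedId } = await col.insertOne({ q, a, status: "pending" });
  return insertedId; // the driver generated this _id client-side
}

// Phase 2: once the training queue has produced the embedding, fill it in.
async function complete(id: ObjectId, vector: number[]): Promise<void> {
  await col.updateOne({ _id: id }, { $set: { vector, status: "ready" } });
}
```

Because ObjectIds are minted locally (timestamp plus random bytes plus a counter), two app instances can reserve ids concurrently without coordination, which is one possible answer to the distributed-deployment concern raised above.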
