langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Performance Issue with Batch Insert in /datasets/{dataset_id}/documents/{document_id}/segments API #10094

Open Angelxiaotao opened 3 weeks ago

Angelxiaotao commented 3 weeks ago


Dify version

0.10.1

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

1. Use the /datasets/{dataset_id}/documents/{document_id}/segments API to insert a batch of segments.
2. Attempt to insert more than 200 segments in a single request.
3. Observe the slowdown as the segment count approaches or exceeds 1,000 entries.
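A minimal reproduction sketch (assuming a self-hosted service API at localhost and the segments payload shape from the Dify service API docs; the API key, IDs, and segment contents below are placeholders):

```python
import requests

BASE_URL = "http://localhost:5001/v1"  # assumption: self-hosted service API base URL
API_KEY = "dataset-xxxxxxxx"           # placeholder dataset API key
DATASET_ID = "your-dataset-id"         # placeholder
DOCUMENT_ID = "your-document-id"       # placeholder

# Build a batch large enough (>200 segments) to trigger the slowdown.
segments = [{"content": f"segment {i}"} for i in range(1000)]

resp = requests.post(
    f"{BASE_URL}/datasets/{DATASET_ID}/documents/{DOCUMENT_ID}/segments",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"segments": segments},
    timeout=600,
)
print(resp.status_code, resp.elapsed)  # resp.elapsed is the round-trip time
```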

✔️ Expected Behavior

Could you provide guidance on optimizing this process, or consider implementing performance improvements for batch insert operations on this endpoint? Support for higher throughput or more efficient bulk operations would be highly beneficial for users with large datasets.

❌ Actual Behavior

Inserting batches of more than 200 segments leads to slow performance, which becomes especially problematic when handling large datasets that require high-volume segment insertions.

dosubot[bot] commented 3 weeks ago

To address the performance issues with batch insert operations in the /datasets/{dataset_id}/documents/{document_id}/segments API, especially when dealing with large datasets, consider the following optimizations:

  1. Asynchronous Processing: The current implementation uses asynchronous processing with a task queue to handle batch inserts, so the API can return a response immediately instead of waiting for the entire batch to be processed. Ensure this is configured and actually used, so that processing is offloaded from the main application thread [1] (a Celery sketch follows below this list).

  2. Task Queue Configuration: The task queue is built on Celery. Raising the worker concurrency allows more tasks to be processed simultaneously, which can improve throughput depending on your system's resources; for example, increasing the concurrency from 1 to a higher number [2]. The startup flag is shown in the Celery sketch below.

  3. Database Indexes: The document_segments table carries several indexes. Indexes are needed for query performance, but each one adds overhead to every insert; review them to confirm they are all necessary and optimize where possible [3] (a query for listing them follows below).

  4. Batch Size: Where possible, insert segments in smaller batches. The system is designed to handle large batches, but splitting them into smaller chunks reduces per-request load and can improve responsiveness (see the chunked-insert sketch below).

  5. Monitoring and Logging: Add detailed logging and monitoring to pinpoint where time is spent during inserts (database writes, embedding, vector indexing); per-batch timings are included in the chunked-insert sketch below.

These strategies should help optimize batch insert operations and improve the endpoint's overall performance.
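A minimal sketch of points 1 and 2, assuming a Redis broker; the task name, the bulk_insert placeholder, and the broker URL are illustrative, not Dify's actual internals. The --concurrency flag is standard Celery:

```python
from celery import Celery

# Illustrative broker URL; a self-hosted Dify setup typically runs Redis.
app = Celery("dify_tasks", broker="redis://localhost:6379/1")

def bulk_insert(dataset_id: str, document_id: str, segments: list) -> None:
    # Placeholder for the real insert + vector-indexing logic.
    print(f"inserting {len(segments)} segments into {dataset_id}/{document_id}")

@app.task(bind=True, max_retries=3)
def batch_create_segments(self, dataset_id, document_id, segments):
    """Persist and index a batch of segments off the request thread."""
    try:
        bulk_insert(dataset_id, document_id, segments)
    except Exception as exc:
        # Retry transient failures instead of dropping the batch.
        raise self.retry(exc=exc, countdown=5)

# The API handler enqueues the work and returns immediately:
#   batch_create_segments.delay(dataset_id, document_id, segments)
#
# For point 2, raise worker concurrency at startup (standard Celery flag):
#   celery -A tasks worker --concurrency=4
```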
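For point 3, one way to review the indexes on document_segments in the backing PostgreSQL database (the connection settings below are placeholders):

```python
import psycopg2

# Placeholder connection settings for a self-hosted Dify PostgreSQL instance.
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="dify", user="postgres", password="secret"
)
with conn, conn.cursor() as cur:
    # pg_indexes lists every index definition on the table.
    cur.execute(
        "SELECT indexname, indexdef FROM pg_indexes WHERE tablename = %s",
        ("document_segments",),
    )
    for name, ddl in cur.fetchall():
        print(name)
        print(" ", ddl)
conn.close()
```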
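For points 4 and 5 together, a client-side sketch that splits the payload into sub-batches of at most 200 segments and logs per-batch timings (same assumed endpoint and placeholder credentials as the reproduction sketch above):

```python
import time
import requests

BASE_URL = "http://localhost:5001/v1"   # assumption: self-hosted service API base URL
API_KEY = "dataset-xxxxxxxx"            # placeholder dataset API key
BATCH_SIZE = 200                        # stay at or below the size where the slowdown starts

def insert_in_batches(dataset_id, document_id, segments):
    url = f"{BASE_URL}/datasets/{dataset_id}/documents/{document_id}/segments"
    headers = {"Authorization": f"Bearer {API_KEY}"}
    for start in range(0, len(segments), BATCH_SIZE):
        chunk = segments[start:start + BATCH_SIZE]
        t0 = time.perf_counter()
        resp = requests.post(url, headers=headers, json={"segments": chunk}, timeout=600)
        resp.raise_for_status()
        # Point 5: per-batch timings make regressions visible as the table grows.
        print(f"batch {start // BATCH_SIZE}: {len(chunk)} segments "
              f"in {time.perf_counter() - t0:.2f}s")

segments = [{"content": f"segment {i}"} for i in range(1000)]
insert_in_batches("your-dataset-id", "your-document-id", segments)
```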

To continue talking to Dosu, mention @dosu.

JohnJyong commented 3 weeks ago

Please describe what the performance degradation refers to: database processing time, API response time, or embedding time.

Angelxiaotao commented 3 weeks ago

@JohnJyong We are seeing very slow responses from this endpoint. Our server has a 4-core CPU, 30 GB of RAM, a 500 GB disk, and a GPU with 16 GB of memory. With this configuration, inserting more than 200 segments in bulk significantly prolongs the API response time, which hurts throughput when processing large volumes of data. Do you have any optimization suggestions or plans for further performance improvements?