langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Performance Issue with Batch Insert in /datasets/{dataset_id}/documents/{document_id}/segments API #10094

Open Angelxiaotao opened 2 hours ago

Angelxiaotao commented 2 hours ago

Dify version

0.10.1

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

Use the /datasets/{dataset_id}/documents/{document_id}/segments API to insert a batch of segments. Attempt to insert more than 200 segments in a single request. Observe the performance slowdown as the segment count approaches or exceeds 1000 entries.
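The reproduction can be sketched as a single oversized POST to the segments endpoint. This is a minimal stdlib-only sketch; the base URL, API key, and the exact body fields beyond `content` are placeholders/assumptions, not confirmed Dify API details.

```python
# Sketch of the batch-insert call that exhibits the slowdown.
# Base URL, API key, and extra body fields are placeholders (assumptions).
import json
import urllib.request


def build_segments_payload(contents):
    """Wrap raw texts in the request body shape the segments endpoint expects."""
    return {"segments": [{"content": c} for c in contents]}


def insert_segments(base_url, api_key, dataset_id, document_id, contents):
    payload = build_segments_payload(contents)
    req = urllib.request.Request(
        f"{base_url}/v1/datasets/{dataset_id}/documents/{document_id}/segments",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (placeholders): 1000 segments in one request is where throughput degrades.
# texts = [f"segment {i}" for i in range(1000)]
# insert_segments("http://localhost:5001", "<api-key>", "<dataset-id>", "<document-id>", texts)
```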

✔️ Expected Behavior

Could you provide guidance on optimizing this process, or consider implementing performance improvements for batch insert operations on this endpoint? Support for higher throughput or more efficient bulk operations would be highly beneficial for users with large datasets.

❌ Actual Behavior

Inserting batches of more than 200 segments leads to slow performance, which becomes especially problematic when handling large datasets that require high-volume segment insertions.

dosubot[bot] commented 2 hours ago

To address the performance issues with batch insert operations in the /datasets/{dataset_id}/documents/{document_id}/segments API, especially when dealing with large datasets, consider the following optimizations:

  1. Asynchronous Processing: The current implementation uses asynchronous processing with a task queue to handle batch inserts. This allows the API to return a response immediately without waiting for the entire batch to be processed. Ensure that this is properly configured and utilized to offload processing from the main application thread [1].

  2. Task Queue Configuration: The task queue is configured using Celery. You can adjust the concurrency level to allow more tasks to be processed simultaneously, which might improve throughput. For example, increasing the concurrency level from 1 to a higher number could help, depending on your system's resources [2].

  3. Database Indexes: The document_segments table has several indexes that could impact performance. While indexes are necessary for query performance, they can slow down insert operations. Review these indexes to ensure they are necessary and consider optimizing them if possible [3].

  4. Batch Size: If possible, try to insert segments in smaller batches. While the system is designed to handle large batches, breaking them into smaller chunks might reduce the load and improve performance.

  5. Monitoring and Logging: Implement detailed logging and monitoring to identify specific bottlenecks during the insert process. This can provide insights into where the performance issues are occurring and help in fine-tuning the system.
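Point 4 above can be implemented entirely on the client side. A minimal sketch, assuming a chunk size of 200 to mirror the threshold reported in this issue (the `insert_segments` helper in the usage comment is hypothetical):

```python
# Client-side chunking sketch: split a large segment list into several
# smaller requests instead of one oversized call. The chunk size of 200
# mirrors the threshold reported in this issue.
def chunked(items, size=200):
    """Yield successive fixed-size slices of items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Usage: issue one API call per chunk rather than one call for everything.
# for batch in chunked(all_segments, 200):
#     insert_segments(base_url, api_key, dataset_id, document_id, batch)  # hypothetical helper
```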

These strategies should help in optimizing the batch insert operations and improving the overall performance of the API.
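The monitoring suggested in point 5 can be as simple as timing each batch insert so slow spans can be attributed to queueing, database writes, or index maintenance. A pure-stdlib sketch; the labels and call sites are illustrative, not Dify internals:

```python
# Timing/logging sketch for point 5: measure how long each batch insert
# takes. Pure stdlib; labels and call sites are illustrative only.
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("segments.batch")


@contextmanager
def timed(label):
    """Log the wall-clock duration of the wrapped block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log.info("%s took %.3fs", label, time.perf_counter() - start)


# Usage:
# with timed("insert 200 segments"):
#     insert_segments(base_url, api_key, dataset_id, document_id, batch)  # hypothetical helper
```

Comparing these timings across batch sizes (e.g. 100 vs. 500 vs. 1000) would show whether the slowdown grows linearly or worse, which helps decide between chunking and server-side fixes.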
