Closed: elliotthwang closed this issue 7 months ago.
Hey @elliotthwang, great to see you back! 😄 Let me dive into this intriguing issue you've brought up, and I'll get back to you with some insights shortly. 🚀
DataDrivenInvestor's Amaresh Patnaik published the article "RAG using GGUF": https://medium.datadriveninvestor.com/rag-using-gguf-a6a1bae49592
The performance metrics you've shared indicate that the majority of the time is spent in the `eval time` and `total time` categories, suggesting that the bottleneck might be in the processing of the LLM requests and the overall processing of the chat messages. Here are some strategies to improve these areas:
**Optimize Chat History Management**: The `ChatMemoryBuffer` class manages chat history using a buffer strategy. Handling that history more efficiently, such as by limiting the number of messages stored or processed at any given time, can improve performance. Efficient token counting, batched tokenization, a caching strategy, data structure optimization, and asynchronous operations are also viable strategies.
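As a rough illustration of capping the buffer, here is a minimal sketch assuming a recent llama-index where `ChatMemoryBuffer` lives under `llama_index.core.memory`; the `token_limit` value and the pre-built `index` are illustrative assumptions, not recommendations:

```python
from llama_index.core.memory import ChatMemoryBuffer

# Cap how much chat history is replayed to the model on every turn.
# 1500 is an illustrative value, not a recommendation.
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

# `index` is assumed to be a previously built VectorStoreIndex.
chat_engine = index.as_chat_engine(chat_mode="context", memory=memory)
```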
**Asynchronous Processing**: Implementing asynchronous processing can improve responsiveness. Python's `asyncio` library can be used for asynchronous operations. Modifying the `_generate_response` method to process chat responses asynchronously can prevent blocking the main thread and enhance user experience.
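A minimal sketch of the idea; `_generate_response` here is a hypothetical stand-in for the application's own blocking method, wrapped with `asyncio.to_thread` (Python 3.9+) so the event loop is not blocked while llama.cpp evaluates:

```python
import asyncio

def _generate_response(chat_engine, message: str) -> str:
    # Hypothetical stand-in for the app's blocking LLM call.
    return str(chat_engine.chat(message))

async def generate_response_async(chat_engine, message: str) -> str:
    # Run the blocking call in a worker thread so the event loop
    # (and therefore the UI) stays responsive during evaluation.
    return await asyncio.to_thread(_generate_response, chat_engine, message)
```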
**Batch Processing for LLM Requests**: Batching LLM requests can reduce the overhead of sending and receiving requests from the server. The LlamaIndex codebase employs an asynchronous programming model with semaphores for concurrency control, allowing for the parallel processing of multiple requests while limiting the number of concurrent operations.
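A sketch of that semaphore pattern (not LlamaIndex's internal code); the concurrency limit is an illustrative assumption and `chat_engine` is assumed to exist, as in the earlier sketch:

```python
import asyncio

MAX_CONCURRENT_REQUESTS = 4  # illustrative; tune to what the local GGUF backend can handle
semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

async def query_with_limit(chat_engine, message: str) -> str:
    # The semaphore lets requests run in parallel while capping how
    # many hit the local model at the same time.
    async with semaphore:
        return await asyncio.to_thread(lambda: str(chat_engine.chat(message)))

async def run_batch(chat_engine, messages: list[str]) -> list[str]:
    # Dispatch a batch of chat messages concurrently.
    return await asyncio.gather(*(query_with_limit(chat_engine, m) for m in messages))
```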
**Optimize Tools Usage**: Ensure tools like `ArxivToolSpec` and `WikipediaToolSpec` are optimized for performance. Caching responses to avoid repeated requests for the same information can also contribute to better performance.
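A small caching sketch, assuming the `llama-index-tools-wikipedia` package and its `load_data(page=...)` method; the cache size is illustrative:

```python
from functools import lru_cache

from llama_index.tools.wikipedia import WikipediaToolSpec  # package: llama-index-tools-wikipedia

wiki_spec = WikipediaToolSpec()

@lru_cache(maxsize=256)
def cached_wikipedia_page(page: str) -> str:
    # Repeated questions about the same page hit the in-process cache
    # instead of triggering another network request.
    return str(wiki_spec.load_data(page=page))
```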
**Frontend Performance**: Since Gradio is used for the frontend, optimizing frontend components for performance is crucial. This includes minimizing the amount of data transferred between the backend and the frontend, optimizing rendering times, and efficiently updating the UI based on user interactions.
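One way to make the UI feel responsive is to stream partial output instead of waiting for the full GGUF response; a sketch assuming a LlamaIndex chat engine with `stream_chat` and Gradio's `ChatInterface`, with `chat_engine` assumed to be built elsewhere:

```python
import gradio as gr

def stream_reply(message, history):
    # Yield partial text as tokens arrive so the UI updates incrementally
    # instead of blocking until the whole answer is generated.
    streaming_response = chat_engine.stream_chat(message)
    partial = ""
    for token in streaming_response.response_gen:
        partial += token
        yield partial

demo = gr.ChatInterface(stream_reply)
demo.launch()
```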
**Profiling and Monitoring**: Use profiling tools to identify bottlenecks in your application. Python's `cProfile` module or third-party tools like Py-Spy can help you understand where your application spends most of its time. Monitoring tools can also help you track performance over time and identify areas for improvement.
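A quick profiling sketch using only the standard library; `handle_chat_turn` is a hypothetical wrapper around one request/response cycle, and `chat_engine` is again assumed from the earlier sketches:

```python
import cProfile
import pstats

def handle_chat_turn() -> str:
    # Hypothetical wrapper around one full chat request/response cycle.
    return str(chat_engine.chat("What does the GGUF article recommend?"))

cProfile.run("handle_chat_turn()", "chat_profile.out")
stats = pstats.Stats("chat_profile.out")
stats.sort_stats("cumulative").print_stats(20)  # show the 20 heaviest call paths
```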
Implementing these changes requires a good understanding of asynchronous programming in Python and might necessitate significant changes to both the backend logic and how the frontend interacts with it.
Sorry, but there is no clear action item from this GitHub issue.
GGUF can be slow, depending on the hardware and model used, as well as how much data you are sending to the LLM.
@logan-markewich ServiceContext already delivers real-time online chat on GGUF; please keep it available as a legacy option.
Feature Description
RAG using GGUF to build a real-time chat application.
Reason
After adopting the new Settings-based configuration with GGUF, it takes 32 minutes to come up with a response, which is unacceptable for a real-time chat application. DataDrivenInvestor's Amaresh Patnaik published the article "RAG using GGUF" to show:
I strongly suggest keeping ServiceContext as a legacy option so that real-time chat remains possible. The following figures show excellent ServiceContext performance:
llama_print_timings: load time = 440.50 ms
llama_print_timings: sample time = 83.53 ms / 100 runs (0.84 ms per token, 1197.19 tokens per second)
llama_print_timings: prompt eval time = 148.09 ms / 37 tokens (4.00 ms per token, 249.85 tokens per second)
llama_print_timings: eval time = 1629.66 ms / 99 runs (16.46 ms per token, 60.75 tokens per second)
llama_print_timings: total time = 2330.94 ms / 136 tokens
Value of Feature
Bring LLMs closer to AGI to support mobile QA (quality assurance) of daily operations in any organization.