run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

RAG using GGUF building a real time chat application #12320

Closed elliotthwang closed 7 months ago

elliotthwang commented 7 months ago

Feature Description

RAG using GGUF building a real time chat application

Reason

With the new Settings-based configuration on GGUF, it takes 32 minutes to produce a response, which is unacceptable for a real-time chat application. DataDrivenInvestor's Amaresh Patnaik published the article "RAG using GGUF" to show:

RAG using GGUF

I strongly suggest keeping ServiceContext available as a legacy option so that real-time chat remains possible. The following figures show ServiceContext performing well:

llama_print_timings: load time = 440.50 ms
llama_print_timings: sample time = 83.53 ms / 100 runs (0.84 ms per token, 1197.19 tokens per second)
llama_print_timings: prompt eval time = 148.09 ms / 37 tokens (4.00 ms per token, 249.85 tokens per second)
llama_print_timings: eval time = 1629.66 ms / 99 runs (16.46 ms per token, 60.75 tokens per second)
llama_print_timings: total time = 2330.94 ms / 136 tokens
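For context, here is a minimal sketch of pointing the newer Settings object at a local GGUF model, assuming llama-index 0.10.x with the llama-index-llms-llama-cpp package installed; the model path, data directory, and n_gpu_layers value are placeholders, and a local embedding model would also need to be configured for a fully offline setup:

```python
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.llama_cpp import LlamaCPP

# Load a local GGUF model via llama-cpp-python (placeholder path).
llm = LlamaCPP(
    model_path="./models/your-model.gguf",
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    # Offload layers to the GPU if one is available; leaving everything
    # on CPU is a common cause of very slow responses.
    model_kwargs={"n_gpu_layers": -1},
    verbose=True,  # prints llama_print_timings like the figures above
)

Settings.llm = llm  # replaces the legacy ServiceContext-based configuration

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
print(index.as_query_engine().query("What is this document about?"))
```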

Value of Feature

Move LLMs a step closer to AGI that can support mobile QA (quality assurance) for daily operations in any organization.

dosubot[bot] commented 7 months ago

Hey @elliotthwang, great to see you back! 😄 Let me dive into this intriguing issue you've brought up, and I'll get back to you with some insights shortly. 🚀

elliotthwang commented 7 months ago

DataDrivenInvestor's Amaresh Patnaik published the article "RAG using GGUF": https://medium.datadriveninvestor.com/rag-using-gguf-a6a1bae49592

dosubot[bot] commented 7 months ago

The performance metrics you've shared indicate that the majority of the time is spent in the eval time and total time categories, suggesting that the bottleneck might be in the processing of the LLM requests and the overall processing of the chat messages. Here are some strategies to improve these areas:

  1. Optimize Chat History Management: The ChatMemoryBuffer class manages chat history using a buffer strategy. Limiting how much history is stored or replayed on each turn reduces prompt size and improves response time (see the ChatMemoryBuffer sketch after this list). Efficient token counting, batched tokenization, caching, data structure optimization, and asynchronous operations can also help.

  2. Asynchronous Processing: Asynchronous processing can improve responsiveness. Python's asyncio library can be used here; making the _generate_response path asynchronous keeps the main thread from blocking while the model generates and improves the user experience (see the streaming sketch at the end of this comment).

  3. Batch Processing for LLM Requests: Batching LLM requests can reduce the overhead of sending and receiving them. The LlamaIndex codebase uses an asynchronous programming model with semaphores for concurrency control, allowing multiple requests to be processed in parallel while capping the number of concurrent operations (see the semaphore sketch after this list).

  4. Optimize Tools Usage: Ensure tools like ArxivToolSpec and WikipediaToolSpec are optimized for performance. Caching responses to avoid repeated requests for the same information can also contribute to better performance.

  5. Frontend Performance: Since Gradio is used for the frontend, optimizing frontend components for performance is crucial. This includes minimizing the amount of data transferred between the backend and the frontend, optimizing rendering times, and efficiently updating the UI based on user interactions.

  6. Profiling and Monitoring: Use profiling tools to identify bottlenecks in your application. Python's cProfile module or third-party tools like Py-Spy can help you understand where your application spends most of its time (a cProfile sketch follows this list). Monitoring tools can also help you track performance over time and identify areas for improvement.
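As a concrete illustration of point 1, here is a rough sketch of capping the chat history with ChatMemoryBuffer, assuming an existing index such as the one built in the earlier sketch; the token limit is an arbitrary example value:

```python
from llama_index.core.memory import ChatMemoryBuffer

# Cap how much history is replayed to the LLM on each turn; a smaller
# limit means fewer prompt tokens and a shorter prompt-eval time.
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
)
print(chat_engine.chat("Summarize the document in one sentence."))
```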
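Point 3 mentions semaphores for concurrency control; here is a generic sketch of that pattern using only asyncio, assuming a chat engine that exposes an async achat method (the concurrency limit is a placeholder):

```python
import asyncio

async def bounded_chat(chat_engine, messages, max_concurrency=4):
    # Let several requests overlap their waiting time while capping
    # how many hit the LLM at once.
    sem = asyncio.Semaphore(max_concurrency)

    async def one(message):
        async with sem:
            return await chat_engine.achat(message)

    return await asyncio.gather(*(one(m) for m in messages))

# responses = asyncio.run(bounded_chat(chat_engine, ["q1", "q2", "q3"]))
```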
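And for point 6, Python's built-in cProfile is enough for a first look at where the time goes; a minimal sketch wrapped around a single chat turn:

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
response = chat_engine.chat("What does the document say about response latency?")
profiler.disable()

# Show the 20 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumtime").print_stats(20)
```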

Implementing these changes requires a good understanding of asynchronous programming in Python and might necessitate significant changes to both the backend logic and how the frontend interacts with it.
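Along those lines, here is a hedged sketch of streaming a response asynchronously, assuming the chat engine exposes astream_chat and that the returned object provides an async token generator (as recent LlamaIndex versions do); forwarding each token to the UI is left as a placeholder comment:

```python
import asyncio

async def answer(chat_engine, message: str) -> str:
    # Returns as soon as generation starts; tokens arrive incrementally
    # instead of blocking until the full response is ready.
    streaming_response = await chat_engine.astream_chat(message)
    parts = []
    async for token in streaming_response.async_response_gen():
        parts.append(token)  # push each token to the UI here
    return "".join(parts)

# print(asyncio.run(answer(chat_engine, "hello")))
```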

logan-markewich commented 7 months ago

Sorry, but there is no clear action item from this github issue.

GGUF can be slow, depending on the hardware and model used, as well as how much data you are sending to the LLM.

elliotthwang commented 7 months ago

@logan-markewich ServiceContext already achieves real-time online chat with GGUF; please just keep it available as a legacy option.