Hi,
First of all, many thanks for this fantastic lib.
I'm encountering a delay before an answer starts being formulated. This happens with a custom GGML model. Text generation itself is blazing fast (both with llama.cpp and through ctransformers, especially with layers offloaded to GPU), which makes me think the model isn't at fault.
With chatdocs I see an ~8s delay before each answer when chatting. Is this simply the cost of the QA step? Or is there something wrong with the way I calculate embeddings to search against the vector store?
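To help isolate where the time goes, something like the following minimal timing sketch could separate the embedding and retrieval steps from generation. The library choices here (sentence-transformers and chromadb, which I believe are close to chatdocs' defaults) and the model name are assumptions; adjust them to match the actual configuration:

```python
import time

# Hypothetical timing harness: measure embedding and vector-store lookup
# separately from LLM generation. Embedder and store are assumptions,
# not necessarily what chatdocs uses internally.
from sentence_transformers import SentenceTransformer
import chromadb

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
client = chromadb.Client()
collection = client.get_or_create_collection("docs")

query = "What does the document say about X?"

# Time the query-embedding step on its own.
t0 = time.perf_counter()
query_embedding = embedder.encode(query).tolist()
t1 = time.perf_counter()

# Time the vector-store similarity search on its own.
results = collection.query(query_embeddings=[query_embedding], n_results=4)
t2 = time.perf_counter()

print(f"embedding: {t1 - t0:.2f}s, retrieval: {t2 - t1:.2f}s")
```

If both numbers come out well under a second, the ~8s would have to be spent elsewhere (e.g. prompt-processing the retrieved context before generation starts).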
Many thanks
Hardware: