Hi, it shouldn't be too slow for the base reranker. I have a significantly smaller system (3060 vs. 10GB VRAM) and it takes under 1 second. How long does it take in your case?
You can try reducing the number of documents to retrieve, using this parameter: https://github.com/snexus/llm-search/blob/fc69a69f504459ff64d59fb85696b46d640611e7/sample_templates/obsidian_conf.yaml#L46
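For example, something along these lines in the config - the key name here is illustrative, so check the linked obsidian_conf.yaml for the actual parameter:

```yaml
# Illustrative snippet only -- the real parameter name/location is in the
# linked obsidian_conf.yaml.
semantic_search:
  max_k: 5   # retrieve fewer documents, so the reranker scores fewer passages
```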
@snexus apparently it was some server issue. It is working fine now. Thank you!
Do you have plans to add vLLM support?
It is supported indirectly, since vLLM has an OpenAI-compatible server implementation - https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server
Basically you can start vLLM in server mode and connect using this package, specifying the local endpoint. This package was also tested to work with https://github.com/BerriAI/litellm which supports vLLM, Ollama and many other models and frameworks.
Here is an example of how to configure the connection (it should work with LiteLLM and vLLM's OpenAI server) - https://github.com/snexus/llm-search/blob/main/sample_templates/llm/litellm.yaml
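As a rough sketch (the field names here are illustrative - the linked litellm.yaml sample has the actual schema), the idea is to point the LLM section of the config at the local OpenAI-compatible endpoint instead of api.openai.com:

```yaml
# Illustrative sketch only -- see sample_templates/llm/litellm.yaml for the
# real field names used by llmsearch.
llm:
  type: litellm                               # hypothetical type value
  params:
    model_kwargs:
      model: openai/mixtral-local             # hypothetical model alias
      api_base: http://localhost:8000/v1      # local vLLM / LiteLLM server
      api_key: "not-needed"                   # local servers typically ignore the key
```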
Also, if possible, I would like to get some feedback on the quality of the responses - are you getting value from the package? What % of responses would you say are useful/accurate?
@snexus can I use vLLM instead of llama-cpp-python for inference with mixtral? That would be great.
The responses are quite good - I would say the accuracy is around 80%. The problem is that when I use TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF with llama-cpp, it gives me an error on page reload: ValueError(f"Failed to load model from file: {path_model}"). Because of this I am not able to use the large Mixtral model and had to switch to TheBloke/Mistral-7B-Instruct-v0.2-GGUF, which doesn't give me the error, but the accuracy decreased. LangChain vLLM support would be great, since it also has distributed inference and is much faster than llama-cpp-python.
You can try installing vLLM and running it in OpenAI server mode locally. Then you can configure llmsearch to work with that endpoint, as I described in my previous message.
An alternative approach would be to use LiteLLM in OpenAI server mode, with llmsearch configured for LiteLLM's endpoint. LiteLLM supports vLLM and many other frameworks.
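For reference, a LiteLLM proxy config along these lines can route requests to a locally running vLLM OpenAI-compatible server (the model name, port and alias below are just examples):

```yaml
# Example LiteLLM proxy config (model name, port and alias are illustrative).
# Start the proxy with `litellm --config config.yaml`, then point llmsearch's
# OpenAI-compatible endpoint at the proxy URL.
model_list:
  - model_name: mixtral-local                 # alias that llmsearch would request
    litellm_params:
      model: openai/mistralai/Mixtral-8x7B-Instruct-v0.1
      api_base: http://localhost:8000/v1      # vLLM's OpenAI-compatible server
      api_key: "none"
```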
Great. Thanks @snexus !
One more suggestion: it would be great to add LlamaParse for PDFs, as I heard it's very good for data like tables.
Could you please open a separate issue for that? I will explore LlamaParse.
@snexus for sure.
Are we ok to close this, @mohammad-yousuf?
Hi,
I have been using the tool, but the problem is that the re-ranking model "bge-reranker-base" is too slow. If I use "macro", there is a significant decrease in accuracy. Do you have any suggestions on how I can optimize this?
My hardware: