snexus / llm-search

Querying local documents, powered by LLM

Re-ranking model too slow #96

Closed mohammad-yousuf closed 3 months ago

mohammad-yousuf commented 4 months ago

Hi,

I have been using the tool, but the problem is that the re-ranking model "bge-reranker-base" is too slow. If I use "macro" instead, there is a significant decrease in accuracy. Do you have any suggestions on how I can optimize this?

My hardware:

snexus commented 4 months ago

Hi, it shouldn't be too slow for the base reranker. I have a significantly smaller system (3060 vs 10GB VRAM) and re-ranking takes under 1 second. How long does it take in your case?

You can try reducing the number of documents to retrieve, using this parameter: https://github.com/snexus/llm-search/blob/fc69a69f504459ff64d59fb85696b46d640611e7/sample_templates/obsidian_conf.yaml#L46
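
For illustration, a minimal sketch of what that could look like in the config - the key names here are illustrative and may not match the template exactly, so check the linked `obsidian_conf.yaml` for the actual setting:

```yaml
# Illustrative sketch only -- the key name is assumed; see the linked
# sample_templates/obsidian_conf.yaml for the real parameter.
semantic_search:
  max_k: 10   # assumed key: fewer retrieved chunks means less work for the re-ranker
```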

mohammad-yousuf commented 4 months ago

@snexus apparently it was some server issue. It is working fine now. Thank you!

Do you have plans to add vLLM support?

snexus commented 4 months ago

It is supported indirectly, since vLLM provides an OpenAI-compatible server implementation - https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server

Basically, you can start vLLM in server mode and connect to it from this package by specifying the local endpoint. This package has also been tested with https://github.com/BerriAI/litellm, which supports vLLM, Ollama and many other models and frameworks.

Here is an example of how to configure the connection (it should work with LiteLLM and vLLM's OpenAI server) - https://github.com/snexus/llm-search/blob/main/sample_templates/llm/litellm.yaml
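
For illustration, a rough sketch of what such a config could look like - the field names below are assumptions modelled on a generic OpenAI-style setup, so treat the linked litellm.yaml as the authoritative template:

```yaml
# Sketch only -- field names are assumed; see sample_templates/llm/litellm.yaml
# for the real schema.
#
# Start vLLM's OpenAI-compatible server first, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1
llm:
  type: openai                 # assumed: treat the local server as an OpenAI-compatible endpoint
  params:
    model_kwargs:
      openai_api_base: "http://localhost:8000/v1"          # local vLLM (or LiteLLM) endpoint
      openai_api_key: "not-needed"                         # dummy key for a local server
      model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"   # must match the model being served
```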

snexus commented 4 months ago

Also, if possible, I would like to get some feedback on the quality of the responses - are you getting value from the package? What % of responses would you say are useful/accurate?

mohammad-yousuf commented 4 months ago

@snexus can I use vLLM instead of llama-cpp-python for inference with mixtral? That would be great.

The responses are quite good - I would say the accuracy is around 80%. The problem is that when I use TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF with llama-cpp, it gives me an error on page reload: ValueError(f"Failed to load model from file: {path_model}"). Because of this I am not able to use the large Mixtral model and had to switch to TheBloke/Mistral-7B-Instruct-v0.2-GGUF, which doesn't give me the error, but the accuracy decreased. LangChain vLLM support would be great, since it offers distributed inference as well and is much faster than llama-cpp-python.

snexus commented 4 months ago

You can try installing vLLM and running it locally in OpenAI server mode. Then configure llmsearch to work with that endpoint, as I described in my previous message.

An alternative approach would be to use LiteLLM in OpenAI server mode + llmsearch configured for LiteLLM's endpoint. It supports vLLM and many other frameworks.
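
As a rough outline of that route (the commands and field names below are assumptions, not a verified setup):

```yaml
# Sketch of the LiteLLM-proxy route -- not a tested config.
#
# 1. Start vLLM's OpenAI-compatible server (default port 8000):
#      python -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1
# 2. Put the LiteLLM proxy in front of it (the model prefix and default port depend on
#    your LiteLLM version -- check the proxy startup output):
#      litellm --model openai/mistralai/Mixtral-8x7B-Instruct-v0.1 --api_base http://localhost:8000/v1
# 3. Point llmsearch at the LiteLLM endpoint instead of vLLM directly:
llm:
  type: openai
  params:
    model_kwargs:
      openai_api_base: "http://localhost:4000"   # LiteLLM proxy endpoint (port may differ)
      openai_api_key: "not-needed"
      model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
```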

mohammad-yousuf commented 4 months ago

Great. Thanks @snexus!

One more suggestion: adding Llamaparser for PDFs, as I heard it's very good for data like tables.

snexus commented 4 months ago

> One more suggestion: adding Llamaparser for PDFs, as I heard it's very good for data like tables.

Could you please open a separate issue for that? I will explore Llamaparser.

mohammad-yousuf commented 4 months ago

@snexus for sure.

snexus commented 4 months ago

Are we OK to close this, @mohammad-yousuf?