run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: RAG on persistent VectoreStore with local Huggingface LLM #13478

Closed blackjackcola closed 1 week ago

blackjackcola commented 3 months ago

Question

Hello! I'm currently trying to use an open-source LLM for my RAG application with LlamaIndex. While it works perfectly fine for the ingestion process, I can't get it working for retrieval on a persistent vector index: LlamaIndex ignores my LLM setting and tries to default to OpenAI. I'm running a local LLM on GPU. Does anyone know a quick fix for this specific issue? I'm not a machine-learning professional, so I was hoping somebody has already run into this and solved it. Thank you!

dosubot[bot] commented 3 months ago

To address the issue where LlamaIndex defaults to OpenAI instead of your local model in your RAG application, make sure the local model is specified explicitly in your configuration. For the embedding model, you can pass a string that starts with 'local:' followed by the model name. For example, if your model's name is my_local_model, you would use resolve_embed_model(embed_model='local:my_local_model').

Make sure your local model is supported by checking if it's compatible with the expected model types in LlamaIndex. The system either expects an instructor model listed in INSTRUCTOR_MODELS or will default to attempting to load it as a HuggingFace model. If your model doesn't fit these criteria, LlamaIndex might not recognize it as intended.

Additionally, review your environment setup to ensure it's not inadvertently configured to default to OpenAI's model. This includes checking environment variables that might influence the model selection process.
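For instance, a minimal check with plain Python (nothing LlamaIndex-specific, just an illustration) can tell you whether an OpenAI key is present in the environment and might be picked up by the defaults:

```python
import os

# If an OpenAI key is present, either unset it or make sure every component
# (LLM and embedding model) is explicitly pointed at your local models.
if os.environ.get("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is set - defaults may resolve to OpenAI unless overridden.")
else:
    print("No OPENAI_API_KEY found - any accidental OpenAI default will fail fast.")
```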

For GPU utilization, LlamaIndex offers an Nvidia TensorRT integration for efficient inference on Nvidia GPUs. Ensure your setup meets the necessary dependencies and environment requirements for running local LLMs on a GPU.

Here's a quick example of how you might adjust your code:

from llama_index.core.embeddings import resolve_embed_model  # import path may vary by version

def initialize_application():
    try:
        # 'local:<model_name>' asks LlamaIndex to load a local embedding model
        embed_model = resolve_embed_model(embed_model='local:my_local_model')
        # Use embed_model for your application
    except ValueError as e:
        print(f"Failed to initialize the local embedding model: {e}")

Replace 'my_local_model' with the actual name of your local model. This ensures you're explicitly requesting your local model and helps in troubleshooting if the model isn't recognized.
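Note that resolve_embed_model only covers the embedding side; the LLM used to answer queries must also be pointed at a local model, for example through the global Settings object. A minimal sketch, assuming the llama_index.core Settings API and the HuggingFace integrations used later in this thread:

```python
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM

# Point both the LLM and the embedding model at local HuggingFace models
# *before* building or loading any index, so nothing falls back to OpenAI.
Settings.llm = HuggingFaceLLM(
    model_name="meta-llama/Meta-Llama-3-8B",        # example model; swap in your own
    tokenizer_name="meta-llama/Meta-Llama-3-8B",
    device_map="cuda",
)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
```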

If you continue to face issues or have any further questions, feel free to ask for more assistance.

#### Details

**Sources**
- [How to use PandasQueryEngine with other LLM](https://github.com/run-llama/llama_index/discussions/9619)
- [Is there a way to wrap LlamaIndex?](https://github.com/run-llama/llama_index/discussions/8772)
- [[Question]: When customize the LLM, using VectorStoreIndex, it reports error](https://github.com/run-llama/llama_index/discussions/8368)
- [[Question]: How to use custom model with VectorStoreIndex?](https://github.com/run-llama/llama_index/discussions/8398)
- [[Tracking]: Better support for Local Models](https://github.com/run-llama/llama_index/issues/6795)
- [[Feature Request]: example for `LocalAI`](https://github.com/run-llama/llama_index/issues/7907)
- [[Question]: How to use custom model with VectorStoreIndex?](https://github.com/run-llama/llama_index/issues/7489)
- [[Question]: Asks for Open AI API key when using local LLM model and embeddings model](https://github.com/run-llama/llama_index/issues/7259)
- [How can I search for local documents without using OpenAI?](https://github.com/run-llama/llama_index/issues/2240)
- [Querying Index using local models with no access to download from the hub(s)](https://github.com/run-llama/llama_index/issues/3283)
- [llama-index-legacy/llama_index/legacy/embeddings/utils.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/embeddings/utils.py)
- [llama-index-legacy/llama_index/legacy/llms/nvidia_tensorrt.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/llms/nvidia_tensorrt.py)
- [llama-index-integrations/llms/llama-index-llms-nvidia-tensorrt/llama_index/llms/nvidia_tensorrt/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-nvidia-tensorrt/llama_index/llms/nvidia_tensorrt/base.py)
- [llama-index-integrations/llms/llama-index-llms-vllm/pyproject.toml](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-vllm/pyproject.toml)
- [llama-index-integrations/llms/llama-index-llms-localai/pyproject.toml](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-localai/pyproject.toml)
**About Dosu**

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

logan-markewich commented 3 months ago

@blackjackcola can you share some code? You'll need to make sure you set up both an embed model and an LLM.

blackjackcola commented 3 months ago

Yes of course.

Context: So far I've also tried setting the variables globally, but that didn't help either. What I don't understand is that it works fine for ingestion but not for the query/chat engine. I've also tried setting the embed model to 'local', even though I don't really need it since I'm not using node post-processing at the moment. There's probably a dumb mistake I'm making. Appreciate any kind of help!

CODE:

import logging
import sys

import torch
from transformers import AutoTokenizer, BitsAndBytesConfig

from llama_index.core.settings import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.huggingface import HuggingFaceLLM

# Configure logging
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# Configuration for model quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Initialize HuggingFace LLM with specific settings
llm = HuggingFaceLLM(
    model_name="meta-llama/Meta-Llama-3-8B",
    tokenizer_name="meta-llama/Meta-Llama-3-8B",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"quantization_config": quantization_config},
    generate_kwargs={"temperature": 0.1, "top_k": 50, "top_p": 0.95},
    device_map="cuda",
)

embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-large-en-v1.5", embed_batch_size=100
)

# Load the existing index
storage_context_vector = StorageContext.from_defaults(persist_dir="./Tiny_Data_embed")
index = load_index_from_storage(storage_context=storage_context_vector)

# Set global settings
Settings.llm = llm
Settings.embed_model = embed_model

# Initialize chat memory buffer
memory = ChatMemoryBuffer.from_defaults(token_limit=4090)

# Main function
if __name__ == "__main__":
    logging.info("Loading existing embeddings and preparing for querying...\n")

    primary_chat_engine = index.as_chat_engine(
        chat_mode="context",
        memory=memory,
        llm=llm,
        system_prompt=(
            "You are a chatbot specifically designed to create Exudyn Code for Multibodysystem simulation:"
            "Here are the relevant documents for the context:\n"
            "{context_str}"
            "\nInstruction: Use the previous chat history, or the context above, to interact and help the user."
        ),
        verbose=False,
    )

    while True:
        query = input("Enter your query (or type 'quit' to stop): ").strip()
        if query.lower() == "quit":
            print("Exiting the query engine.")
            break

        try:
            response = primary_chat_engine.chat(query)
        except ValueError as e:
            if "Optimizer returned zero sentences" in str(e):
                logging.info(
                    "Optimizer found zero relevant sentences. Initiating fallback chat engine..."
                )
                response = llm.complete(query)
            else:
                raise  # Re-raise the exception if it's not the one we're looking for
        except Exception as e:
            logging.error(f"Unexpected error occurred: {str(e)}")
            response = "An error occurred, please try again later."

        print("Response:", response)
logan-markewich commented 3 months ago

@blackjackcola you should set your global Settings as early as possible in your code.

Since you loaded the index before setting them, the index was loaded with the OpenAI defaults.
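In other words, the order matters: set the global defaults first, then load the persisted index. A minimal sketch of the reordering, reusing the llm and embed_model objects defined in the code above:

```python
from llama_index.core import Settings, StorageContext, load_index_from_storage

# Configure global defaults *before* touching the persisted index,
# so the loaded index picks up the local models instead of the OpenAI defaults.
Settings.llm = llm                  # HuggingFaceLLM defined earlier
Settings.embed_model = embed_model  # HuggingFaceEmbedding defined earlier

storage_context_vector = StorageContext.from_defaults(persist_dir="./Tiny_Data_embed")
index = load_index_from_storage(storage_context=storage_context_vector)
```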

blackjackcola commented 3 months ago

@logan-markewich how did I not think of that?! Thank you for your help!