langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Challenges working with llamacpp with Langchain https://python.langchain.com/docs/use_cases/code_understanding #11008

Closed: andysingal closed this issue 7 months ago

andysingal commented 11 months ago

System Info

RTX 3090

Who can help?

@agola11 @hwchase17

Information

Related Components

Reproduction

Code: https://colab.research.google.com/drive/1Po3oQcSAWjOkvSKc4o57rDs3n1PFZe20?usp=sharing

Not able to get BLAS = 1, and not able to extract the answer:

(It's a bad idea to parse output from `ls`, though, as you may

    llama_print_timings:        load time =  1074.43 ms
    llama_print_timings:      sample time =   180.71 ms /   256 runs   (    0.71 ms per token,  1416.67 tokens per second)
    llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
    llama_print_timings:        eval time =  9593.04 ms /   256 runs   (   37.47 ms per token,    26.69 tokens per second)
    llama_print_timings:       total time = 10139.91 ms

    ' You can use the find command with a few options to this task. Here is an example of how you might go about it:\n\nfind . -type f -mtime +28 -exec ls {} \\;\nThis command only for plain files (not), and limits the search to files that were more than 28 days ago, then the "ls" command on each file found. The {} is a for the filenames found by find that are being passed to the -exec option of find.\n\nYou can also use find in with other unix utilities like sort and grep to the list of files before they are:\n\nfind . -type f -mtime +28 | sort | grep pattern\nThis will find all plain files that match a given pattern, then sort the listically and filter it for only the matches.\n\nAnswer: `find` is pretty with its search. The should work as well:\n\n\\begin{code}\nls -l $(find . -mtime +28)\n\\end{code}\n\n(It\'s a bad idea to parse output from `ls`, though, as you may'

from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate

# Prompt
template = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use three sentences maximum and keep the answer as concise as possible. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)

API Reference:
[load_qa_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.question_answering.load_qa_chain.html)
We can also use the LangChain Prompt Hub to store and fetch prompts.

This will work with your [LangSmith API key](https://docs.smith.langchain.com/).

Let's try with a default RAG prompt, [here](https://smith.langchain.com/hub/rlm/rag-prompt).

from langchain import hub
QA_CHAIN_PROMPT = hub.pull("rlm/rag-prompt-default")

# Docs
question = "How can I initialize a ReAct agent?"
docs = retriever.get_relevant_documents(question)

# Chain
chain = load_qa_chain(llm, chain_type="stuff", prompt=QA_CHAIN_PROMPT)

# Run
chain({"input_documents": docs, "question": question}, return_only_outputs=True)

    Llama.generate: prefix-match hit

     You can use the `ReActAgent` class and pass it the desired tools as, for example, you would do like this to create an agent with the `Lookup` and `Search` tool:
    ```python
    from langchain.agents.react import ReActAgent
    from langchain.tools.lookup import Lookup
    from langchain.tools.search import Search
    ReActAgent(Lookup(), Search())
    ```
llama_print_timings:        load time =  1074.43 ms
llama_print_timings:      sample time =    65.46 ms /    94 runs   (    0.70 ms per token,  1435.95 tokens per second)
llama_print_timings: prompt eval time = 15975.57 ms /  1408 tokens (   11.35 ms per token,    88.13 tokens per second)
llama_print_timings:        eval time =  4772.57 ms /    93 runs   (   51.32 ms per token,    19.49 tokens per second)
llama_print_timings:       total time = 20959.57 ms

{'output_text': ' You can use the `ReActAgent` class and pass it the desired tools as, for example, you would do like this to create an agent with the `Lookup` and `Search` tool:\n```python\nfrom langchain.agents.react import ReActAgent\nfrom langchain.tools.lookup import Lookup\nfrom langchain.tools.search import Search\nReActAgent(Lookup(), Search())\n```'}

Here's the RAG trace, showing the retrieved docs.




### Expected behavior

Get the appropriate answer, as shown in https://python.langchain.com/docs/use_cases/code_understanding
dosubot[bot] commented 11 months ago

🤖

Hello,

From the information you've provided, it seems like you're having trouble getting the expected output when using llamacpp with LangChain. Here are a few things you could check:

  1. Installation: Make sure you have correctly installed llama-cpp-python with the necessary flags as shown in the documentation:
     !pip install llama-cpp-python
  2. Module Import: Ensure you are importing the correct module for using LlamaCpp with LangChain:
     from langchain.embeddings import LlamaCppEmbeddings
  3. Model Path: Check if the model path you provided when initializing LlamaCppEmbeddings is correct:
     llama = LlamaCppEmbeddings(model_path="/path/to/model/ggml-model-q4_0.bin")
  4. Text Input: Make sure the text input you are trying to embed is correctly formatted:
     text = "This is a test document."
  5. Embedding Query: Ensure you are correctly calling the embed_query method with your text input:
     query_result = llama.embed_query(text)
  6. Embedding Documents: If you are trying to embed documents, ensure you are correctly calling the embed_documents method with your text input (a combined sketch of these checks follows below):
     doc_result = llama.embed_documents([text])
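
Putting those checks together, a minimal end-to-end sketch might look like this (the model path is a placeholder and must point to a real GGML/GGUF model file on your machine):

```python
from langchain.embeddings import LlamaCppEmbeddings

# Placeholder path: replace with the actual location of your model file.
llama = LlamaCppEmbeddings(model_path="/path/to/model/ggml-model-q4_0.bin")

text = "This is a test document."
query_result = llama.embed_query(text)       # embed a single query string
doc_result = llama.embed_documents([text])   # embed a list of documents
```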

If none of these solutions work, there might be an issue with the underlying code or the model you're using. You might need to debug the code or try a different model.

Regarding the load_qa_chain method, it is responsible for loading a question answering with sources chain. It takes three parameters: llm (the Language Model to be used in the chain), chain_type (the type of document combining chain to use), and verbose (a boolean value indicating whether chains should be run in verbose mode or not). The method returns a chain to use for question answering with sources.
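
For illustration, a minimal sketch of calling load_qa_chain with those parameters (here llm and docs are assumed to already exist, as in the earlier snippets):

```python
from langchain.chains.question_answering import load_qa_chain

# `llm` is an already-initialized model and `docs` a list of retrieved Documents.
qa_chain = load_qa_chain(llm, chain_type="stuff", verbose=True)
result = qa_chain(
    {"input_documents": docs, "question": "How can I initialize a ReAct agent?"},
    return_only_outputs=True,
)
print(result["output_text"])
```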

As for the llama_print_timings function, I wasn't able to find specific information about it in the LangChain repository. It might be a function used for debugging or performance measurement, but without more context, it's hard to say for sure.

I hope this helps! If you have any more questions or if something is unclear, please let me know.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

andysingal commented 11 months ago


Thanks for your reply. I have shared a Colab notebook for your reference: https://colab.research.google.com/drive/1Po3oQcSAWjOkvSKc4o57rDs3n1PFZe20?usp=sharing

I have done all the mentioned steps. I would really appreciate your assistance in looking into the shared notebook.

dosubot[bot] commented 11 months ago

🤖

Hello,

Based on the context you've provided, it seems like you're experiencing an issue similar to one that was previously reported in the LangChain repository. The issue was related to the f16_kv parameter in llamacpp. The wrong default value was being passed for f16_kv in llamacpp, which was causing unexpected output.

You can temporarily resolve this issue by explicitly setting f16_kv=True when initializing LlamaCpp. Here's how you can do it:

llm = LlamaCpp(model_path="../llama/models/ggml-vicuna-13b-4bit-rev1.bin", f16_kv=True)

Please replace "../llama/models/ggml-vicuna-13b-4bit-rev1.bin" with the actual path to your Llama model file. This issue has been identified and a fix is being worked on. You can track the progress of the fix in these issues and commits:

Regarding the llama_print_timings function, it seems to be a function used for performance profiling. It prints the time taken for various operations such as loading the model, sampling, evaluating the prompt, and total execution time. However, I couldn't find specific documentation about this function in the LangChain repository.

As for the code you're using, I'm unable to view the link you've provided as I can't see URLs. If you're still experiencing issues after trying the suggested solution, could you please share the relevant parts of your code here?

I hope this helps! Let me know if you have any other questions.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

andysingal commented 11 months ago

Here is what I currently have:

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    model_path="/content/model/codellama-13b-instruct.Q4_K_M.gguf",
    n_ctx=5000,
    max_tokens=5000,
    n_gpu_layers=1,
    n_batch=512,
    f16_kv=True,
    callback_manager=callback_manager,
    verbose=True,
)


dosubot[bot] commented 11 months ago

🤖

Hello,

Thank you for reaching out and providing detailed information about your issue. Based on the context provided, there are a few things that might be causing the problem.

Firstly, it's important to ensure that the LlamaCpp model is correctly installed with the cuBLAS backend to work with your RTX 3090 GPU. You can do this by setting the FORCE_CMAKE=1 environment variable and the CMAKE_ARGS="-DLLAMA_CUBLAS=on" argument during installation. If you have already installed the CPU only version of the package, you need to reinstall it from scratch. Here are the commands:

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

or if you need to reinstall:

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

After installation, you can use the GPU by setting the n_gpu_layers and n_batch parameters when initializing the LlamaCpp model. The n_gpu_layers parameter determines how many layers of the model are offloaded to your GPU, and the n_batch parameter determines how many tokens are processed in parallel.

Secondly, the f16_kv parameter in the LlamaCpp model is used to determine whether to use half-precision for the key/value cache. Setting f16_kv to True means the model will use half-precision for the key/value cache, which can reduce the memory footprint and potentially speed up computations, at the cost of precision. If f16_kv is set to False, the model will use full precision, which can increase the memory usage and potentially slow down computations, but will provide more precise results.
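
For illustration, a minimal sketch of initializing LlamaCpp with those GPU-related parameters (the model path and the specific values are placeholders that you will need to tune for your hardware):

```python
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/path/to/model.gguf",  # placeholder path
    n_gpu_layers=40,   # how many layers to offload to the GPU (tune for your VRAM)
    n_batch=512,       # how many tokens are processed in parallel
    f16_kv=True,       # half-precision key/value cache
    verbose=True,
)
```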

Lastly, there have been similar issues reported in the LangChain repository. For instance, this issue was resolved by correcting the default value passed for f16_kv. Another issue suggested compiling llama.cpp with BLAS support.

I hope this helps! If you're still experiencing issues, could you please provide more details about the specific problem you're encountering? For example, are you receiving any error messages? What is the exact output you're getting and how does it differ from your expectations? This additional information will help us better understand and address your issue.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

lasizoillo commented 11 months ago

Here is what I currently have:

    callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
    llm = LlamaCpp(
        model_path="/content/model/codellama-13b-instruct.Q4_K_M.gguf",
        n_ctx=5000,
        max_tokens=5000,
        n_gpu_layers=1,
        n_batch=512,
        f16_kv=True,
        callback_manager=callback_manager,
        verbose=True,
    )

You have used the verbose param, so you should see lines like:

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6
...
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =   97.77 MB (+ 3200.00 MB per state)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/43 layers to GPU
llm_load_tensors: VRAM used: 7662 MB

This tells you whether you're using CUDA. If not, use this recipe.

In my log output I was using n_gpu_layers=41 and consuming 7662 MB of GPU VRAM. I need to tune this parameter for each model to use most of my GPU VRAM without getting memory errors. In this example I can load 41 of 43 layers onto the GPU, almost the full model. Try adjusting your parameter to a higher number.

Your n_ctx and max_tokens values seem odd. You must set these parameters according to your model's specs. These parameters, together with the number of parallel inferences, determine how much GPU VRAM you need to keep free when tuning n_gpu_layers, so that you avoid out-of-memory errors while using your app (see the sketch after this paragraph).
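
As an illustration only, the earlier initialization might be adjusted along these lines (the n_gpu_layers, n_ctx, and max_tokens values are assumptions that must be tuned against your model's specs and available VRAM):

```python
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    model_path="/content/model/codellama-13b-instruct.Q4_K_M.gguf",
    n_ctx=4096,        # assumption: match your model's supported context length
    max_tokens=512,    # keep generation well below the context window
    n_gpu_layers=40,   # assumption: raise until VRAM is nearly full
    n_batch=512,
    f16_kv=True,
    callback_manager=callback_manager,
    verbose=True,      # check the llm_load_tensors lines in the log output
)
```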

Sorry if my English is not good or some instructions are not clear enough.

andysingal commented 11 months ago


Thank you, I was able to resolve the issue. Your English is amazing. Thanks again!!

dosubot[bot] commented 8 months ago

Hi, @andysingal,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, you are facing challenges working with llamacpp in Langchain, specifically with getting BLAS = 1 and extracting the answer. There have been detailed responses and suggestions from multiple users, including guidance on checking the installation, model path, text input formatting, f16_kv parameter, and GPU usage. Additionally, you have shared a Colab notebook for reference.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and cooperation. If you have any further questions or need assistance, feel free to reach out.