langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Not able to inference deepseek-coder-6.7b-instruct.Q5_K_M.gguf #14593

Closed Antriksh29071989 closed 6 days ago

Antriksh29071989 commented 8 months ago

System Info

OS: macOS (Apple M1 Max)


Name: langchain
Version: 0.0.349
Summary: Building applications with LLMs through composability
Home-page: https://github.com/langchain-ai/langchain
Author:
Author-email:
License: MIT
Requires: aiohttp, async-timeout, dataclasses-json, jsonpatch, langchain-community, langchain-core, langsmith, numpy, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by:

Who can help?

@hwchase17 @agola11

Information

Related Components

Reproduction

Steps to reproduce:

I have followed the instructions provided here: https://python.langchain.com/docs/integrations/llms/llamacpp, but I am not able to run inference correctly.

Model path : https://huggingface.co/TheBloke/deepseek-coder-6.7B-instruct-GGUF

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain, QAGenerationChain
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate(template=template, input_variables=["question"])

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
n_gpu_layers = 1  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

llm = LlamaCpp(
    model_path="../models/deepcoder-gguf/deepseek-coder-6.7b-instruct.Q2_K.gguf",
    n_gpu_layers=n_gpu_layers,
    max_tokens=2000,
    top_p=1,
    n_batch=n_batch,
    callback_manager=callback_manager,
    f16_kv=True,
    verbose=True,  # Verbose is required to pass to the callback manager
)

llm(
    "Question: Write python program to add two numbers ? Answer:"
) 

Result: < """"""""""""""""""""""/"

Requesting you to look into it. Please let me know in case you need more information. Thank you.

I have tried the same model file with llama-cpp-python package and it works as expected. Please find below the code that I have tried:

import json
import time
from llama_cpp import Llama
n_gpu_layers = 1  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512
llm = Llama(model_path="../models/deepcoder-gguf/deepseek-coder-6.7b-instruct.Q5_K_M.gguf" , chat_format="llama-2", n_gpu_layers=n_gpu_layers,n_batch=n_batch)

start_time = time.time()
pp = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are an python language assistant."},
          {
              "role": "user",
              "content": "Write quick sort ."
          }
      ])

end_time = time.time()
print("execution time:", {end_time - start_time})
print(pp["choices"][0]["message"]["content"])

Output :

## Quick Sort Algorithm in Python
Here is a simple implementation of the quicksort algorithm in Python:

```python
def partition(arr, low, high):
    i = (low-1)         # index of smaller element
    pivot = arr[high]     # pivot

    for j in range(low , high):
        if   arr[j] <= pivot:
            i += 1
            arr[i],arr[j] = arr[j],arr[i]

    arr[i+1],arr[high] = arr[high],arr[i+1]
    return (i+1)

def quickSort(arr, low, high):
    if low < high:
        pi = partition(arr,low,high)
        quickSort(arr, low, pi-1)
        quickSort(arr, pi+1, high)

# Test the code
n = int(input("Enter number of elements in array: "))
print("Enter elements: ")
arr = [int(input()) for _ in range(n)]
quickSort(arr,0,n-1)
print ("Sorted array is:")
for i in range(n):
    print("%d" %arr[i]),
This code first defines a helper function `partition()` that takes an array and two indices. It then rearranges the elements of the array so that all numbers less than or equal to the pivot are on its left, while all numbers greater than the pivot are on its right. The `quickSort()` function is then defined which recursively applies this partitioning process until the entire array is sorted.

The user can input their own list of integers and the program will output a sorted version of that list.
[/code]

Conclusion
In conclusion, Python provides several built-in functions for sorting lists such as `sort()` or `sorted()` but it's also possible to implement quick sort algorithm from scratch using custom function. This can be useful in situations where you need more control over the sorting process or when dealing with complex data structures.

Expected behavior

It should run inference with the model just like the native llama-cpp-python package does.

dosubot[bot] commented 8 months ago

🤖

Hello @Antriksh29071989! I'm Dosu, a friendly bot here to assist you with any bugs, questions, and help you contribute to the LangChain repository. While we wait for a human maintainer, I'm here to give you a hand. Let's tackle this together!

Thank you for providing detailed information about your issue. Based on the information you've provided and the context from similar issues, I have a few suggestions that might help.

Firstly, it seems like you're using the LlamaCpp class from the LangChain library to load the model and perform inference. In your code, you've set the f16_kv parameter to True. This parameter is used to determine whether to use half-precision for the key/value cache. Using half-precision can help reduce the memory footprint and potentially speed up computations, but at the cost of a slight decrease in precision. This might be causing the unexpected output you're seeing.

In a similar issue titled Output using llamacpp is garbage, the problem was resolved by changing the default value passed for f16_kv. You might want to try setting f16_kv to False in your code and see if that resolves the issue.

Secondly, you've set the n_gpu_layers parameter to 1. This parameter determines the number of layers to be loaded into GPU memory. Loading more layers into the GPU memory can potentially speed up the inference process, but it will also consume more GPU memory. If your GPU memory is not sufficient, it might cause issues during inference. You might want to try adjusting this parameter based on your GPU's VRAM pool.

Here's how you can modify your code:

llm = LlamaCpp(
    model_path="../models/deepcoder-gguf/deepseek-coder-6.7b-instruct.Q2_K.gguf",
    n_gpu_layers=n_gpu_layers,
    max_tokens=2000,
    top_p=1,
    n_batch=n_batch,
    callback_manager=callback_manager,
    f16_kv=False,  # Change this to False
    verbose=True,  # Verbose is required to pass to the callback manager
)

Please try these suggestions and let me know if they help. If the issue persists, it would be helpful if you could provide more information about your environment, such as the version of Python you're using, the version of the llama_cpp library, and the specifications of your GPU.
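
For reference, a quick way to collect those versions from the same Python environment (standard-library `importlib.metadata` only; package names as published on PyPI):

```python
import platform
from importlib.metadata import version

# Print the interpreter and the two relevant package versions for the bug report.
print("Python:", platform.python_version())
print("langchain:", version("langchain"))
print("llama-cpp-python:", version("llama-cpp-python"))
```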

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

Antriksh29071989 commented 8 months ago

Changed it to f16_kv=False. Still the same issue.

llm = LlamaCpp(
    model_path="..models/deepcoder-gguf/deepseek-coder-6.7b-instruct.Q2_K.gguf",
    n_gpu_layers=n_gpu_layers,
    max_tokens=2000,
    top_p=1,
    n_batch=n_batch,
    callback_manager=callback_manager,
    f16_kv=False,
    verbose=True,  # Verbose is required to pass to the callback manager
)

Output: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . @dosu-bot

TheAlexPG commented 8 months ago

same problem

dk-crazydiv commented 5 months ago

Same issue. Unable to run Deepseek-6.7 and TheBloke/CodeLlama-7B-Python-GGUF successfully. The models get loaded, but the output is garbage when called via `from langchain_community.llms import LlamaCpp`.

WizardCoder-7b is working fine though.

Sm1Ling commented 3 months ago

The reason is that LangChain's wrapper classes do not support instruction-tuned models, which require special tokens.

All the prompts you have in a chain are transformed into a plain string, and this string does not contain the proper special tokens for chat.

LangChain's LlamaCpp class calls llama_cpp.Llama's basic completion (__call__) method, not the "chat" version of it.

Solution: one would have to create a ChatLlamaCpp class that invokes the chat generation method.

One would also need to make it possible to pass message objects through the pipeline, rather than their parsed strings.
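
Until such a class exists, a workaround consistent with this explanation is to bake the chat template into the prompt string yourself before handing it to the completion-style LlamaCpp wrapper. Below is a minimal sketch, assuming the ### Instruction: / ### Response: template published on the DeepSeek Coder instruct model card (verify it against the card for your exact GGUF; the path and parameters are the reporter's and are illustrative only):

```python
from langchain_community.llms import LlamaCpp
from langchain_core.prompts import PromptTemplate

# Assumed instruct template from the DeepSeek Coder model card; without these
# markers the completion-style call tends to produce garbage output.
template = """You are an AI programming assistant.
### Instruction:
{question}
### Response:
"""
prompt = PromptTemplate.from_template(template)

llm = LlamaCpp(
    model_path="../models/deepcoder-gguf/deepseek-coder-6.7b-instruct.Q5_K_M.gguf",
    n_gpu_layers=1,
    n_batch=512,
    max_tokens=2000,
    verbose=True,
)

chain = prompt | llm  # format the instruct prompt, then run a plain completion
print(chain.invoke({"question": "Write a Python program to add two numbers."}))
```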

dosubot[bot] commented 3 months ago

To address the issue with using instruction-tuned models like deepseek-coder-6.7b-instruct.Q5_K_M.gguf with LangChain's LlamaCpp class, you can implement a ChatLlamaCpp class that invokes the chat generation method. Here is an example implementation:

import json
import logging
from typing import Any, Dict, List, Mapping, Optional, Tuple

from langchain.schema import ChatGeneration, ChatResult
from langchain_core.callbacks.manager import CallbackManagerForLLMRun
from langchain_core.language_models import BaseChatModel
from langchain_core.messages import AIMessage, BaseMessage, ChatMessage, FunctionMessage, HumanMessage, SystemMessage

logger = logging.getLogger(__name__)

def _convert_dict_to_message(_dict: Mapping[str, Any]) -> BaseMessage:
    role = _dict["role"]
    if role == "user":
        return HumanMessage(content=_dict["content"])
    elif role == "assistant":
        content = _dict.get("content") or ""
        if _dict.get("function_call"):
            _dict["function_call"]["arguments"] = json.dumps(_dict["function_call"]["arguments"])
            additional_kwargs = {"function_call": dict(_dict["function_call"])}
        else:
            additional_kwargs = {}
        return AIMessage(content=content, additional_kwargs=additional_kwargs)
    elif role == "system":
        return SystemMessage(content=_dict["content"])
    elif role == "function":
        return FunctionMessage(content=_dict["content"], name=_dict["name"])
    else:
        return ChatMessage(content=_dict["content"], role=role)

def _convert_message_to_dict(message: BaseMessage) -> dict:
    if isinstance(message, ChatMessage):
        message_dict = {"role": message.role, "content": message.content}
    elif isinstance(message, HumanMessage):
        message_dict = {"role": "user", "content": message.content}
    elif isinstance(message, AIMessage):
        message_dict = {"role": "assistant", "content": message.content}
        if "function_call" in message.additional_kwargs:
            message_dict["function_call"] = message.additional_kwargs["function_call"]
    elif isinstance(message, SystemMessage):
        message_dict = {"role": "system", "content": message.content}
    elif isinstance(message, FunctionMessage):
        message_dict = {"role": "function", "content": message.content, "name": message.name}
    else:
        raise ValueError(f"Got unknown type {message}")
    if "name" in message.additional_kwargs:
        message_dict["name"] = message.additional_kwargs["name"]
    return message_dict

class ChatLlamaCpp(BaseChatModel):
    """Chat model using the Llama C++ API."""

    client: Any  #: :meta private:

    def _generate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        message_dicts, params = self._create_message_dicts(messages, stop)
        _params = {"messages": message_dicts}
        final_params = {**params, **kwargs, **_params}
        response = self.client.run(final_params).json()
        return self._create_chat_result(response)

    def _create_message_dicts(
        self, messages: List[BaseMessage], stop: Optional[List[str]]
    ) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
        params = dict(self._client_params)
        if stop is not None:
            if "stop" in params:
                raise ValueError("`stop` found in both the input and default params.")
            params["stop"] = stop
        message_dicts = [_convert_message_to_dict(m) for m in messages]
        return message_dicts, params

    def _create_chat_result(self, response: Mapping[str, Any]) -> ChatResult:
        generations = []
        for res in response["choices"]:
            message = _convert_dict_to_message(res["message"])
            gen = ChatGeneration(
                message=message,
                generation_info=dict(finish_reason=res.get("finish_reason")),
            )
            generations.append(gen)
        return ChatResult(generations=generations)

    @property
    def _client_params(self) -> Mapping[str, Any]:
        """Get the parameters used for the client."""
        return {}

    @property
    def _llm_type(self) -> str:
        """Return type of chat model."""
        return "llama-cpp"

This ChatLlamaCpp class is designed to handle chat-specific interactions by converting messages to and from the required format and invoking the chat generation method of the Llama C++ API. Ensure that the client attribute is properly initialized to interact with the Llama C++ API [1][2].
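
For completeness, here is a minimal usage sketch under the assumption that client is initialized as a llama_cpp.Llama instance (the model path and settings below are the ones used earlier in this thread and are illustrative only):

```python
from llama_cpp import Llama
from langchain_core.messages import HumanMessage, SystemMessage

# Assumption: the same GGUF file and settings as the working llama-cpp-python
# example above; adjust model_path and n_gpu_layers for your machine.
client = Llama(
    model_path="../models/deepcoder-gguf/deepseek-coder-6.7b-instruct.Q5_K_M.gguf",
    chat_format="llama-2",
    n_gpu_layers=1,
    n_batch=512,
)

chat_model = ChatLlamaCpp(client=client)
result = chat_model.invoke(
    [
        SystemMessage(content="You are a Python language assistant."),
        HumanMessage(content="Write quick sort."),
    ]
)
print(result.content)  # the assistant's chat reply as plain text
```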

#### Details

**About Dosu**

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.