run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Using llama.cpp serve hosted model #13107

Closed. JulianOestreich90 closed this issue 1 month ago.

JulianOestreich90 commented 6 months ago

Question

I am using the llama.cpp HTTP server to serve a model with a Docker container:

docker run -d -p ipadress:8080:8080 -v ~/models:/models --gpus all \
    ghcr.io/ggerganov/llama.cpp:server-cuda \
    -m models/mistral-7b-instruct-v0.2.Q4_0.gguf -c 32000 \
    --host 0.0.0.0 --port 8080 --n-gpu-layers 99

I can talk with the model via curl, e.g.:

curl --request POST \
    --url http://ipadress:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'

And it returns a response.

However, integrating it into LlamaIndex doesn't seem so easy. I tried these two options:

import os
from llama_index.llms.openllm import OpenLLMAPI

os.environ["OPENLLM_ENDPOINT"] = "http://ipadress:8080"  # replace with your server address
remote_llm = OpenLLMAPI()

response = remote_llm.complete("Hello World!")
print(str(response))

##########################################################

from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(api_base="http://ipadress:8080")

response = llm.complete("Hello World!")
print(str(response))

The OpenLLMAPI option fails with AttributeError: 'HTTPStatusError' object has no attribute 'message', and the OpenAILike class also gives connection errors.

What can I do to connect properly to a self-hosted remote LLM?

dosubot[bot] commented 6 months ago

To address the issues you're facing with connecting to a self-hosted remote LLM using llama.cpp within a Docker container, here are some focused steps and clarifications:

  1. Error Handling Update: The AttributeError: 'HTTPStatusError' object has no attribute 'message' occurs because exception objects in Python 3 no longer carry a .message attribute. Instead of exception.message, use str(exception) or exception.args to access the error message. This adjustment applies to exception handling around both the OpenLLMAPI and OpenAILike classes.

  2. Using LlamaCPP with Docker: Ensure you have the llama-cpp-python package installed in your environment. This package is crucial for interfacing with llama.cpp models. If it's not installed, you can add it using pip install llama-cpp-python.

  3. Model Configuration: When setting up your model, you can specify either a model_path for a pre-downloaded model or a model_url for dynamic downloading. Adjust your Docker setup accordingly to ensure the model is accessible or the container can download the model.

  4. Text Generation Parameters: Customize parameters like temperature, max_new_tokens, and context_window to control the model's output. Use the LlamaCPP class to set these parameters and generate text (a sketch follows below this list).

  5. Docker Container Setup: Verify that your Docker container is correctly configured for model access (either through a local file or internet access for downloading) and that the llama-cpp-python package is installed within the container.

  6. Advanced Features: Explore LlamaCPP class's advanced features for applications like real-time chat interactions.

For detailed implementation, refer to the LlamaCPP class documentation and ensure your code is updated to handle exceptions correctly. This should help you successfully connect to and interact with your self-hosted LLM.
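
For illustration, here is a minimal sketch of the in-process LlamaCPP approach described above. Note that this is not the HTTP-server setup from the question; it requires llama-cpp-python to be installed locally, and the model path, context size, and generation parameters are assumptions to adapt to your environment.

from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    # Point at a local GGUF file (or pass model_url=... to download instead).
    model_path="/models/mistral-7b-instruct-v0.2.Q4_0.gguf",
    temperature=0.1,
    max_new_tokens=256,
    context_window=32000,
    # Forwarded to llama_cpp.Llama; offloads layers to the GPU like the docker command does.
    model_kwargs={"n_gpu_layers": 99},
    verbose=True,
)

response = llm.complete("Hello World!")
print(str(response))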

#### Details

**Sources**

- [llama-index-integrations/llms/llama-index-llms-openllm/llama_index/llms/openllm/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-openllm/llama_index/llms/openllm/base.py)
- [llama-index-integrations/llms/llama-index-llms-openai-like/llama_index/llms/openai_like/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-openai-like/llama_index/llms/openai_like/base.py)
- [llama-index-legacy/llama_index/legacy/llms/llama_cpp.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/llms/llama_cpp.py)
- [llama-index-legacy/llama_index/legacy/llms/openllm.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/llms/openllm.py)
- [llama-index-integrations/llms/llama-index-llms-llama-cpp/llama_index/llms/llama_cpp/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-llama-cpp/llama_index/llms/llama_cpp/base.py)
- [llama-index-legacy/llama_index/legacy/llms/openai_like.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/llms/openai_like.py)
**About Dosu** This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a πŸ‘ if this is helpful and πŸ‘Ž if it is irrelevant.

To continue the conversation, mention @dosu.

JulianOestreich90 commented 6 months ago

@dosu why should I install llama-cpp-python on my server? Does that mean LlamaIndex cannot handle the API exposed by the llama.cpp HTTP server? It is supposed to be an OpenAI-compatible API as well.

dosubot[bot] commented 6 months ago

The issue with the AttributeError: 'HTTPStatusError' object has no attribute 'message' you're encountering is due to the way exceptions are handled in the OpenLLMAPI class. Modern Python versions and HTTP client libraries don't expose a .message attribute on exception objects. Instead, use str(exception) or inspect the attached response (for example, exception.response.status_code) to get the error information. This adjustment is necessary for proper error handling in your code.

Regarding the OpenAILike class connection errors, these could stem from various configuration issues such as incorrect API keys, missing dependencies, or incompatible model metadata. Ensure all configurations and dependencies are correctly set up.

For your question about installing llama-cpp-python on your server: the LlamaCPP class within llama_index is designed to load and run the model in-process via the llama-cpp-python library, not to talk to an HTTP server API. This direct integration requires the llama-cpp-python package to be installed on your server for the LlamaCPP class to function. It uses the Llama class from the llama_cpp library for model initialization, completions, and other tasks, which is why the package is necessary even though the llama.cpp HTTP server provides an OpenAI-compatible API.

#### Details

**Sources**

- [Integration with llama.cpp](https://github.com/run-llama/llama_index/issues/898)
- [llama-index-legacy/llama_index/legacy/llms/llama_cpp.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/llms/llama_cpp.py)
- [llama-index-integrations/llms/llama-index-llms-llama-cpp/llama_index/llms/llama_cpp/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-llama-cpp/llama_index/llms/llama_cpp/base.py)
- [llama-index-integrations/llms/llama-index-llms-openllm/llama_index/llms/openllm/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-openllm/llama_index/llms/openllm/base.py)
- [llama-index-integrations/llms/llama-index-llms-openai-like/llama_index/llms/openai_like/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-openai-like/llama_index/llms/openai_like/base.py)
- [llama-index-legacy/llama_index/legacy/llms/openllm.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/llms/openllm.py)
- [llama-index-legacy/llama_index/legacy/llms/openai_like.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/llms/openai_like.py)
- [llama-index-integrations/llms/llama-index-llms-llama-cpp/pyproject.toml](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-llama-cpp/pyproject.toml)
- [llama-index-integrations/llms/llama-index-llms-llama-cpp/BUILD](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-llama-cpp/BUILD)

RussellLuo commented 6 months ago

Hi @JulianOestreich90, in addition to the API you used, llama.cpp also supports an OpenAI-compatible Chat Completions API (search for POST /v1/chat/completions). If you use that API, I guess the OpenAI class should suffice.
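
Along those lines, the earlier OpenAILike attempt can presumably be pointed at that endpoint. A rough sketch, where the /v1 base path, the dummy api_key, and the model label are assumptions to adapt:

from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    model="mistral-7b-instruct-v0.2",    # free-form label; llama.cpp serves whatever model it loaded
    api_base="http://ipadress:8080/v1",  # /v1 prefix for the OpenAI-compatible routes
    api_key="not-needed",                # the client expects a non-empty key even if the server ignores it
    is_chat_model=True,                  # route requests to the chat completions endpoint
)

response = llm.complete("Hello World!")
print(str(response))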

jggc commented 4 months ago

Thanks for the suggestion of OpenAI; it did work for me, although I had to mess with the parameters a bit. I ended up with:

llm = OpenAI(api_key="somestring", api_base="http://192.168.12.132:3300")

Leaving the API key empty breaks it at the moment; the client tries to authenticate with an invalid request.

See this gist for my full RAG script using a single RTX 3070 running Llama 8B served with the llama.cpp server. I had to put the embedding model on the CPU since my GPU does not have enough memory.
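
For context, a rough sketch of that kind of setup (not the gist's exact code; the embedding model name is an assumption, and llama-index-embeddings-huggingface must be installed):

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# LLM requests go to the remote llama.cpp server; a dummy key keeps the client happy.
Settings.llm = OpenAI(api_key="somestring", api_base="http://192.168.12.132:3300")

# Run the embedding model on CPU to leave GPU memory for the served LLM.
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    device="cpu",
)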

It's quite satisfying to watch. I randomly picked a few Wikipedia articles about giraffes and related concepts for my test; I put the result in a gist comment.

https://gist.github.com/jggc/5ba572a4753bc851e6d33d07e82b39f4