Hello @bloodsucker99, I am not sure that's possible on the server side since models have different prompts. So it might make sense to implement this on the client side, which converts the OpenAI schema (list of dicts) into a single prompt.
Yeah, I kind of suspected that doing it on the server side would not be possible :(
Any chance you/anyone would be interested in building a middleman for this? A Python wrapper that just sits in the middle would be cool.
I had some time and started working on something. I will share the first version here. I would love to get feedback if you are willing to try it out.
I'd love to try it out. Also, is it possible to communicate over Discord? It would make things much easier :)
Okay, I rushed out the first version. It is in a package I started called easyllm.
Github: https://github.com/philschmid/easyllm Documentation: https://philschmid.github.io/easyllm/
The documentation also includes examples for streaming
Install EasyLLM via pip:
pip install easyllm
Then import and start using the clients:
from easyllm.clients import huggingface
from easyllm.prompt_utils import build_llama2_prompt
# helper to build llama2 prompt
huggingface.prompt_builder = build_llama2_prompt
response = huggingface.ChatCompletion.create(
model="meta-llama/Llama-2-70b-chat-hf",
messages=[
{"role": "system", "content": "\nYou are a helpful assistant speaking like a pirate. argh!"},
{"role": "user", "content": "What is the sun?"},
],
temperature=0.9,
top_p=0.6,
max_tokens=256,
)
print(response)
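For the streaming examples mentioned in the docs, here is a minimal sketch of what usage might look like, assuming easyllm mirrors the OpenAI chunk schema when stream=True is passed (check the linked documentation for the exact signature):
from easyllm.clients import huggingface
from easyllm.prompt_utils import build_llama2_prompt

# reuse the llama2 prompt builder from above
huggingface.prompt_builder = build_llama2_prompt

# stream=True is assumed to yield OpenAI-style chunks with a "delta" field
for chunk in huggingface.ChatCompletion.create(
    model="meta-llama/Llama-2-70b-chat-hf",
    messages=[{"role": "user", "content": "What is the sun?"}],
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)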
This is interesting. Could you give me an example of connecting this to the TGI API? There is a model space, but that would be loading the model again, right? So instead, if I am using TGI, which already has the model loaded, how would I use its API here and get an OpenAI API out?
No, it's a client. How would you add a wrapper when you don't know the prompt format on the server side?
It might be possible to write a different server.rs which implements common templating, where you could define what you want when starting it, but that's a lot of work.
Hi! I am a bit confused about what you mean by "you don't know the prompt format on the server side". There is a wrapper made by langchain: https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/llms/huggingface_text_gen_inference.py
I was kind of thinking along those lines, but for the OpenAI API, if that makes sense.
"you don't know the prompt format on the server side"
I think what @philschmid meant is: how are you supposed to send a final, fully formed token sequence? TGI doesn't know how the model was trained/fine-tuned, so it doesn't know what a system prompt or user_prompt is. It expects a single full string, which is what the langchain wrapper sends.
So the missing step is going from
messages=[
{"role": "system", "content": "\nYou are a helpful assistant speaking like a pirate. argh!"},
{"role": "user", "content": "What is the sun?"},
],
To:
[[SYS]\nYou are a helpful assistant speaking like a pirate. argh[/SYS] What is the sun <s>
Which is needed for good results with https://huggingface.co/meta-llama/Llama-2-7b-chat-hf, for instance (don't quote me on the prompt, I wrote it from memory).
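To make that missing step concrete, here is a minimal sketch that flattens OpenAI-style messages into a single Llama-2 chat string and posts it to TGI's /generate route. The build_llama2_prompt helper below is illustrative (the template follows the published Llama-2 chat format, not necessarily what any given fine-tune expects), while the /generate payload shape matches TGI's documented API:
import requests

def build_llama2_prompt(messages):
    """Flatten OpenAI-style messages into one Llama-2 chat prompt string (illustrative)."""
    system = ""
    prompt = ""
    for message in messages:
        if message["role"] == "system":
            # the system prompt is wrapped and prepended to the first user turn
            system = f"<<SYS>>\n{message['content']}\n<</SYS>>\n\n"
        elif message["role"] == "user":
            prompt += f"<s>[INST] {system}{message['content']} [/INST]"
            system = ""
        elif message["role"] == "assistant":
            prompt += f" {message['content']} </s>"
    return prompt

payload = {
    "inputs": build_llama2_prompt([
        {"role": "system", "content": "You are a helpful assistant speaking like a pirate. argh!"},
        {"role": "user", "content": "What is the sun?"},
    ]),
    "parameters": {"max_new_tokens": 256, "temperature": 0.9, "top_p": 0.6},
}
response = requests.post("http://localhost:8080/generate", json=payload)
print(response.json()["generated_text"])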
I agree with @Narsil's point. Some people or projects don't use OpenAI-style prompts. Eventually, all messages will be merged into a single string as input to the LLM, limiting flexibility. One possible solution is to create an API template on the server side, allowing users to define their preferred API. However, implementing this approach might require a substantial amount of work and could potentially introduce bugs.
I have a question: Why is the TGI API slightly different from the TGI client SDK? For instance, the parameter 'detail' is ignored in the TGI client source code. Shouldn't they be exactly the same?
Why is the TGI API slightly different from the TGI client SDK?
I'm not sure what you are referring to. The Python client and the server could be slightly out of sync, but that's not intentional.
One possible solution is to create an API template on the server side
That's definitely an option. If we were to do it, I would like it alongside guidance and token healing, since they seem to serve the same purpose: extending the querying API in a user-defined way (for both the server user and the actual querying user).
I've implemented a small wrapper around the chat completions for llama2.
The easyllm package from @philschmid seems good, and I've compared it with my implementation for llama2 and it gives the same result!
Why is the TGI API slightly different from the TGI client SDK?
I'm not sure what you are referring to. The Python client and the server could be slightly out of sync, but that's not intentional.
Here is what I'm referring to. The request parameters are slightly different from the ones in the API. It's OK, but why is 'detail' manually set to True here?
In case this is helpful, llama.cpp does this via api_like_OAI.py. This PR would update that script to use fastchat's conversation.py to handle the serialization problem discussed upthread.
And here is fastchat's own version of this: https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_server.py
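For reference, a short sketch of how FastChat's conversation templates can be used for exactly this serialization step (assuming a recent FastChat version where get_conv_template and set_system_message exist; the template name "llama-2" may differ between releases):
from fastchat.conversation import get_conv_template

# build a Llama-2 style prompt string from chat turns using FastChat's template registry
conv = get_conv_template("llama-2")
conv.set_system_message("You are a helpful assistant speaking like a pirate. argh!")
conv.append_message(conv.roles[0], "What is the sun?")  # user turn
conv.append_message(conv.roles[1], None)                # empty slot for the model's reply
prompt = conv.get_prompt()
print(prompt)  # single string, ready for TGI's /generate endpoint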
I would love to have this to be supported.
LiteLLM has support for TGI: https://docs.litellm.ai/docs/providers/huggingface#text-generation-interface-tgi---llms
Thanks for mentioning us @abhinavkulkarni
Hey @Narsil @jcushman @zfang
Happy to help here.
This is the basic code:
import os
from litellm import completion
# [OPTIONAL] set env var
os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key"
messages = [{ "content": "There's a llama in my garden 😱 What should I do?","role": "user"}]
# e.g. Call 'WizardLM/WizardCoder-Python-34B-V1.0' hosted on HF Inference endpoints
response = completion(model="huggingface/WizardLM/WizardCoder-Python-34B-V1.0", messages=messages, api_base="https://my-endpoint.huggingface.cloud")
print(response)
We also handle prompt formatting - https://docs.litellm.ai/docs/providers/huggingface#models-with-prompt-formatting based on the lmsys/fastchat implementation.
But you can overwrite this with your own changes if necessary - https://docs.litellm.ai/docs/providers/huggingface#custom-prompt-templates
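As a hedged illustration of that custom-template option (field names taken from the linked litellm docs; double-check the exact keys against your installed version):
import litellm

# register a Llama-2 style template for a TGI-hosted model
litellm.register_prompt_template(
    model="huggingface/meta-llama/Llama-2-70b-chat-hf",
    roles={
        "system": {"pre_message": "[INST] <<SYS>>\n", "post_message": "\n<</SYS>>\n"},
        "user": {"pre_message": "", "post_message": " [/INST]"},
        "assistant": {"pre_message": "", "post_message": "</s>"},
    },
)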
Hi @krrishdholakia,
Thanks for the info. Instead of a client, I actually need a middle service, because I'm trying to host an API server for the chatbot arena https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model
I can use vLLM to host a service and provide an OpenAI-compatible API, but it's quite a bit slower than TGI. It pains me that TGI doesn't support this. I will probably need to hack a FastChat service to redirect calls to TGI.
Regards,
Felix
@zfang we have an open-source proxy you can fork and run this through - https://github.com/BerriAI/liteLLM-proxy
Would it be helpful if we exposed a CLI command to deploy this through?
litellm --deploy
LiteLLM has developed an OpenAI wrapper for TGI (and for lots of other model-serving frameworks).
Here are more details: https://docs.litellm.ai/docs/proxy_server
You can set it up as follows:
Set up a local TGI endpoint first:
$ text-generation-launcher \
--model-id abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq \
--trust-remote-code --port 8080 \
--max-input-length 5376 --max-total-tokens 6144 --max-batch-prefill-tokens 6144 \
--quantize awq
Then I run a LiteLLM proxy server on top of that:
$ litellm \
--model huggingface/abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq \
--api_base http://localhost:8080
I am able to successfully obtain responses from the openai.ChatCompletion.create endpoint as follows:
>>> import openai
>>> openai.api_key = "xyz"
>>> openai.api_base = "http://0.0.0.0:8000"
>>> model = "huggingface/abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq"
>>> completion = openai.ChatCompletion.create(model=model, messages=[{"role": "user", "content": "How are you?"}])
>>> print(completion)
{
"object": "chat.completion",
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"content": "I'm fine, thanks. I'm glad to hear that.\n\nI'm",
"role": "assistant",
"logprobs": -18.19830319
}
}
],
"id": "chatcmpl-7f8f5312-893a-4dab-aff5-3a97a354c2be",
"created": 1695869575.316254,
"model": "abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq",
"usage": {
"prompt_tokens": 4,
"completion_tokens": 15,
"total_tokens": 19
}
}
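Streaming through the proxy should also work with the same pre-1.0 openai client; an untested sketch, assuming the proxy forwards stream=True to TGI:
import openai

openai.api_key = "xyz"
openai.api_base = "http://0.0.0.0:8000"

# iterate over OpenAI-style chunks; each delta carries a piece of the completion
for chunk in openai.ChatCompletion.create(
    model="huggingface/abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq",
    messages=[{"role": "user", "content": "How are you?"}],
    stream=True,
):
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)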
@zfang @paulcx I implemented this feature on the Apache-2.0 licensed forked project directly in Rust.
Hey @michaelfeil - is TGI closed-source now? I can't find other info on this.
We added a restriction in 1.0 which means you cannot use it as a cloud provider as-is without getting a license from us. Most likely it doesn't change anything for you.
More details here: https://github.com/huggingface/text-generation-inference/issues/744
@zfang @paulcx I implemented this feature on the Apache-2.0 licensed forked project directly in Rust.
Can I use this to wrap the official Inference API as published by HF? I can't seem to find an example of how to create models using the HF Inference API from LlamaIndex.
@abhinavkulkarni Hi, how do I run a LiteLLM proxy server?
I run litellm --model huggingface/meta-llama/Llama-2-70b-chat-hf --api_base http://0.0.0.0:8080/generate
on the Linux command line, but it says bash: litellm: command not found.
@yanmengxiang1: Please install litellm using pip.
@abhinavkulkarni Yes, I know. Should I use something like Flask to wrap this TGI?
Hey @yanmengxiang1:
Run TGI at port 8080. Then run litellm so that it points to TGI:
litellm --model huggingface/meta-llama/Llama-2-70b-chat-hf --api_base http://localhost:8080 --port 8000
You now have an OpenAI-compatible API endpoint at port 8000.
@yanmengxiang1 the relevant docs - https://docs.litellm.ai/docs/proxy_server
It seems this feature is going to be deprecated? So how future-proof is it to build an application around it?
Hey @LarsHill the LiteLLM community is discussing the best approach right now - https://github.com/BerriAI/litellm/discussions/648#discussioncomment-7375276
Some context: we'd initially planned on the Docker container being an easier replacement (consistent environment + easier to deploy), but it might not be ideal. So we're trying to understand what works best (how do you provide a consistent experience + an easy ability to set up configs, etc.).
DM'ing you to understand what a good experience here looks like.
@yanmengxiang1 the relevant docs - https://docs.litellm.ai/docs/proxy_server
It seems this feature is going to be deprecated? So how future-proof is it to build an application around it?
text-generation-webui has made huge progress on supporting other providers through extensions; you can serve an OpenAI-compatible API using these commands:
# clone the repo, then cd into it
# install deps:
!pip install -q -r requirements.txt --upgrade
# install extensions (openai...)
!pip install -q -r extensions/openai/requirements.txt --upgrade
# download your model (this way allows you to download large models):
!python download-model.py https://huggingface.co/TheBloke/SauerkrautLM-UNA-SOLAR-Instruct-GPTQ
# this one works better for MemGPT
# serve your model (check the name of the download file/directory):
!python server.py --model TheBloke_SauerkrautLM-UNA-SOLAR-Instruct-GPTQ --n-gpu-layers 24 --n_ctx 2048 --api --nowebui --extensions openai
# or download a specific file (if using GGUF models):
!python download-model.py https://huggingface.co/TheBloke/dolphin-2.7-mixtral-8x7b-GGUF --specific-file dolphin-2.7-mixtral-8x7b.Q2_K.gguf
Your server should be up and running on port 5000 (by default):
!curl http://0.0.0.0:5000/v1/completions -H "Content-Type: application/json" -d '{ "prompt": "This is a cake recipe:\n\n1.","max_tokens": 200, "temperature": 1, "top_p": 0.9, "seed": 10 }'
This way you can run any model (even those that aren't available as Ollama Docker images), without hitting Hugging Face's API, including large models (>= 10 GB) and models that don't have an Inference API. Neither litellm nor ollama is required.
This functionality is now supported in TGI with the introduction of the Messages API and can be used like this:
from openai import OpenAI
# init the client but point it to TGI
client = OpenAI(
base_url="http://localhost:3000/v1",
api_key="-"
)
chat_completion = client.chat.completions.create(
model="tgi",
messages=[
{"role": "system", "content": "You are a helpful assistant." },
{"role": "user", "content": "What is deep learning?"}
],
stream=False
)
print(chat_completion)
Please see the docs here for more details https://huggingface.co/docs/text-generation-inference/messages_api
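The Messages API also supports streaming with the same OpenAI client; a minimal variant of the example above:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="-"
)

# stream=True yields chat.completion.chunk objects; print tokens as they arrive
stream = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)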
Feature request
Hi,
I was wondering if it would be possible to have an OpenAI-based API.
Motivation
Many projects have been built around the OpenAI API, similar to what vLLM and a few other inference servers have. If TGI can have this, we can just swap the base URL in projects such as aider and many more, and use them without the hassle of changing any code.
https://github.com/paul-gauthier/aider https://github.com/AntonOsika/gpt-engineer https://github.com/Significant-Gravitas/Auto-GPT
And many more.
For reference, vLLM has a wrapper and text-generation-webui has one too.
Your contribution
discuss.