huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

OpenAI API Wrapper #735

Closed Ichigo3766 closed 7 months ago

Ichigo3766 commented 1 year ago

Feature request

Hi,

I was wondering if it would be possible to have an OpenAI-compatible API.

Motivation

Many projects have been built around the OpenAI API, similar to what vLLM and a few other inference servers provide. If TGI had this, we could just swap the base URL in projects such as aider and many more and use them without the hassle of changing the code.

https://github.com/paul-gauthier/aider https://github.com/AntonOsika/gpt-engineer https://github.com/Significant-Gravitas/Auto-GPT

And many more.

For reference, vLLM has a wrapper and text-generation-webui has one too.

Your contribution

discuss.

philschmid commented 1 year ago

Hello @bloodsucker99, I am not sure that's possible on the server side, since models use different prompt formats. So it might make sense to implement this on the client side, converting the OpenAI schema (a list of dicts) into a single prompt.

Ichigo3766 commented 1 year ago

Yeah, I kind of suspected that doing it on the server side would not be possible :(

Any chance you/anyone would be interested in building a middleman for this? A Python wrapper that just sits in the middle would be cool.

philschmid commented 1 year ago

I had some time and started working on something. I will share the first version here. I would love to get feedback if you are willing to try it out.

Ichigo3766 commented 1 year ago

I'd love to try it out. Also, is it possible to communicate over Discord? It would make things much easier :)

philschmid commented 1 year ago

Okay, I rushed out the first version. It is in a package I started called EasyLLM.

GitHub: https://github.com/philschmid/easyllm
Documentation: https://philschmid.github.io/easyllm/

The documentation also includes examples for streaming.

Example

Install EasyLLM via pip:

pip install easyllm

Then import and start using the clients:


from easyllm.clients import huggingface
from easyllm.prompt_utils import build_llama2_prompt

# helper to build llama2 prompt
huggingface.prompt_builder = build_llama2_prompt

response = huggingface.ChatCompletion.create(
    model="meta-llama/Llama-2-70b-chat-hf",
    messages=[
        {"role": "system", "content": "\nYou are a helpful assistant speaking like a pirate. argh!"},
        {"role": "user", "content": "What is the sun?"},
    ],
    temperature=0.9,
    top_p=0.6,
    max_tokens=256,
)

print(response)
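
For streaming, the call looks roughly like the sketch below. This is an assumption-laden sketch: it presumes easyllm mirrors the pre-1.0 OpenAI streaming interface (stream=True returning an iterable of incremental chunks); the linked documentation has the canonical streaming example.

# Rough sketch (not taken from the easyllm docs): stream=True is assumed to
# return an iterable of incremental chunks, printed here as they arrive.
for chunk in huggingface.ChatCompletion.create(
    model="meta-llama/Llama-2-70b-chat-hf",
    messages=[{"role": "user", "content": "What is the sun?"}],
    stream=True,
):
    print(chunk)
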
Ichigo3766 commented 1 year ago

This is interesting. Could you give me an example of connecting this to the TGI API? There is a model space, but that would be loading the model again, right? So instead, if I am using TGI, which already has the model loaded, how would I use its API here and get an OpenAI API out of it?

philschmid commented 1 year ago

No, it's a client. How would you add a wrapper when you don't know the prompt format on the server side? It might be possible to write a different server.rs which implements common templating, where you could define what you want when starting it, but that's a lot of work.

Ichigo3766 commented 1 year ago

Hi! I am a bit confused about what you mean by "you don't know the prompt format on the server side". There is a wrapper made by LangChain: https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/llms/huggingface_text_gen_inference.py

I was thinking of something along those lines, but for the OpenAI API, if that makes sense.

Narsil commented 1 year ago

"you don't know the prompt format on the server side"

I think what @philschmid meant is: how are you supposed to build the final, fully formed token sequence? TGI doesn't know how a model was trained/fine-tuned, so it doesn't know what a system prompt or user prompt is. It expects a single full string, which is what the LangChain wrapper sends.

So the missing step is going from

messages=[
    {"role": "system", "content": "\nYou are a helpful assistant speaking like a pirate. argh!"},
    {"role": "user", "content": "What is the sun?"},
],

To:

[[SYS]\nYou are a helpful assistant speaking like a pirate. argh[/SYS] What is the sun <s>

Which is needed for good results with https://huggingface.co/meta-llama/Llama-2-7b-chat-hf, for instance (don't quote me on the prompt, I did it from memory).
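
For illustration, the conversion step could look roughly like the sketch below. This is a simplified, hypothetical helper (not easyllm's actual implementation); the template string follows the Llama 2 chat format as documented on the model card, so double-check it there before relying on it, and note that it only handles a single system + user turn.

# Hypothetical sketch of the "missing step": turning OpenAI-style messages
# into a single prompt string for a Llama 2 chat model. Verify the template
# against the model card before use; only one system + one user turn handled.
def messages_to_llama2_prompt(messages):
    system = ""
    user = ""
    for msg in messages:
        if msg["role"] == "system":
            system = msg["content"]
        elif msg["role"] == "user":
            user = msg["content"]
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

prompt = messages_to_llama2_prompt([
    {"role": "system", "content": "You are a helpful assistant speaking like a pirate. argh!"},
    {"role": "user", "content": "What is the sun?"},
])
# `prompt` is the single string you would send to TGI (e.g. as `inputs` on /generate).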

paulcx commented 1 year ago

I agree with @Narsil's point. Some people or projects don't use OpenAI-style prompts. Eventually, all messages have to be merged into a single string as input to the LLM, which limits flexibility. One possible solution is to create an API template on the server side, allowing users to define their preferred API. However, implementing this approach might require a substantial amount of work and could potentially introduce bugs.

I have a question: Why is the TGI API slightly different from the TGI client SDK? For instance, the parameter 'detail' is ignored in the TGI client source code. Shouldn't they be exactly the same?

Narsil commented 1 year ago

Why is the TGI API slightly different from the TGI client SDK?

I'm not sure what you are referring to. The Python client could be slightly out of sync with the server, but that's not intentional.

Narsil commented 1 year ago

One possible solution is to create an API template on the server side

That's definitely an option. If we were to do it, I would like to pair it with guidance and token healing, since they seem to serve the same purpose: extending the querying API in a user-defined way (for both the server operator and the actual querying user).

viniciusarruda commented 1 year ago

I've implemented a small wrapper around chat completions for Llama 2. The easyllm package from @philschmid seems good; I've compared it with my Llama 2 implementation and it gives the same result!

paulcx commented 1 year ago

Why is the TGI API slightly different from the TGI client SDK?

I'm not sure what you are referring to. The Python client could be slightly out of sync with the server, but that's not intentional.

Here is what I'm referring to: the request parameters are slightly different from the ones in the API. It's okay, but why is 'detail' manually set to True here?

jcushman commented 1 year ago

In case this is helpful, llama.cpp does this via api_like_OAI.py. This PR would update that script to use fastchat's conversation.py to handle the serialization problem discussed upthread.

jcushman commented 1 year ago

And here is fastchat's own version of this: https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_server.py

zfang commented 1 year ago

I would love to have this supported.

abhinavkulkarni commented 1 year ago

LiteLLM has support for TGI: https://docs.litellm.ai/docs/providers/huggingface#text-generation-interface-tgi---llms

krrishdholakia commented 1 year ago

Thanks for mentioning us @abhinavkulkarni

Hey @Narsil @jcushman @zfang
Happy to help here.

This is the basic code:

import os 
from litellm import completion 

# [OPTIONAL] set env var
os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key" 

messages = [{ "content": "There's a llama in my garden 😱 What should I do?","role": "user"}]

# e.g. Call 'WizardLM/WizardCoder-Python-34B-V1.0' hosted on HF Inference endpoints
response = completion(model="huggingface/WizardLM/WizardCoder-Python-34B-V1.0", messages=messages, api_base="https://my-endpoint.huggingface.cloud")

print(response)

We also handle prompt formatting - https://docs.litellm.ai/docs/providers/huggingface#models-with-prompt-formatting based on the lmsys/fastchat implementation.

But you can overwrite this with your own changes if necessary - https://docs.litellm.ai/docs/providers/huggingface#custom-prompt-templates

zfang commented 1 year ago

Hi @krrishdholakia,

Thanks for the info. Instead of a client, I actually need a middle service, because I'm trying to host an API server for the Chatbot Arena: https://github.com/lm-sys/FastChat/blob/main/docs/arena.md#how-to-add-a-new-model

I can use vLLM to host a service that provides an OpenAI-compatible API, but it's quite a bit slower than TGI. It pains me that TGI doesn't support this. I will probably need to hack a FastChat service to redirect calls to TGI.

Regards,

Felix

krrishdholakia commented 1 year ago

@zfang we have an open-source proxy you can fork and run this through - https://github.com/BerriAI/liteLLM-proxy

Would it be helpful if we exposed a CLI command to deploy this through?

litellm --deploy
abhinavkulkarni commented 1 year ago

LiteLLM has developed an OpenAI wrapper for TGI (and for lots of other model-serving frameworks).

Here are more details: https://docs.litellm.ai/docs/proxy_server

You can set it up as follows:

Set up a local TGI endpoint first:

$ text-generation-launcher \
  --model-id abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq \
  --trust-remote-code --port 8080 \
  --max-input-length 5376 --max-total-tokens 6144 --max-batch-prefill-tokens 6144 \
  --quantize awq

Then I run a LiteLLM proxy server on top of that:

$ litellm \
  --model huggingface/abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq \
  --api_base http://localhost:8080

I am able to successfully obtain responses from the openai.ChatCompletion.create endpoint as follows:

>>> import openai
>>> openai.api_key = "xyz"
>>> openai.api_base = "http://0.0.0.0:8000"
>>> model = "huggingface/abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq"
>>> completion = openai.ChatCompletion.create(model=model, messages=[{"role": "user", "content": "How are you?"}])
>>> print(completion)
{
  "object": "chat.completion",
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "message": {
        "content": "I'm fine, thanks. I'm glad to hear that.\n\nI'm",
        "role": "assistant",
        "logprobs": -18.19830319
      }
    }
  ],
  "id": "chatcmpl-7f8f5312-893a-4dab-aff5-3a97a354c2be",
  "created": 1695869575.316254,
  "model": "abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq",
  "usage": {
    "prompt_tokens": 4,
    "completion_tokens": 15,
    "total_tokens": 19
  }
}
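
Note that the snippet above uses the pre-1.0 openai Python interface. With openai >= 1.0, the equivalent call against the same proxy would look roughly like the sketch below (an assumption: depending on how the proxy exposes its routes, you may need to append /v1 to the base URL):

from openai import OpenAI

# Point the client at the local LiteLLM proxy instead of api.openai.com.
# The api_key is a dummy value; adjust base_url if the proxy expects /v1.
client = OpenAI(base_url="http://localhost:8000", api_key="xyz")

completion = client.chat.completions.create(
    model="huggingface/abhinavkulkarni/codellama-CodeLlama-7b-Instruct-hf-w4-g128-awq",
    messages=[{"role": "user", "content": "How are you?"}],
)
print(completion.choices[0].message.content)
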
michaelfeil commented 1 year ago

@zfang @paulcx I implemented this feature in the Apache-2.0 licensed fork, directly in Rust.

https://github.com/Preemo-Inc/text-generation-inference

krrishdholakia commented 1 year ago

Hey @michaelfeil - is TGI closed-source now? I can't find other info on this.

(screenshot attached)
Narsil commented 1 year ago

We added a restriction in 1.0 which means you cannot use it as a cloud provider as-is without getting a license from us. Most likely it doesn't change anything for you.

More details here: https://github.com/huggingface/text-generation-inference/issues/744

adrianog commented 1 year ago

@zfang @paulcx I implemented this feature in the Apache-2.0 licensed fork, directly in Rust.

https://github.com/Preemo-Inc/text-generation-inference

Can I use this to wrap the official Inference API as published by HF? I can't seem to find an example of how to create models using the HF Inference API from LlamaIndex.

batindfa commented 1 year ago

@abhinavkulkarni Hi, how do I run a LiteLLM proxy server? I run litellm --model huggingface/meta-llama/Llama-2-70b-chat-hf --api_base http://0.0.0.0:8080/generate on the Linux command line, but it fails with: bash: litellm: command not found

abhinavkulkarni commented 1 year ago

@yanmengxiang1: Please install litellm using pip.

batindfa commented 1 year ago

@abhinavkulkarni Yes, I know. Should I use something like Flask to wrap this TGI endpoint?

abhinavkulkarni commented 1 year ago

Hey @yanmengxiang1:

Run TGI at port 8080. Then run litellm so that it points to TGI:

litellm --model huggingface/meta-llama/Llama-2-70b-chat-hf --api_base http://localhost:8080 --port 8000

You now have an OpenAI-compatible API endpoint at port 8000.

krrishdholakia commented 1 year ago

@yanmengxiang1 the relevant docs - https://docs.litellm.ai/docs/proxy_server

LarsHill commented 1 year ago

@yanmengxiang1 the relevant docs - https://docs.litellm.ai/docs/proxy_server

It seems this feature is going to be deprecated? So how future-proof is it to build an application around it?

krrishdholakia commented 12 months ago

Hey @LarsHill the LiteLLM community is discussing the best approach right now - https://github.com/BerriAI/litellm/discussions/648#discussioncomment-7375276

Some context: we'd initially planned on the Docker container being an easier replacement (consistent environment + easier to deploy), but it might not be ideal. So we're trying to understand what works best (how do you provide a consistent experience + an easy way to set up configs, etc.).

DM'ing you to understand what a good experience here looks like.

bitsnaps commented 8 months ago

@yanmengxiang1 the relevant docs - https://docs.litellm.ai/docs/proxy_server

It seems this feature is going to be deprecated? So how future-proof is it to build an application around it?

text-generation-webui has made huge progress on supporting other providers by including extensions; you can serve an OpenAI-compatible API using these commands:

# clone the repo, then cd into it

# install deps:
!pip install -q -r requirements.txt --upgrade
# install extensions (openai...)
!pip install -q -r extensions/openai/requirements.txt --upgrade

# download your model (this way allows you to download large models):
!python download-model.py https://huggingface.co/TheBloke/SauerkrautLM-UNA-SOLAR-Instruct-GPTQ 
# this one works better for MemGPT

# serve your model (check the name of the downloaded file/directory):
!python server.py --model TheBloke_SauerkrautLM-UNA-SOLAR-Instruct-GPTQ --n-gpu-layers 24 --n_ctx 2048 --api --nowebui --extensions openai 

# or download a specific file (if using GGUF models):
!python download-model.py https://huggingface.co/TheBloke/dolphin-2.7-mixtral-8x7b-GGUF  --specific-file dolphin-2.7-mixtral-8x7b.Q2_K.gguf

Your server should be up and running on port 5000 (by default):

!curl http://0.0.0.0:5000/v1/completions -H "Content-Type: application/json" -d '{ "prompt": "This is a cake recipe:\n\n1.","max_tokens": 200, "temperature": 1,  "top_p": 0.9, "seed": 10 }'

This approach lets you run any model (even ones that aren't available as Ollama Docker images) without hitting Hugging Face's API, including large models (>= 10 GB) and models that don't have an Inference API. Neither litellm nor Ollama is required.
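
As a quick check from Python, the OpenAI client can be pointed at the extension's endpoint. A sketch, with assumptions: it presumes the extension exposes the standard /v1 routes on port 5000 (as the curl example above suggests), that the api_key is ignored, and that the model name matches whatever the server has loaded.

from openai import OpenAI

# Point the OpenAI client at the local text-generation-webui OpenAI extension.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

completion = client.completions.create(
    model="TheBloke_SauerkrautLM-UNA-SOLAR-Instruct-GPTQ",  # assumed to match the loaded model
    prompt="This is a cake recipe:\n\n1.",
    max_tokens=200,
)
print(completion.choices[0].text)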

drbh commented 7 months ago

This functionality is now supported in TGI with the introduction of the Messages API and can be used like this:

from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="-"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=False
)

print(chat_completion)

Please see the docs here for more details: https://huggingface.co/docs/text-generation-inference/messages_api
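
Streaming works through the same client with stream=True. A minimal sketch, assuming the same local TGI server on port 3000 as above:

from openai import OpenAI

# Same setup as above, but streaming the response incrementally.
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="-"
)

stream = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "What is deep learning?"}],
    stream=True
)

for chunk in stream:
    # Each chunk carries an incremental piece of the assistant message.
    print(chunk.choices[0].delta.content or "", end="")
print()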