marella / ctransformers

Python bindings for the Transformer models implemented in C/C++ using GGML library.
MIT License

REST API? #26

Open Spiritdude opened 1 year ago

Spiritdude commented 1 year ago

What's the best practice to interface ctransformers API and expose it as REST API?

I looked at https://github.com/jquesnelle/transformers-openai-api and tried to change it to use ctransformers, but I stopped because the required changes kept growing, making it hard(er) to keep in sync with the original.

matthoffner commented 1 year ago

I've been writing some generic FastAPIs to try various ggml libraries using ctransformers:

https://huggingface.co/spaces/matthoffner/wizardcoder-ggml
https://huggingface.co/spaces/matthoffner/starchat-ggml

I've made some recent updates to make the API match the OpenAI response format better so it can be used like llama.cpp + OpenAI.
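
For reference, a server that matches the OpenAI response format can be driven with the stock openai Python client; here is a minimal sketch, assuming a local server on port 8000 and a placeholder model name (written against the pre-1.0 openai client that was current at the time):

# Sketch: point the openai client at a local OpenAI-compatible server.
# The base URL, key, and model name below are illustrative assumptions.
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "unused"  # local servers typically ignore the key

resp = openai.ChatCompletion.create(
    model="wizardcoder-ggml",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(resp["choices"][0]["message"]["content"])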

marella commented 1 year ago

Nice @matthoffner


@Spiritdude I have been thinking of adding an OpenAI-compatible API server but not getting the time to do it. For now if you want to create your own API server, you can use the examples provided by matthoffner as reference. One thing to note is that the models are not thread-safe so you will have to use a lock to prevent concurrent calls.
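
A minimal sketch of that locking pattern, assuming a FastAPI server (the model path and request schema are placeholders, not from this thread):

# Sketch: serialize access to a non-thread-safe ctransformers model.
from threading import Lock

from ctransformers import AutoModelForCausalLM
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
llm = AutoModelForCausalLM.from_pretrained("/path/to/model.bin", model_type="llama")  # placeholder path
llm_lock = Lock()  # the model is not thread-safe; allow one generation at a time

class CompletionRequest(BaseModel):
    prompt: str

@app.post("/v1/completions")
def completions(req: CompletionRequest):
    with llm_lock:  # concurrent requests wait here instead of corrupting model state
        text = llm(req.prompt)
    return {"choices": [{"text": text}]}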

I'm also trying to make ctransformers compatible with 🤗 Transformers so that it can be used as a drop-in replacement in other projects but it is WIP. See https://github.com/marella/ctransformers/issues/13#issuecomment-1587946731

Spiritdude commented 1 year ago

@marella thanks for ctransformers and all the effort, much appreciated, especially the attention to detail (both in ctransformers itself and in the compatibility considerations). I'm mainly using llama_python.server and keep all my apps on its REST API for compatibility.

@matthoffner great, thanks - will use your small code snippet in the meantime.

ParisNeo commented 1 year ago

If you want, there is already a REST API that supports ctransformers: https://github.com/ParisNeo/lollms

It allows you to generate text using a distributed or centralized architecture with multiple service nodes, and you can connect to it from one or many clients.

It is a many-to-many server.

It also has some examples of using it as a library. I have created a playground-like front end for it: https://github.com/ParisNeo/lollms-playground

You install it by just doing:

pip install lollms

then you set it up using:

lollms-settings

With this you can select a binding (ctransformers, for example), then a model, and optionally one of the preconditioned personalities (I have 260 of them).

then you run

lollms-server

This will run a service on localhost:9600.

you can run many of them using this:

lollms-server --host 0.0.0.0 --port 9601

You can select any host you want. Then you can run the playground or write your own code, with or without the personality preconditioning.

I use socketio for generation, which lets you send:

socket.emit('generate_text', {
  prompt,
  personality: -1,
  n_predicts: 1024,
  parameters: {
    temperature: temperatureValue,
    top_k: topKValue,
    top_p: topPValue,
    repeat_penalty: repeatPenaltyValue, // Update with desired repeat penalty value
    repeat_last_n: repeatLastNValue, // Update with desired repeat_last_n value
    seed: parseInt(seedValue)
  }
});

personality is the id of a personality mounted on the server. You can mount many, allowing the user to choose.
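
For completeness, the same request can be made from Python with the python-socketio client; a sketch assuming the localhost:9600 service above (the response event name is a guess, check the lollms docs for the actual one):

# Sketch: Python client for the socketio generation endpoint.
import socketio

sio = socketio.Client()
sio.connect("http://localhost:9600")

@sio.on("text_chunk")  # hypothetical event name for streamed output
def on_chunk(data):
    print(data, end="", flush=True)

sio.emit("generate_text", {
    "prompt": "Once upon a time",
    "personality": -1,
    "n_predicts": 1024,
})
sio.wait()  # keep the client alive while tokens stream back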

matthoffner commented 1 year ago

Thanks @ParisNeo very cool

I'm hoping to help get ctransformers added to OpenLLM as well:

https://github.com/bentoml/OpenLLM/issues/24

lucasjinreal commented 1 year ago

@matthoffner Hello, where is your code snippet using ctransformers as an OpenAI-like server backend? I would like to use it.

matthoffner commented 1 year ago

These should still work:

HuggingFace: https://huggingface.co/spaces/matthoffner/wizardcoder-ggml
Github: https://github.com/matthoffner/ggml-fastapi

lucasjinreal commented 1 year ago

@matthoffner thanks so much, just found it in files tab.

However, I found a weird issue.

The detokenized text returned to my client drops spaces:

(screenshot: streamed output with the spaces missing)

I changed nothing except returning the new text from chat_chunk and streaming it to the client. Do you have any idea?

matthoffner commented 1 year ago

Thanks @lucasjinreal, I haven't seen this issue.

Here are some recent UIs I built around WizardCoder, if you are looking for some client-side examples.

Live HTML Editor: https://github.com/matthoffner/wizardcoder-sandbox
Chatbot-ui: https://huggingface.co/spaces/matthoffner/starchat-ui

lucasjinreal commented 1 year ago

@matthoffner Can you tell me how you resolved the whitespace issue? I printed the stream out token by token, and it actually has no whitespace in between. Meanwhile, it is not a client issue: this client runs against many of my OpenAI-like servers, just not yet against ctransformers.

lucasjinreal commented 1 year ago

@matthoffner

import json
from typing import Generator


async def stream_response(tokens, llm):
    try:
        iterator: Generator = llm.generate(tokens)
        for chat_chunk in iterator:
            # Detokenize each generated token individually; this is where
            # the missing spaces show up in the streamed output.
            new_text = llm.detokenize(chat_chunk)
            print(new_text, end='', flush=True)
            response = {
                'choices': [
                    {
                        'message': {
                            'role': 'system',
                            'content': new_text
                        },
                        'finish_reason': 'stop' if llm.is_eos_token(chat_chunk) else 'unknown'
                    }
                ]
            }
            yield json.dumps(response)
    except Exception as e:
        print(f'generation failed: {e}')

Normally it should print out as a stream with a typewriter effect, but the output doesn't include whitespace...
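
For what it's worth, one common workaround when per-token detokenization drops leading spaces is to detokenize the accumulated token list each step and emit only the new suffix; a sketch using the same ctransformers calls as above:

# Sketch: stream the delta of the full detokenization so that leading
# spaces attached to individual tokens are preserved.
tokens = []
text_so_far = ''
for token in llm.generate(llm.tokenize(prompt)):
    tokens.append(token)
    full_text = llm.detokenize(tokens)
    delta = full_text[len(text_so_far):]  # only the newly produced text
    text_so_far = full_text
    print(delta, end='', flush=True)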

lucasjinreal commented 1 year ago

BTW, I'm using ctransformers directly without any issue:

outputs = ''
for text in llm(prompt, stream=True):
    print(text, end="", flush=True)
    outputs += text
print()