Spiritdude opened this issue 1 year ago
I've been writing some generic FastAPIs to try various ggml libraries using ctransformers:
https://huggingface.co/spaces/matthoffner/wizardcoder-ggml https://huggingface.co/spaces/matthoffner/starchat-ggml
I've made some recent updates to make the API match the OpenAI response format better so it can be used like llama.cpp + OpenAI.
Nice @matthoffner
@Spiritdude I have been thinking of adding an OpenAI-compatible API server but not getting the time to do it. For now if you want to create your own API server, you can use the examples provided by matthoffner as reference. One thing to note is that the models are not thread-safe so you will have to use a lock to prevent concurrent calls.
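To illustrate the locking pattern described above, here is a minimal sketch. `fake_generate` is a stand-in for calling a real ctransformers model (e.g. `llm(prompt)`), since only the serialization pattern matters here:

```python
import threading

# ctransformers models are not thread-safe, so a single lock serializes
# all generation calls across request-handler threads.
model_lock = threading.Lock()
results = []

def fake_generate(prompt):
    # stand-in for calling the actual model, e.g. llm(prompt)
    return f"echo: {prompt}"

def handle_request(prompt):
    with model_lock:  # only one thread touches the model at a time
        results.append(fake_generate(prompt))

threads = [threading.Thread(target=handle_request, args=(f"p{i}",))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In an API server the `handle_request` body would wrap the model call inside each endpoint handler; the `with model_lock:` block is the important part.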
I'm also trying to make ctransformers compatible with 🤗 Transformers so that it can be used as a drop-in replacement in other projects but it is WIP. See https://github.com/marella/ctransformers/issues/13#issuecomment-1587946731
@marella thanks for ctransformers and all the effort; I really appreciate the attention to detail (both in ctransformers itself and in the compatibility considerations). I'm mainly using llama_python.server and keep all my apps on its REST API for compatibility.
@matthoffner great, thanks - will use your small code snippet in the meanwhile.
If you want, there is already a REST API that supports ctransformers: https://github.com/ParisNeo/lollms
it allows you to generate text using a distributed or centralized architecture with multiple service nodes and you can connect to it from one or many clients.
It is a many to many server.
It also has some examples to use it as a library. I have created a front end for it, like a playground: https://github.com/ParisNeo/lollms-playground
You install it by just doing:
pip install lollms
then you set it up using:
lollms-settings
With this you can select a binding (ctransformers for example), then select a model, and optionally one of the preconditioned personalities (I have 260 of them).
then you run
lollms-server
This will run a localhost:9600 service
you can run many of them using this:
lollms-server --host 0.0.0.0 --port 9601
You can select any host and port you want, then run the playground or write your own code. You can choose whether or not to use the personality preconditioning.
I use socketio for generation, which lets me send:

socket.emit('generate_text', {
  prompt,
  personality: -1,
  n_predicts: 1024,
  parameters: {
    temperature: temperatureValue,
    top_k: topKValue,
    top_p: topPValue,
    repeat_penalty: repeatPenaltyValue, // update with desired repeat penalty value
    repeat_last_n: repeatLastNValue,    // update with desired repeat_last_n value
    seed: parseInt(seedValue)
  }
});
personality is the id of mounted personalities in the server. You can mount many allowing the user to choose.
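For anyone calling this from Python rather than JavaScript, the same event could be sent with the python-socketio client. The payload below just mirrors the JS object above; the prompt and sampling values are placeholder assumptions, and the connection lines are commented out since they need a running lollms-server:

```python
# Payload mirroring the JS socket.emit('generate_text', ...) call above.
payload = {
    "prompt": "Write a haiku about code.",  # placeholder prompt
    "personality": -1,                      # -1 = no personality preconditioning
    "n_predicts": 1024,
    "parameters": {
        "temperature": 0.7,                 # placeholder sampling values
        "top_k": 40,
        "top_p": 0.95,
        "repeat_penalty": 1.1,
        "repeat_last_n": 64,
        "seed": -1,
    },
}

# Requires `pip install python-socketio` and a running lollms-server:
# import socketio
# sio = socketio.Client()
# sio.connect("http://localhost:9600")
# sio.emit("generate_text", payload)
```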
Thanks @ParisNeo very cool
I'm hoping to help get ctransformers added to OpenLLM as well:
@matthoffner Hello, where is your code snippet using ctransformers as an OpenAI-like server backend? I'd like to use it.
These should still work:
HuggingFace: https://huggingface.co/spaces/matthoffner/wizardcoder-ggml Github: https://github.com/matthoffner/ggml-fastapi
@matthoffner thanks so much, just found it in the Files tab.
However, I found a weird issue.
The detokenized text returned to my client drops spaces:
I changed nothing but return the new_text from chat_chunk and stream it to the client. Do you have any idea?
Thanks @lucasjinreal, I haven't seen this issue.
Here are some recent UIs I built around WizardCoder if you are looking for some client side examples.
Live HTML Editor: https://github.com/matthoffner/wizardcoder-sandbox Chatbot-ui: https://huggingface.co/spaces/matthoffner/starchat-ui
@matthoffner Can you tell me how you resolved the whitespace issue? I printed the stream out one token at a time, and it really has no whitespace in between. Meanwhile, it is not a client issue: this client works with many OpenAI-like servers of mine, just not yet with ctransformers.
@matthoffner
from typing import Generator

async def stream_response(tokens, llm):
    try:
        iterator: Generator = llm.generate(tokens)
        for chat_chunk in iterator:
            print(llm.detokenize(chat_chunk), end='', flush=True)
            response = {
                'choices': [
                    {
                        'message': {
                            'role': 'system',
                            'content': llm.detokenize(chat_chunk)
                        },
                        'finish_reason': 'stop' if llm.is_eos_token(chat_chunk) else 'unknown'
                    }
                ]
            }
            yield response
    except Exception as e:
        print(f'stream_response error: {e}')
Normally, it should print the stream out with a typewriter effect, but the output doesn't include whitespace.
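A likely cause (an assumption, not confirmed in this thread): detokenizing one token at a time can drop the leading-space marker that SentencePiece-style tokenizers attach to each token. A common workaround is to detokenize the accumulated token list and emit only the newly added suffix. The sketch below uses a stub in place of a real ctransformers model so the pattern is self-contained:

```python
# StubLLM stands in for a ctransformers model; its toy detokenizer joins
# word-tokens with spaces, which is enough to demonstrate the delta technique.
class StubLLM:
    def detokenize(self, tokens):
        return " ".join(tokens)

def stream_text(llm, token_iter):
    seen = []       # all tokens received so far
    emitted = ""    # text already sent to the client
    for tok in token_iter:
        seen.append(tok)
        full = llm.detokenize(seen)      # detokenize the whole sequence
        new_text = full[len(emitted):]   # the delta keeps its leading space
        emitted = full
        yield new_text

llm = StubLLM()
chunks = list(stream_text(llm, ["Hello", "world", "!"]))
# chunks == ["Hello", " world", " !"]
```

Each yielded chunk then carries its own leading space, so the client can simply concatenate them.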
BTW, I'm using the ctransformers CLI-style loop without any issue:
outputs = ''
for text in llm(prompt, stream=True):
    print(text, end="", flush=True)
    outputs += text
print()
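One way to expose a loop like that over HTTP (a sketch, not something from this thread) is to frame each streamed chunk as a server-sent event and hand the generator to any framework's streaming response. `fake_llm` is a stand-in for a ctransformers model called with `stream=True`:

```python
def fake_llm(prompt, stream=True):
    # stand-in for a real ctransformers model yielding text chunks
    yield from ["Hello", ", ", "world"]

def sse_stream(llm, prompt):
    for text in llm(prompt, stream=True):
        yield f"data: {text}\n\n"  # SSE framing: one event per chunk
    yield "data: [DONE]\n\n"       # OpenAI-style stream terminator

# With FastAPI, for example, the generator would be served as:
#   StreamingResponse(sse_stream(llm, prompt),
#                     media_type="text/event-stream")
events = list(sse_stream(fake_llm, "hi"))
```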
What's the best practice to interface ctransformers API and expose it as REST API?
I looked at https://github.com/jquesnelle/transformers-openai-api and tried to change it to use ctransformers, but I stopped as the required changes kept growing, making it harder to keep in sync with the original.