turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Streaming API #37

Open bkutasi opened 1 year ago

bkutasi commented 1 year ago

Foremost, this is a terrific project. I've been trying to integrate it with other apps, but the API is a little different from other implementations like KoboldAI and its API, or textgen-webui and its API examples. With my limited knowledge I could get it to work (while the web app is running) using the following script, though it's not the best:

import requests
import json
import sys

url = 'http://0.0.0.0:5005/api/userinput'
data = {'user_input': 'What time is it? Write a very looong essay about time.'}
headers = {'Content-type': 'application/json'}

# send the POST request and stream the response
response = requests.post(url, data=json.dumps(data), headers=headers, stream=True)

# extract the text values from the streamed JSON lines
for line in response.iter_lines():
    if not line:
        continue  # iter_lines() yields blank keep-alive lines; skip them
    print(json.loads(line).get('text'), end="")
    sys.stdout.flush()  # flush so tokens appear as they arrive

What do you think about the possibility of making a streaming API endpoint on /api/stream that is not tied to the backend's user handling and message saving, and is "stateless" so it follows REST principles? Since this is one of the most performant backends, that would surely boost its popularity.
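For illustration, here is a minimal sketch of what such a stateless streaming endpoint could look like using FastAPI's StreamingResponse. This is only a sketch: generate_stream() is a stub standing in for the real exllama generator, and the route and field names are made up:

import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class Prompt(BaseModel):
    prompt: str
    max_new_tokens: int = 200

def generate_stream(prompt, max_new_tokens):
    # Stub: a real implementation would yield tokens from the model
    # as they are generated.
    for token in ["It", " is", " time", "."][:max_new_tokens]:
        yield token

@app.post("/api/stream")
async def stream(req: Prompt):
    def token_stream():
        # One JSON object per line, the same shape the client script above expects
        for token in generate_stream(req.prompt, req.max_new_tokens):
            yield json.dumps({"text": token}) + "\n"
    return StreamingResponse(token_stream(), media_type="application/x-ndjson")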

turboderp commented 1 year ago

There are some people already working on APIs. But it is on my list. I just need to do a little more research to figure out what the best, minimal stateless API would look like.

disarmyouwitha commented 1 year ago

@bkutasi I have a (very) basic "stateless" API wrapper for exllama that might point you in the right direction:

https://github.com/disarmyouwitha/exllama/blob/master/fast_api.py
https://github.com/disarmyouwitha/exllama/blob/master/fastapi_chat.html
https://github.com/disarmyouwitha/exllama/blob/master/fastapi_request.py

fast_api.py is just a FastAPI wrapper around the model and generate_simple functions. It takes the -d argument for the model directory, loads the model, and starts listening on port 7862 for POST requests to http://localhost:7862/generate
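For context, the general shape of such a wrapper is sketched below. This is illustrative only, not the actual fast_api.py; the request schema is an assumption and generate_simple() is stubbed:

import argparse

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

parser = argparse.ArgumentParser()
parser.add_argument("-d", "--directory", required=True, help="model directory")
args = parser.parse_args()
# ... loading the model from args.directory would happen here ...

def generate_simple(prompt, max_new_tokens=200):
    # Stub standing in for exllama's generator call
    return "stub completion for: " + prompt

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 200

@app.post("/generate")
def generate(req: GenerateRequest):
    return {"text": generate_simple(req.prompt, req.max_new_tokens)}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=7862)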

You can go to /chat to have FastAPI serve the HTML, which lets you use the page from a browser.

fastapi_request.py is an example script showing how to call the API from Python.
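A hedged example of what such a call can look like; the "prompt" field name is an assumption here, so check fastapi_request.py for the actual schema:

import requests

data = {"prompt": "What time is it?", "max_new_tokens": 200}
with requests.post("http://localhost:7862/generate", json=data, stream=True) as r:
    r.raise_for_status()
    # Print the response as it arrives rather than waiting for completion
    for chunk in r.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)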

This is just a quick implementation; I will actually be revisiting this code to work in some of the new improvements Turboderp made... after I get in a bit of Diablo 4 this week ^^;

bkutasi commented 1 year ago

Your implementation looks great, I will try it out right away. Would love to see it merged down the line (in some form) into the main branch.

bkutasi commented 1 year ago

@disarmyouwitha your FastAPI wrapper is working great, but the web interface doesn't send generation requests when it isn't accessed through localhost, even when the server is listening on 0.0.0.0. Other requests are probably not sent either, though the page itself loads. Basically, everything Jinja2-related works, but the other two don't. Sorry for mentioning it here, but I didn't see issue reporting enabled on your repo. I hope turboderp won't mind; otherwise let's move this elsewhere.

disarmyouwitha commented 1 year ago

@bkutasi oh hm, I never noticed that you had to enable issues. I have opened up the Issues tab in my repo, so if you continue to have problems we can follow up there =]

Are you accessing the GUI by clicking the .html file, or by going to http://host:7862/chat?

If you access it through the HTML file, it will always assume localhost:

// Check if the page was loaded from FastAPI or opened independently
if (!window.location.href.startsWith("http://{{host}}:{{port}}/")) 
{
    host = "localhost";
    port = "7862";
}

If you access it through /chat, it should be determining your host like this:

@app.get("/chat")
async def chat(request: Request, q: Union[str, None] = None):
    return templates.TemplateResponse("fastapi_chat.html", {"request": request, "host": socket.gethostname(), "port": _PORT})

(But maybe I was trying to be too clever and broke something)

I have the FastAPI running on a headless server, so I access the page like this: http://wintermute:7862/chat

And in fastapi_request.py I use:

r = requests.post("http://wintermute:7862/generate", json=data, stream=True)

It may be worth mentioning that you will probably need to open port 7862 in your firewall to access it from another machine:

sudo ufw allow 7862