A chatbot web app + HTTP and WebSocket endpoints for LLM inference with the Petals client
You can try it out at https://chat.petals.dev or run the backend on your server using these commands:
```bash
git clone https://github.com/petals-infra/chat.petals.dev.git
cd chat.petals.dev
pip install -r requirements.txt
flask run --host=0.0.0.0 --port=5000
```
🦙 **Want to serve Llama 2?** Request access to its weights at the ♾️ Meta AI website and 🤗 Model Hub, then run `huggingface-cli login` in the terminal before starting the web app. If you don't want Llama 2, just remove the `meta-llama` models from `config.py`.
**Deploying with Gunicorn.** In production, we recommend using gunicorn instead of the Flask dev server:

```bash
gunicorn app:app --bind 0.0.0.0:5000 --worker-class gthread --threads 100 --timeout 1000
```
The chat uses the WebSocket API under the hood.
The backend provides two API endpoints:

- WebSocket API (`/api/v2/generate`, recommended)
- HTTP API (`/api/v1/...`)

Please use the WebSocket API when possible - it is much faster, more powerful, and consumes fewer resources.
If you develop your own web app, you can use our endpoint at https://chat.petals.dev/api/...
for research and development, then set up your own backend for production using the commands above.
Note: We do not recommend using the endpoint at `https://chat.petals.dev/api/...` in production. It has limited throughput, and we may pause or stop it at any time.
### WebSocket API (`/api/v2/generate`)

To use this API, you open a WebSocket connection and exchange JSON-encoded requests and responses. This can be done from any programming language.
🐍 **Using Python on Linux/macOS?** Please consider running the native Petals client instead. This way, you can connect to the swarm directly (without this API endpoint) and even run fine-tuning.
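For reference, a minimal sketch of the native client route, based on the Petals library's standard usage (assumes `pip install petals transformers` and access to the Llama 2 weights, as described above):

```python
# Sketch: inference through the native Petals client instead of this web API.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "meta-llama/Llama-2-70b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Connects to the public Petals swarm and runs the model collaboratively
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer('A cat in French is "', return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0]))
```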
The requests must follow this protocol:
The first request must be of type `open_inference_session` and include these parameters:

- `max_length` (int) - Max length of the generated text (including the prompt) in tokens
Notes:

- The inference session created by this request is unique to this WebSocket connection and cannot be reused in other connections.
- The inference session is closed automatically when the connection is closed.
Request:
{type: "open_inference_session", max_length: 1024}
Response:
```javascript
{ok: true}                      // If successful
{ok: false, traceback: "..."}   // If failed
```
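For illustration, here is a minimal Python sketch of this handshake using the `websockets` package. The `wss://chat.petals.dev/api/v2/generate` URL is assumed from the public endpoint above; point it at your own backend in production.

```python
# Sketch: opening an inference session over the WebSocket API.
# Assumes `pip install websockets`; the URL is the assumed public endpoint.
import asyncio
import json

import websockets

async def open_session():
    async with websockets.connect("wss://chat.petals.dev/api/v2/generate") as ws:
        # First request: open_inference_session with the max_length parameter
        await ws.send(json.dumps({"type": "open_inference_session", "max_length": 1024}))
        response = json.loads(await ws.recv())
        if not response["ok"]:
            raise RuntimeError(response["traceback"])
        print("Session opened")

asyncio.run(open_session())
```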
The next requests must be of type `generate` and include the same parameters as in the `/api/v1/generate` HTTP API. In contrast to the HTTP API, you can use this API in a streaming fashion, generating a response token by token and accepting intermediate prompts from a user (e.g., to make a chatbot).

A new feature of the WebSocket API is the `stop_sequence` parameter (str, optional). If you set it, the server will continue generation with the same parameters until it generates the `stop_sequence`, so you may get multiple responses without having to send the request again and wait for the round trip's latency.

Intermediate responses contain the field `stop: false`, and the last response contains `stop: true`. For example, you can set `max_new_tokens: 1` and receive tokens one by one, as soon as they are generated. Check out the chat's frontend code for a detailed example of how to do that.
Request:
{type: "generate", "inputs": "A cat in French is \"", "max_new_tokens": 3}
Response (one or multiple):
```javascript
{ok: true, outputs: "chat\".", stop: true}   // If successful
{ok: false, traceback: "..."}                // If failed
```
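Putting it together, here is a self-contained Python sketch that streams tokens one by one, again assuming the `websockets` package and the `wss://` URL of the public endpoint; the prompt and stop sequence are only illustrative:

```python
# Sketch: streaming generation over the WebSocket API, one token per response.
import asyncio
import json

import websockets

URL = "wss://chat.petals.dev/api/v2/generate"  # assumed public endpoint; use your own backend in production

async def stream(prompt: str) -> None:
    async with websockets.connect(URL) as ws:
        await ws.send(json.dumps({"type": "open_inference_session", "max_length": 1024}))
        assert json.loads(await ws.recv())["ok"]

        # max_new_tokens: 1 yields one response per token; stop_sequence keeps the
        # server generating until it produces the sequence (or reaches max_length).
        await ws.send(json.dumps({
            "type": "generate",
            "inputs": prompt,
            "max_new_tokens": 1,
            "stop_sequence": "\n\n",
        }))
        while True:
            response = json.loads(await ws.recv())
            if not response["ok"]:
                raise RuntimeError(response["traceback"])
            print(response["outputs"], end="", flush=True)
            if response["stop"]:
                break

asyncio.run(stream('A cat in French is "'))
```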
### HTTP API (`/api/v1/...`)

#### POST /api/v1/generate

Parameters:

- `model` (str) - Model name (one of the models from `config.py`)
- `inputs` (str) - The input prompt
Generation parameters (compatible with `.generate()` from 🤗 Transformers):

- `do_sample` (bool, optional) - If `0` (default), runs greedy generation. If `1`, performs sampling with the parameters below.
- `temperature` (float, optional) - Sampling temperature.
- `top_p` (float, optional) - Nucleus sampling cutoff.
- `max_length` (int) - Max length of the generated text (including the prompt) in tokens.
- `max_new_tokens` (int) - Max number of newly generated tokens (excluding the prompt).

Notes:
- You need to specify either `max_length` or `max_new_tokens`.
- If you'd like to solve downstream tasks in the zero-shot mode, start with `do_sample=0` (default).
- If you'd like to make a chatbot or write long text, start with `do_sample=1, temperature=0.6, top_p=0.9`.

Returns (JSON):

- `ok` (bool) - Whether the request succeeded
- `outputs` (str) - The generated text (if successful)
- `traceback` (str) - The Python traceback if `ok == False`
Example (curl):
```bash
$ curl -X POST "https://chat.petals.dev/api/v1/generate" -d "model=meta-llama/Llama-2-70b-chat-hf" -d "inputs=Once upon a time," -d "max_new_tokens=20"
{"ok":true,"outputs":" there was a young woman named Sophia who lived in a small village nestled in the rolling hills"}
```