cc @Wauplin @SBrandeis I guess
Sorry about the breaking change, @Kardbord. We'll provide a curl request to replicate your previous one.
Much appreciated! Thank you for your help.
Hi @Kardbord, sorry for the very long delay before getting back to you, but I now have a complete answer for you! :hugs:
What you need to know:

- The `conversational` format has been deprecated in favor of the Chat Completion API made popular by the OpenAI API.
- Some models are served with `text-generation-inference` (TGI). This is our modern framework dedicated to inference, which we use to serve the popular LLMs: llama3, mistral, gemma, starcoder, etc. You can find the list of deployed models here (a cached list, so not always 100% accurate, but it gives you a good idea). TGI-served models have a `/v1/chat/completions` route dedicated to Chat Completion.
- Other models are served with the `transformers` backend, the same as many models for other tasks in the Inference API. Those models only expose a `text-generation` pipeline under the `/` route. It is possible to provide a list of messages as input, and the output will be a string (i.e. the response from the "assistant").

To list the models currently served by TGI:

```bash
$ curl -s 'https://api-inference.huggingface.co/framework/text-generation-inference' |
    jq -r '.[] | .model_id' |
    sort |
    uniq
# bigcode/octocoder
# bigcode/santacoder
# bigcode/starcoder
# bigcode/starcoder2-15b
# bigcode/starcoder2-3b
# bigscience/bloom
# codellama/CodeLlama-13b-hf
# ...
```
To query a TGI-served model through the Chat Completion route:

```bash
$ curl https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B-Instruct/v1/chat/completions \
    -H "Authorization: Bearer hf_***" \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is deep learning?"
        }
    ],
    "model": "tgi",
    "max_tokens": 20
}'
# {"id":"","object":"text_completion","created":1714736594,"model":"meta-llama/Meta-Llama-3-70B-Instruct","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"message":{"role":"assistant","content":"Deep learning is a subset of machine learning that involves the use of artificial neural networks to model and solve"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":26,"completion_tokens":20,"total_tokens":46}}
```
(Note: passing `"model": "tgi"` is only there to comply with the expected format; the value is ignored.)
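If you just want the generated text, you can extract it from the response with `jq`, e.g. (assuming the response shape shown above):

```bash
$ curl -s https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B-Instruct/v1/chat/completions \
    -H "Authorization: Bearer hf_***" \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{"messages": [{"role": "user", "content": "What is deep learning?"}], "model": "tgi", "max_tokens": 20}' |
    jq -r '.choices[0].message.content'
# Deep learning is a subset of machine learning that involves the use of artificial neural networks to model and solve
```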
If you pass `"stream": true` in the body, you will receive a stream of events (one token == one event).
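For example (a sketch; the exact chunk payload varies with the TGI version):

```bash
# -N disables curl's output buffering so events print as they arrive.
$ curl -N https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B-Instruct/v1/chat/completions \
    -H "Authorization: Bearer hf_***" \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{
    "messages": [{"role": "user", "content": "What is deep learning?"}],
    "model": "tgi",
    "stream": true,
    "max_tokens": 20
}'
# The response arrives as server-sent events, one "data:{...}" line per token,
# each carrying a "delta" with the next piece of the assistant message:
# data:{"id":"",...,"choices":[{"index":0,"delta":{"role":"assistant","content":"Deep"},...}]}
# data:{"id":"",...,"choices":[{"index":0,"delta":{"role":"assistant","content":" learning"},...}]}
# ...
```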
For models served with the `transformers` backend (the `text-generation` pipeline), note that the parameters are passed as `"parameters": {}` and not at the root of the payload. The parameter names also differ in places (`max_tokens` vs `max_new_tokens`) and some do not exist. In particular, it is not possible to stream tokens from `transformers`. Parameters are detailed on this page.
```bash
$ curl https://api-inference.huggingface.co/models/microsoft/DialoGPT-large \
    -H "Authorization: Bearer hf_***" \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{
    "inputs": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is deep learning?"
        }
    ],
    "parameters": {"max_new_tokens": 20}
}'
# [{"generated_text":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What is deep learning?"},{"role":"assistant","content":"It's a computer science term."}]}]
```
You can have a look at the following PRs to check how it was implemented in the Python client:
Please let me know if you have any remaining questions!
Much appreciated, thank you for all your help! This should get me back in business.
Apologies if this doesn't belong here, but #457 is the only place I've found any information pertaining to my issue.
I maintain an inference endpoint wrapper for Go (Kardbord/hfapigo), and have noticed that for about a month I have not been able to successfully make requests to the conversational endpoint.
To reproduce:
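A request in the legacy `conversational` payload format, along these lines (a sketch; the model and messages are placeholders):

```bash
$ curl https://api-inference.huggingface.co/models/microsoft/DialoGPT-large \
    -H "Authorization: Bearer hf_***" \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{
    "inputs": {
        "past_user_inputs": ["Which movie is the best?"],
        "generated_responses": ["It is Die Hard for sure."],
        "text": "Can you explain why?"
    }
}'
```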
This gives an error response rather than the conversational reply it used to return. Based on #457, it seems like the conversational endpoint may be going away, replaced with `text-generation`. If that's so, I have a couple of questions.