huggingface / huggingface.js

Utilities to use the Hugging Face Hub API
https://hf.co/docs/huggingface.js
MIT License

Errors with Conversational Endpoint #488

Closed: Kardbord closed this issue 4 months ago

Kardbord commented 7 months ago

Apologies if this doesn't belong here, but #457 is the only place I've found any information pertaining to my issue.

I maintain an inference endpoint wrapper for Go (Kardbord/hfapigo), and noticed that I have not been able to successfully make requests to the conversational endpoint for about a month.

To reproduce:

curl https://api-inference.huggingface.co/models/microsoft/DialoGPT-large \
        -X POST \
        -d '{"inputs": {"past_user_inputs": ["Which movie is the best ?"], "generated_responses": ["It is Die Hard for sure."], "text":"Can you explain why ?"}}' \
        -H "Authorization: Bearer ${HF_API_TOKEN}"

This gives

{
  "error": "unknown error",
  "warnings": [
    "There was an inference error: unknown error: can only concatenate str (not \"dict\") to str"
  ]
}

Based on #457, it seems like the conversational endpoint may be going away, replaced with text-generation. If that's so, I have a couple of questions.

  1. Is there somewhere I can watch for future updates like this so I'm not caught off-guard when they happen?
  2. How does one make a conversational request of a text-generation model? I tried to figure this out by reading through #457, but unfortunately I'm not particularly fluent in TypeScript, so a lot of it went over my head.
coyotte508 commented 7 months ago

cc @Wauplin @SBrandeis I guess

julien-c commented 7 months ago

sorry about the breaking change @Kardbord, we'll provide a curl request to replicate your previous one

Kardbord commented 7 months ago

Much appreciated! Thank you for your help.

Wauplin commented 4 months ago

Hi @Kardbord, sorry for the very long delay before getting back to you, but I now have a complete answer for you! :hugs:

Context

What you need to know:

  1. The legacy conversational format has been deprecated in favor of the Chat Completion API popularized by OpenAI.
  2. Language models on the Inference API are served by one of two backend frameworks, depending on the model.
    1. Most LLMs are served using text-generation-inference (TGI). This is our modern framework dedicated to inference, which we use to serve the popular LLMs: llama3, mistral, gemma, starcoder, etc. You can find the list of deployed models here (the list is cached, so it is not always 100% accurate, but it gives you a good idea; see the sketch after this list for a quick way to check it for a specific model). TGI-served models have a /v1/chat/completions route dedicated to Chat Completion.
    2. Other language models are served using the transformers backend, like many models for other tasks in the Inference API. Those models only expose a text-generation pipeline under the / route. It is possible to provide a list of messages as input, and the output will contain the response from the "assistant" (see the example below).
    3. A lot of LMs/LLMs from the Hub are simply not served in our serverless Inference API, for cost reasons (and because of lower popularity). It is still possible to deploy them on a (paid) dedicated Inference Endpoint, where you can choose the backend you want.
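
As a quick way to check whether a specific model is on the TGI list, here is a small shell sketch reusing the framework listing endpoint from the next section (the model name is just an example):

$ curl -s 'https://api-inference.huggingface.co/framework/text-generation-inference' |
        jq -r '.[] | .model_id' |
        grep -qx 'meta-llama/Meta-Llama-3-70B-Instruct' \
        && echo "model is served by TGI" \
        || echo "model is not in the (cached) TGI list"
# prints "model is served by TGI" if the model appears in the cached list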

cURL examples

List models served with TGI

$ curl -s 'https://api-inference.huggingface.co/framework/text-generation-inference' |
        jq -r '.[] |.model_id' |
        sort | 
        uniq
# bigcode/octocoder
# bigcode/santacoder
# bigcode/starcoder
# bigcode/starcoder2-15b
# bigcode/starcoder2-3b
# bigscience/bloom
# codellama/CodeLlama-13b-hf
# ...

Chat completion request (TGI)

$ curl https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B-Instruct/v1/chat/completions \
       -H "Authorization":"Bearer hf_***" \
       -H "Content-Type":"application/json" \
       -X POST \
         -d '{
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": "What is deep learning?"
        }
      ],
      "model": "tgi",
      "max_tokens": 20
    }'

# {"id":"","object":"text_completion","created":1714736594,"model":"meta-llama/Meta-Llama-3-70B-Instruct","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"message":{"role":"assistant","content":"Deep learning is a subset of machine learning that involves the use of artificial neural networks to model and solve"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":26,"completion_tokens":20,"total_tokens":46}}

(Note: passing "model": "tgi" is only required to comply with the expected format; the value itself is ignored.)

If you pass "stream": true in the body, you will receive a stream of events (one token per event).
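
For illustration, a sketch of the same request with streaming enabled; the event payloads shown in the comments are illustrative of TGI's OpenAI-compatible stream, not captured output:

$ curl -N https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3-70B-Instruct/v1/chat/completions \
       -H "Authorization: Bearer hf_***" \
       -H "Content-Type: application/json" \
       -X POST \
       -d '{"messages": [{"role": "user", "content": "What is deep learning?"}], "model": "tgi", "max_tokens": 20, "stream": true}'

# data: {"object":"chat.completion.chunk", ..., "choices":[{"index":0,"delta":{"role":"assistant","content":"Deep"},"finish_reason":null}]}
# data: {"object":"chat.completion.chunk", ..., "choices":[{"index":0,"delta":{"content":" learning"},"finish_reason":null}]}
# ...

(The -N flag disables curl's output buffering so tokens are printed as they arrive.)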

Chat with transformers' text-generation

Note that the parameters are passed under "parameters": {} and not at the root of the payload. Also, the parameter names can differ (max_tokens vs max_new_tokens) and some parameters do not exist; in particular, it is not possible to stream tokens from transformers. Parameters are detailed on this page.

$ curl https://api-inference.huggingface.co/models/microsoft/DialoGPT-large \
           -H "Authorization":"Bearer hf_***" \
           -H "Content-Type":"application/json" \
           -X POST \
             -d '{
          "inputs": [
            {
              "role": "system",
              "content": "You are a helpful assistant."
            },
           {
              "role": "user",
              "content": "What is deep learning?"
            }
          ],
          "parameters": {"max_new_tokens": 20}
        }'
[{"generated_text":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What is deep learning?"},{"role":"assistant","content":"It's a computer science term."}]}]

In the Python client

You can have a look at the following PRs to check how it was implemented in the Python client:

Please let me know if you have any remaining questions!

Kardbord commented 4 months ago

Very much appreciated, thank you for all your help! This should get me back in business.