huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

"UnknownError: A vlid user token is required" #747

Closed kmaurinjones closed 1 year ago

kmaurinjones commented 1 year ago


I have a HF Inference Endpoint that I'm trying to use for a variety of purposes, and I keep running into this same error: "UnknownError: A valid user token is required". I believe I've isolated the cause to be the HuggingFace Text Generation library, even with just the following code:

```python
from text_generation import Client

endpoint_url = "https://YOUR_ENDPOINT.endpoints.huggingface.cloud"  # of course I used my real inference endpoint (which is a Flan-T5-Large model for the text generation task)

client = Client(endpoint_url)
text = client.generate("Why is the sky blue?").generated_text
print(text)
# ' Rayleigh scattering'

# Token Streaming
text = ""
for response in client.generate_stream("Why is the sky blue?"):
    if not response.token.special:
        text += response.token.text

print(text)
# ' Rayleigh scattering'
```

```
UnknownError: A valid user token is required
```

### Information

- [ ] Docker
- [ ] The CLI directly

### Tasks

- [ ] An officially supported command
- [ ] My own modifications

### Reproduction

See above code

### Expected behavior

I would like it to work as intended.

Narsil commented 1 year ago

Hi @kmaurinjones ,

You are missing the authorization token needed to access your inference endpoint.

```python
client = Client(endpoint_url, headers={"Authorization": f"Bearer {TOKEN}"})
```

Should solve the issue.
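For example, a minimal end-to-end sketch, assuming the token is read from an environment variable (the variable name `HF_API_TOKEN` is just an example):

```python
import os
from text_generation import Client

endpoint_url = "https://YOUR_ENDPOINT.endpoints.huggingface.cloud"  # your endpoint URL
token = os.environ["HF_API_TOKEN"]  # example variable name; use wherever you store your token

client = Client(endpoint_url, headers={"Authorization": f"Bearer {token}"})
print(client.generate("Why is the sky blue?", max_new_tokens=32).generated_text)
```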

@philschmid in case there's an easier solution.

Closing since I think this should be the solution; please comment if that doesn't fix it.

kmaurinjones commented 1 year ago

@Narsil Thanks very much, this did solve the problem. I'm having another, similar issue trying to use either T2TGen or TGen Inference via an HF Inference Endpoint in a LangChain QA chain. With the free Inference Endpoint, everything works exactly as it should and there is effectively no token limit while using the LLM (which is the point of me using LangChain). Conversely, when I try to use the paid Inference Endpoint, there is for some reason a token limit and things don't seem to be processed by LangChain in the same way.

Code and error are below. Would be very grateful if you could provide any insight on this one. Have tried consulting documentation and forums and nothing has solved the issue.

```python
# imports for the snippet (exact module paths may differ across langchain versions)
from langchain.llms import HuggingFaceHub, HuggingFaceEndpoint
from langchain.chains.question_answering import load_qa_chain

# textsplitter(), faiss vectordb, etc. ... all work fine

llm = HuggingFaceHub(repo_id="google/flan-t5-large", model_kwargs={"temperature": 0, "max_length": 512})  # <------ THIS WORKS AND HAS NO TOKEN LIMITATION
llm = HuggingFaceEndpoint(endpoint_url=endpoint_url, model_kwargs={"temperature": 0, "max_length": 512})  # <------ THIS HAS A TOKEN LIMITATION
chain = load_qa_chain(llm, chain_type="stuff", verbose=False)

# rest of chain, which works fine
```

```
ValueError: Error raised by inference API: Input validation error: `inputs` must have less than 1024 tokens. Given: 1416
```
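
For reference, a quick way to sanity-check how many tokens a chunk plus the query actually amounts to before it reaches the endpoint (a sketch assuming the `transformers` package is installed; `context_chunk` and `query` are placeholder values):

```python
from transformers import AutoTokenizer

# Placeholders: in practice, the chunk text returned by the vector store and the question.
context_chunk = "..."
query = "Why is the sky blue?"

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
prompt = context_chunk + "\n\n" + query
print(len(tokenizer(prompt).input_ids))  # compare against the endpoint's Max Input Length (1024)
```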
Narsil commented 1 year ago

I think you should be able to adjust these settings in your Inference Endpoints deployment to adapt the number of tokens, etc. I believe everything is under Advanced options once you choose text-generation-inference.

@philschmid

kmaurinjones commented 1 year ago

@Narsil @philschmid

The token limits are the same for both the free endpoint and the paid endpoint, and the chunk size I set for the splits in LangChain is well within that limit. Each text chunk is < 512 tokens, even including the query I pass to the model (so query + chunk text is still < 512). The paid endpoint has Max Input Length = 1024 and Max Number of Tokens = 1512, so the chunk chosen by LangChain should always fit into both models. That's why I'm confused that the free endpoint doesn't run into a token-limit error while the paid endpoint does. I've used the same text and everything else for testing, and when this is the only difference in the pipeline, I get the token-limit error above.

I could increase the max number of input tokens on the paid endpoint, but that would only solve this for this particular document; many documents are much larger. So it seems like the LangChain chain doesn't play as nicely with the HuggingFaceEndpoint() class as it does with HuggingFaceHub(), and I'm not sure why.
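
One way to see exactly what the chain sends to the endpoint is to re-run it with `verbose=True`, which prints the fully formatted prompt before the call is made (a sketch reusing the `llm` from the snippet above; `docs` and `query` stand in for the retrieved chunks and the question):

```python
from langchain.chains.question_answering import load_qa_chain

# docs: the Document objects returned by the vector store; query: the user's question.
chain = load_qa_chain(llm, chain_type="stuff", verbose=True)
answer = chain.run(input_documents=docs, question=query)
print(answer)
```

If the printed prompt turns out to contain more than one retrieved chunk (the "stuff" chain type concatenates all the documents it is given into a single prompt), that could explain a 1416-token input even when each individual chunk is under 512 tokens.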

philschmid commented 1 year ago

As @Narsil said, you can change those when creating your endpoint, in the Advanced section. [screenshot of the Advanced configuration options]

kmaurinjones commented 1 year ago

But that's exactly my point: unless I'm misunderstanding something, the chunk returned by the vector db and passed to the model should be far fewer tokens than the max model input (1024), whether I use the flan-t5-large model from the free endpoint or from the paid endpoint (the exact same model in both). It works fine when I use the free endpoint with the HuggingFaceHub() object as the llm in the chain, but when I change the llm to the HuggingFaceEndpoint() object, I get the error above.