Closed: kmaurinjones closed this issue 1 year ago
Hi @kmaurinjones ,
You are missing the authorization token needed to access your Inference Endpoint.
client = Client(endpoint_url, headers={"Authorization": f"Bearer {TOKEN}"})
Should solve the issue.
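For completeness, a minimal sketch of the fix (the endpoint URL and token below are placeholders, not values from this thread):

```python
# Minimal sketch of the fix above. The endpoint URL and token are
# placeholders; substitute your own Inference Endpoint values.
endpoint_url = "https://your-endpoint.endpoints.huggingface.cloud"
TOKEN = "hf_xxx"  # your Hugging Face access token

# The missing piece was the Authorization header:
headers = {"Authorization": f"Bearer {TOKEN}"}

# With the text_generation library installed, the client is then built as:
# from text_generation import Client
# client = Client(endpoint_url, headers=headers)
# print(client.generate("Hello").generated_text)
```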
@philschmid in case there's an easier solution.
Closing since I think this should be the solution; please comment if that doesn't fix it.
@Narsil Thanks very much. This did solve the problem. I'm having another, similar issue when trying to use either T2TGen or TGen inference via an HF Inference Endpoint in a LangChain QA chain. With the free Inference Endpoint, everything works exactly as it should and there is effectively no token limit while using the LLM (which is the point of me using LangChain). Conversely, when I use the paid Inference Endpoint, there is for some reason a token limit, and things don't seem to be processed by LangChain in the same way.
Code and error are below. Would be very grateful if you could provide any insight on this one. Have tried consulting documentation and forums and nothing has solved the issue.
### textsplitter(), faiss vectordb, etc, ... all work fine
llm = HuggingFaceHub(repo_id = "google/flan-t5-large", model_kwargs = {"temperature": 0, "max_length": 512}) # <------ THIS WORKS AND HAS NO TOKEN LIMITATION
llm = HuggingFaceEndpoint(endpoint_url = endpoint_url, model_kwargs = {"temperature": 0, "max_length": 512}) # <------ THIS HAS A TOKEN LIMITATION
chain = load_qa_chain(llm, chain_type = "stuff", verbose = False)
### rest of chain, which works fine
>>> ValueError: Error raised by inference API: Input validation error: `inputs` must have less than 1024 tokens. Given: 1416
I think you should be able to adjust your deployment settings in HFE to adapt the number of tokens etc. Everything is under Advanced options once you choose text-generation-inference.
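For reference, a hedged sketch of how those Advanced options map to text-generation-inference launcher flags (the values here are illustrative, not taken from this deployment):

```shell
# Illustrative only: the token limits in the Inference Endpoints Advanced
# options correspond to text-generation-inference launcher flags like these.
text-generation-launcher \
  --model-id google/flan-t5-large \
  --max-input-length 1024 \
  --max-total-tokens 1512
```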
@philschmid
@Narsil @philschmid
The token limits are the same for both the free endpoint and the paid endpoint, and the chunk sizes I use in LangChain are well within them. Each text chunk is < 512 tokens, even including the query I pass to the model (so query + chunk text is still < 512). The paid endpoint has Max Input Length = 1024 and Max Number of Tokens = 1512, so the chosen chunk from LangChain should always fit into both models. That's why I'm confused that specifically the paid endpoint runs into a token limit error while the free endpoint doesn't. I've used the same text and everything else for testing, and when this is the only difference in the pipeline, I get the above token limit error.
I could increase the max number of input tokens on the paid endpoint, but that would only solve the problem for this particular document; many others are much larger. So it seems to me like the LangChain chain doesn't play nice with the HuggingFaceEndpoint() class the way it does with HuggingFaceHub(), and I'm not sure why.
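One possible explanation worth checking (an assumption on my part, not something confirmed in this thread): with chain_type="stuff", LangChain concatenates all retrieved chunks into a single prompt, so the total input can exceed the limit even when every individual chunk is under it. A toy sketch with made-up numbers:

```python
# Toy sketch (made-up numbers): how a "stuff" chain can exceed a 1024-token
# input limit even though every retrieved chunk is individually < 512 tokens.
chunk_token_counts = [480, 470, 460]  # hypothetical sizes of retrieved chunks
query_tokens = 30                     # hypothetical size of the question

# chain_type="stuff" puts ALL retrieved chunks plus the query in one prompt.
prompt_tokens = sum(chunk_token_counts) + query_tokens
print(prompt_tokens)  # 1440 > 1024, which would trigger the validation error
```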
As @Narsil said, you can change those limits when creating your endpoint, in the Advanced section.
But that's exactly my point: unless I'm misunderstanding something, the chunk returned by the vector DB and then passed to the model should be far fewer tokens than the max model input (1024), whether I use the flan-t5-large model from the free endpoint or the paid endpoint (the exact same model on both). It works fine when I use the free endpoint with the HuggingFaceHub() object as the llm in the chain, but when I change the llm to the HuggingFaceEndpoint() object, I get the error above.