microsoft / kernel-memory

RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.
https://microsoft.github.io/kernel-memory
MIT License

Workaround for HuggingFace streaming API #484

Closed dluc closed 1 month ago

dluc commented 1 month ago

Motivation and Context (Why the change? What's the scenario?)

The HuggingFace API doesn't support top_p == 0 and doesn't correctly stop generating tokens after max_tokens is reached.

See https://github.com/microsoft/kernel-memory/issues/388
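One client-side workaround for the top_p limitation is to clamp the value into the open interval the server accepts before sending the request. A minimal sketch (the function name and the epsilon value are illustrative, not from the kernel-memory codebase):

```python
def clamp_top_p(top_p: float, eps: float = 1e-3) -> float:
    """Keep top_p strictly inside (0, 1), since the HF endpoint
    rejects the boundary values. eps is an arbitrary illustrative margin."""
    if top_p <= 0:
        return eps
    if top_p >= 1:
        return 1 - eps
    return top_p
```

Values already inside the valid range pass through unchanged, so deterministic-style settings like top_p = 0 degrade to "almost deterministic" rather than failing the request.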

High level description (Approach, Design)

TODO:

- Investigate whether there's a workaround that doesn't depend on the tokenizer.
- Consider an option for HF to automatically keep top_p in the range 0.0 < top_p < 1.0.
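For the max_tokens problem, a tokenizer-dependent workaround is to count tokens on the client while consuming the stream and cut it off once the limit is reached. A minimal sketch of the idea (names are illustrative; the real client would count tokens via the model's tokenizer rather than receiving pre-split tokens):

```python
from typing import Iterable, Iterator


def truncate_stream(token_stream: Iterable[str], max_tokens: int) -> Iterator[str]:
    """Yield at most max_tokens items from a streaming response,
    stopping early since the server keeps generating past the limit."""
    count = 0
    for token in token_stream:
        if count >= max_tokens:
            break  # server didn't stop; enforce the limit client-side
        yield token
        count += 1
```

This is why the first TODO above matters: the cut-off is only accurate if the client tokenizes the same way the server does, so a fix on the server (or a tokenizer-free signal in the stream) would be preferable.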

dluc commented 1 month ago

Update: issue reported to HF project here https://github.com/huggingface/text-generation-inference/issues/1896

dluc commented 1 month ago

Deferred; let's see whether HF or Azure can fix either the server or the client.