microsoft / kernel-memory

RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.
https://microsoft.github.io/kernel-memory
MIT License

Workaround for HuggingFace streaming API #484

Closed dluc closed 1 month ago

dluc commented 1 month ago

Motivation and Context (Why the change? What's the scenario?)

The HuggingFace API doesn't support top_p == 0 and doesn't correctly stop generating tokens after max_tokens is reached.

See https://github.com/microsoft/kernel-memory/issues/388
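One client-side workaround for the top_p limitation is to clamp the value into the open interval the server accepts before sending the request. A minimal sketch (the function name and the epsilon value are illustrative, not from the kernel-memory codebase):

```python
def clamp_top_p(top_p: float, eps: float = 1e-3) -> float:
    """Keep top_p strictly inside (0, 1), since the HF endpoint
    rejects the boundary values. eps is an arbitrary illustrative margin."""
    if top_p <= 0:
        return eps
    if top_p >= 1:
        return 1 - eps
    return top_p
```

Values already inside the valid range pass through unchanged, so deterministic-style settings like top_p = 0 degrade to "almost deterministic" rather than failing the request.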

High level description (Approach, Design)

TODO:

- Investigate whether there's a workaround that doesn't depend on the tokenizer.
- Consider an option for HF to automatically keep top_p in the range 0.0 < top_p < 1.0.
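For the max_tokens problem, a tokenizer-dependent workaround is to count tokens on the client while consuming the stream and cut it off once the limit is reached. A minimal sketch of the idea (names are illustrative; the real client would count tokens via the model's tokenizer rather than receiving pre-split tokens):

```python
from typing import Iterable, Iterator


def truncate_stream(token_stream: Iterable[str], max_tokens: int) -> Iterator[str]:
    """Yield at most max_tokens items from a streaming response,
    stopping early since the server keeps generating past the limit."""
    count = 0
    for token in token_stream:
        if count >= max_tokens:
            break  # server didn't stop; enforce the limit client-side
        yield token
        count += 1
```

This is why the first TODO above matters: the cut-off is only accurate if the client tokenizes the same way the server does, so a fix on the server (or a tokenizer-free signal in the stream) would be preferable.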

dluc commented 1 month ago

Update: issue reported to HF project here https://github.com/huggingface/text-generation-inference/issues/1896

dluc commented 1 month ago

Deferred; let's see whether HF or Azure can fix either the server or the client.