microsoft / BitNet

Official inference framework for 1-bit LLMs

-n parameter and relevancy #60

Closed alexeyvolkoff closed 3 weeks ago

alexeyvolkoff commented 1 month ago

I've noticed that if the -n parameter is large and the answer is short, the model starts to 'elaborate' on the topic just to produce the requested number of tokens, even though the relevancy of the generated text drops with every word. Is it possible to stop generation at the end of a completed sentence if the token's relevancy falls below some threshold?
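For illustration, here is a minimal sketch of the stopping rule being asked for, using per-token log-probability as a rough proxy for "relevancy". This is not an existing BitNet or llama.cpp option; `sample_next` is a hypothetical decode step standing in for whatever sampler the runtime uses:

```python
import math

def generate_with_relevancy_stop(sample_next, max_tokens=256, min_logprob=math.log(0.05)):
    """Stop at a sentence boundary once token confidence drops below a threshold.

    sample_next: hypothetical function taking the tokens so far and
    returning (token_text, logprob) for the next sampled token.
    """
    out = []
    low_confidence = False
    for _ in range(max_tokens):
        token, logprob = sample_next(out)      # hypothetical decode step
        out.append(token)
        if logprob < min_logprob:              # proxy for "relevancy" dropping
            low_confidence = True
        # only cut at a sentence boundary, so the output stays well-formed
        if low_confidence and token.rstrip().endswith((".", "!", "?")):
            break
    return "".join(out)
```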

Conversely, when it generates quite good text, it just stops in the middle of a sentence once it reaches the -n limit. Why not let it finish the sentence?
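Absent such a feature in the runtime, a simple workaround is to post-process the output: when generation is cut off by the -n cap, trim the text back to the last complete sentence. A small sketch in plain Python:

```python
import re

def trim_to_last_sentence(text: str) -> str:
    # find the last ., !, or ? followed by whitespace or end-of-text,
    # and drop the trailing fragment after it
    matches = list(re.finditer(r"[.!?](?=\s|$)", text))
    return text[: matches[-1].end()] if matches else text

print(trim_to_last_sentence("BitNet uses 1-bit weights. It is fast. It also"))
# -> "BitNet uses 1-bit weights. It is fast."
```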

potassiummmm commented 3 weeks ago

This parameter comes from the original llama.cpp; see this section for details: https://github.com/ggerganov/llama.cpp/tree/master/examples/main#number-of-tokens-to-predict
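Per that linked section, -n (--n-predict) is an upper bound on the number of tokens to generate, not a target: generation may stop earlier at an end-of-sequence token, a value of -1 enables infinite generation, and -2 generates until the context is filled. A minimal usage sketch, assuming BitNet's run_inference.py wrapper as shown in the repository README (the model path here is hypothetical; adjust it to your checkout):

```python
import subprocess

# cap generation at 128 tokens; the model may still stop earlier at EOS
subprocess.run([
    "python", "run_inference.py",
    "-m", "models/bitnet_b1_58-large/ggml-model-i2_s.gguf",  # hypothetical path
    "-p", "What is 1-bit quantization?",
    "-n", "128",
])
```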