turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

[Bug]: Sampling fails when temperature is 0 #226

Open kogolobo opened 1 year ago

kogolobo commented 1 year ago

This line in generator.py yields infinite logits when temperature is set to 0: https://github.com/turboderp/exllama/blob/c16cf49c3f19e887da31d671a713619c8626484e/generator.py#L106C1-L106C30

Debugger result: (screenshot not reproduced here)
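For reference, a minimal standalone sketch of the failure mode (this is an illustration of the same division, not the actual generator.py code):

```python
import torch

# Dividing logits by temperature = 0 sends every finite logit to +/-inf,
# and the subsequent softmax turns the whole distribution into NaN,
# so a later torch.multinomial call on these probabilities fails.
logits = torch.tensor([2.0, -1.0, 0.5])
temperature = 0.0

scaled = logits / temperature            # tensor([inf, -inf, inf])
probs = torch.softmax(scaled, dim=-1)    # tensor([nan, nan, nan])
print(scaled, probs)
```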

turboderp commented 1 year ago

Temperature = 0 is an invalid argument the way temperature is defined here. I don't know if other implementations treat this as a special case or not, but the only sensible interpretation I can think of is that temperature = 0 should be equivalent to top-k = 1.

For the sake of numerical stability, a robust "fix" would also have to account for instability as the temperature approaches zero, not just at exactly zero. What's the desired behavior here? Should any temperature below some small threshold just trigger greedy sampling?
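A rough sketch of what that threshold behavior could look like (a hypothetical helper for a single 1-D logits vector, not the existing generator.py code; the 1e-8 threshold is an arbitrary placeholder):

```python
import torch

def apply_temperature(logits: torch.Tensor, temperature: float,
                      eps: float = 1e-8) -> torch.Tensor:
    """Scale a 1-D logits vector by temperature, falling back to greedy
    decoding when the temperature is below a small threshold."""
    if temperature < eps:
        # Mask everything except the argmax token; sampling from the
        # resulting distribution is then equivalent to top-k = 1 (greedy).
        greedy = torch.full_like(logits, float("-inf"))
        greedy[logits.argmax()] = 0.0
        return greedy
    return logits / temperature
```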

kogolobo commented 1 year ago

I believe temp = 0 should be equivalent to greedy decoding (as you mentioned, the same as top_k = 1). I really like your suggestion of selecting a small threshold that takes numerical instability into account 😄

turboderp commented 1 year ago

But what's the typical behavior in other implementations? If I'm overriding undefined behavior arbitrarily anyway, I'd want to be as unsurprising as possible.

kogolobo commented 1 year ago

Good question.

It seems that HF decides whether to do greedy decoding by other means and does not even look at the temperature setting.

vLLM, however, does something similar to your suggestion: it compares the temperature against a small constant and replaces it with 1.0 if it falls below that constant.

llama-cpp-python, however, only checks for exact equality with 0 and does greedy decoding in that case.
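For comparison, here is a rough sketch of the vLLM-style handling described above (hypothetical code paraphrasing that behavior, not vLLM's actual source; the epsilon value and the greedy flag are assumptions of this sketch):

```python
_SAMPLING_EPS = 1e-5  # placeholder threshold for illustration

def normalize_temperature(temperature: float) -> tuple[float, bool]:
    """Return (effective_temperature, is_greedy)."""
    # Treat a near-zero temperature as a request for greedy decoding and
    # substitute 1.0 so that a later logits / temperature never divides by zero.
    if temperature < _SAMPLING_EPS:
        return 1.0, True
    return temperature, False
```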