turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Infinities during model evaluation #207

Closed · 50h100a closed this issue 1 year ago

50h100a commented 1 year ago

I'm noticing infinities appearing in the hidden state during decoding (the MLP is the immediate culprit, even while unfused). This unsurprisingly makes the probs very very unhappy.

Any advice on potential causes, or how to debug this?

turboderp commented 1 year ago

That's not a lot to go on. Any other details, like what model you're using, etc.?

50h100a commented 1 year ago

Ah, sorry. I was testing with https://huggingface.co/TheBloke/Chronos-Hermes-13B-SuperHOT-8K-GPTQ, with args -l 8192 -a 4. I've verified the hash of the safetensors file. I am calling into ExLlama from a custom Python script, though; could that be a source of issues?

Prompting is relatively simple: I start with "This is a roleplay between Alice and Bob.\nAlice is trying to get help from Bob on her science homework.\nThe roleplay between Alice and Bob begins.\n### Response:\nAlice: ", then alternate between "\n### Response:\nBob: " and "\n### Response:\nAlice: ". I usually hit infinities after only a few generations, though the output is... comprehensible until then.

I'm trying to get some intuition for how these kinds of failures creep in, and methods to debug them.

I've noticed that hidden_state often develops max values of several thousand during decoding; is that expected? I would have expected much smaller ranges, especially if there's only 4 bits of parameter weight to play with.
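
For concreteness, the driving script is roughly along these lines (a simplified sketch; the path, sampling settings, and exact attribute names are from memory, so treat them as placeholders):

```python
import os, glob

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_directory = "/path/to/Chronos-Hermes-13B-SuperHOT-8K-GPTQ/"   # placeholder path
tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

config = ExLlamaConfig(model_config_path)
config.model_path = model_path
config.max_seq_len = 8192        # -l 8192
config.alpha_value = 4.0         # -a 4

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

prompt = ("This is a roleplay between Alice and Bob.\n"
          "Alice is trying to get help from Bob on her science homework.\n"
          "The roleplay between Alice and Bob begins.\n"
          "### Response:\nAlice: ")

# Alternate "\n### Response:\nBob: " / "\n### Response:\nAlice: " turns,
# appending each completion back onto the prompt before the next call.
text = generator.generate_simple(prompt, max_new_tokens=200)
print(text)
```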

turboderp commented 1 year ago

Well, it's always difficult to debug. The hidden state is repeatedly normalized, so if it grows out of control I would suspect some problem with the weights or with how the model is being used. In your case you're using -a 4 (NTK alpha scaling), which isn't correct for SuperHOT; SuperHOT models are finetuned for linear RoPE scaling, so you'll want -cpe 4 instead for this particular model.
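
If you're setting it up from your own script rather than through the CLI flags, the equivalent would be roughly this (a sketch; attribute names as I recall them from model.py):

```python
config = ExLlamaConfig(model_config_path)
config.model_path = model_path
config.max_seq_len = 8192          # -l 8192
config.compress_pos_emb = 4.0      # -cpe 4: linear RoPE scaling, what SuperHOT is tuned for
# config.alpha_value = 4.0         # -a 4: NTK alpha scaling -- not what SuperHOT expects
```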

That's probably the issue here, but as for how you'd go about detecting that... yeah, it's a black box at the end of the day. You can compare against a reference, say the unquantized version of the model running in HF Transformers, and try to find where the hidden state deviates more than quantization alone would explain. But that'll only help you find a bug in the implementation; it won't highlight a problem with the model itself or with how it's being used.
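
As a rough sketch of what that comparison could look like on the FP16 reference in Transformers (the model id is a placeholder; the hooks assume a LlamaForCausalLM layout):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "/path/to/unquantized-base-model"   # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

stats = []

def make_hook(idx):
    def hook(module, inputs, output):
        hs = output[0] if isinstance(output, tuple) else output
        stats.append((idx, hs.abs().max().item()))   # record per-layer max |hidden state|
    return hook

for i, layer in enumerate(model.model.layers):
    layer.register_forward_hook(make_hook(i))

ids = tok("This is a roleplay between Alice and Bob.", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    model(ids)

for idx, m in stats:
    print(f"layer {idx:2d}  max |h| = {m:.1f}")
```

Run the same prompt through the quantized model, dump the same per-layer maxima, and look for the first layer where the two runs diverge wildly.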

50h100a commented 1 year ago

Well, nuts.

Unfortunately, switching to -cpe 4 didn't fix it; I'm still running into the same issue, complete with hidden_state values climbing into the 7000s.

turboderp commented 1 year ago

Based on this I would suspect there's something wrong with how you're using the implementation. Maybe you're not resetting the cache between each call to model.forward() or something? There's an endless list of possibilities, really. Do you have a complete, minimal example to reproduce the behavior?
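
In case it is the cache: when calling forward() directly, the cache has to be rewound before each fresh prompt, roughly like this (a sketch; ExLlamaGenerator.gen_begin() does the equivalent internally):

```python
# Rewind the cache before re-feeding a new prompt from scratch.
cache.current_seq_len = 0

ids = tokenizer.encode(prompt)
logits = model.forward(ids, cache)   # prefill; the cache now holds the prompt
# ...then feed one token at a time with the same cache, without resetting mid-generation.
```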

50h100a commented 1 year ago

I come bearing good, but also potentially much worse, news.

I've tried several different projects, including ones that run at FP16 precision, and it seems that some models on HF just... spit out infinities sometimes. I think it may be something to do with the LoRA process, but that's a complete guess on my part.

It's either a problem with the models, or at least three separate projects (using GPTQ, GGML, and FP16 respectively) all have a bug with the same symptoms.

It's also possible that I'm just extremely bad at everything, but I would like to think I wouldn't have made the exact same error in three very different setups.
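
One thing that might be worth doing is scanning the checkpoints themselves for non-finite values, to at least separate "bad weights on disk" from "the runtime blowing up" (a sketch; the path is a placeholder):

```python
import torch
from safetensors.torch import load_file

state = load_file("/path/to/model.safetensors")   # placeholder; use the FP16 checkpoint
for name, tensor in state.items():
    # GPTQ tensors are mostly packed ints, so this check only means anything for float tensors.
    if tensor.is_floating_point() and not torch.isfinite(tensor).all():
        print(f"non-finite values in {name}")
```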

turboderp commented 1 year ago

Well, language models are black boxes. There's really nothing constraining the values in the hidden state other than the normalization applied along the forward pass. In some years it may be possible to use a much larger language model to interpret the hidden state of a smaller one, but for now that's kind of unrealistic.

And lots of stuff can go wrong during finetuning, especially with how confused some people are about the whole process. If a model shows the same behavior in quantized and full-precision versions, on different implementations, I'd say it's probably just a bad model.

turboderp commented 1 year ago

Closing this for now unless there's some indication that the behavior is down to a bug in the implementation rather than the model weights.