turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

finetuned Llama-2-7B-32K-Instruct-GPTQ only returns '\n' #298

Closed Napuh closed 1 month ago

Napuh commented 9 months ago

Hello!

I've been experiencing a problem in exllama (both v1 and v2) with this particular model: the model only outputs '\n' when run through exllama.

I've come across this problem in two different ways:

1. Finetuning togethercomputer/Llama-2-7B-32K-Instruct

First, I finetune the model using axolotl, an easy-to-use wrapper around Hugging Face's Trainer class. The finetuning is done on a specific dataset in Alpaca format, on 1xA100. Each instruction is roughly 8k tokens, with some of them being 12-16k tokens.

Then, I take the adapter model and merge it into the base model.
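The merge step looks roughly like this (a minimal sketch using peft; the adapter and output paths are just placeholders for my actual ones):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model the adapter was trained on.
base = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/Llama-2-7B-32K-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Apply the LoRA adapter and fold its weights into the base model.
merged = PeftModel.from_pretrained(base, "./adapter").merge_and_unload()
merged.save_pretrained("./merged-model")

# Save the tokenizer alongside the merged weights.
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-2-7B-32K-Instruct")
tokenizer.save_pretrained("./merged-model")
```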

As an intermediate step, I run a custom evaluation on a specific benchmark tailored to the task I'm trying to solve. This evaluation uses the transformers library. Accuracy is high, around 83%.
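The transformers side of that evaluation is just plain generation on the merged model, roughly like this (the prompt and paths are placeholders, not the actual benchmark):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./merged-model", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./merged-model")

# Alpaca-style prompt; the real benchmark prompts are much longer (8k-16k tokens).
prompt = "### Instruction:\n...\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```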

Then, since I want to use exllama for the speed and memory gains, especially with long contexts, I perform GPTQ quantization using AutoGPTQ's sample script: 4 bits, 128g and act_order=True.
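Roughly, the quantization step with those settings looks like this (a sketch of the AutoGPTQ API; the calibration data below is a placeholder and paths are illustrative):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_dir = "./merged-model"
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization
    group_size=128,  # 128g
    desc_act=True,   # act_order=True
)

model = AutoGPTQForCausalLM.from_pretrained(model_dir, quantize_config)

# Calibration examples: tokenized text samples (placeholder content here).
examples = [tokenizer("This is a calibration sample.", return_tensors="pt")]

model.quantize(examples)
model.save_quantized("./merged-model-GPTQ", use_safetensors=True)
```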

Then, I run the same benchmark adapted to use exllama (v1 and v2) as the inference engine, and the model only spits out '\n'. The example inference scripts give the same output.
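For reference, the exllama v1 side is set up roughly as in the repo's example scripts; paths are illustrative, and compress_pos_emb = 8 is my assumption for the model's linear RoPE scaling (32K context on a 4K base), which may well be the wrong knob:

```python
import os, glob

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "./merged-model-GPTQ"
tokenizer_path = os.path.join(model_dir, "tokenizer.model")
model_config_path = os.path.join(model_dir, "config.json")
model_path = glob.glob(os.path.join(model_dir, "*.safetensors"))[0]

config = ExLlamaConfig(model_config_path)
config.model_path = model_path
config.max_seq_len = 32768
config.compress_pos_emb = 8.0  # assumed linear RoPE scaling factor (32768 / 4096)

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

# Same Alpaca-style prompt as in the transformers run (placeholder).
output = generator.generate_simple("### Instruction:\n...\n\n### Response:\n", max_new_tokens=200)
print(output)
```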

2. Finetuning TheBloke/Llama-2-7B-32K-Instruct-GPTQ

Very similar to the previous experiment. I used axolotl to train on top of that GPTQ model, and exllama v1 to run inference with the LoRA adapter. Same results as the previous experiment.
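Continuing from the loading sketch above (but pointing the config at TheBloke's GPTQ checkpoint instead of my own quant), the adapter is attached roughly as in the repo's LoRA example; the adapter paths are illustrative:

```python
import os

from lora import ExLlamaLora

lora_dir = "./lora-adapter"
lora_config_path = os.path.join(lora_dir, "adapter_config.json")
lora_path = os.path.join(lora_dir, "adapter_model.bin")

# "model" and "generator" are built exactly as in the previous sketch.
lora = ExLlamaLora(model, lora_config_path, lora_path)
generator.lora = lora
```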

Similar experiments with meta-llama/llama-2-7b-hf worked as expected, so maybe there is no support for rope scaling.

The only thing that raises my suspicion is a message about the tokenizer that shows up when quantizing the models: "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained."

Has anyone experienced something similar? Is rope scaling supported on exllama?

Napuh commented 1 month ago

Closing as not relevant anymore.