turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Codellama support #260

Open lucasjinreal opened 10 months ago

lucasjinreal commented 10 months ago

exllama/model.py", line 45, in __init__
    self.pad_token_id = read_config["pad_token_id"]
KeyError: 'pad_token_id'

dred0n commented 10 months ago

Just add this to the config.json

"pad_token_id": 0,
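As a convenience, the fix above can also be applied with a small script. The sketch below is not part of exllama; the helper name and the path you pass in are illustrative, and it assumes the standard Hugging Face `config.json` layout (a flat JSON object).

```python
import json

def patch_pad_token_id(config_path, pad_token_id=0):
    """Add a pad_token_id to a model's config.json if it is missing.

    config_path points at the config.json inside your local model
    directory (e.g. the one downloaded from TheBloke's HF repo).
    """
    with open(config_path) as f:
        config = json.load(f)
    # Only add the key if it is absent; leave an existing value alone.
    config.setdefault("pad_token_id", pad_token_id)
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
```

Run it once against your model directory, then load the model in exllama again.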

ShahZ181 commented 10 months ago

Just add this to the config.json

"pad_token_id": 0,

Where is the config.json?

pan324 commented 10 months ago

It's the config.json that should be part of your files: https://huggingface.co/TheBloke/CodeLlama-13B-Python-GPTQ/tree/main

So did anyone manage to get coherent sentences out of the model yet? It barely acknowledges my questions.

ShahZ181 commented 10 months ago

So did anyone manage to get coherent sentences out of the model yet? It barely acknowledges my questions.

I have tried Phind-CodeLlama-34B with example_chatbot.py and the output is really bad; it repeats words endlessly. I have read that people have gotten it to work, so maybe it's an exllama issue, I don't know. I am new to all of this.

I also tried the new WizardCoder-Python-34B but it gives me this error:

with safe_open(self.config.model_path, framework = "pt", device = "cpu") as f:
safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

dred0n commented 10 months ago

WizardCoder-Python-34B works well for me. All the other TheBloke models seem defective.

ShahZ181 commented 10 months ago

I also tried the new WizardCoder-Python-34B but it gives me this error:

with safe_open(self.config.model_path, framework = "pt", device = "cpu") as f:
safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

I fixed this issue by deleting the model and downloading it again. And I can confirm WizardCoder-Python is the only one that works well for me so far.
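The MetadataIncompleteBuffer error above is what a truncated download typically produces, since a .safetensors file begins with an 8-byte little-endian header length followed by that many bytes of JSON metadata. The sketch below is not part of exllama or the safetensors library; it is a hypothetical standalone check you could run before re-downloading, assuming that standard file layout.

```python
import json
import struct

def safetensors_header_ok(path):
    """Return True if the file's safetensors header parses cleanly.

    Layout assumed: 8-byte little-endian unsigned length N, then N
    bytes of JSON metadata. A download truncated inside the header
    fails here, much like safetensors' MetadataIncompleteBuffer.
    """
    try:
        with open(path, "rb") as f:
            raw = f.read(8)
            if len(raw) < 8:
                return False  # file too short to even hold the length
            (n,) = struct.unpack("<Q", raw)
            header = f.read(n)
            if len(header) < n:
                return False  # file truncated mid-header
            json.loads(header)  # metadata must be valid JSON
            return True
    except (OSError, ValueError):
        return False
```

If this returns False, deleting the file and downloading it again (as above) is the straightforward fix.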

lucasjinreal commented 10 months ago

@dred0n I think you were right, especially about these quantized models (the problem might mainly be caused by quantization).

Did you test WizardCoder-34B quantized with exllama?

lucasjinreal commented 10 months ago

@dred0n Hi, can you share your WizardCoder-34B quantized model? Is it GPTQ?

dred0n commented 10 months ago

@lucasjinreal Yes, it works well. I'm using TheBloke's WizardCoder-34B and the results are the same as the demo WizardLM put up.

lucasjinreal commented 10 months ago

@dred0n How about the quantized model? Which inference framework did you use: exllama, llama.cpp, or HF transformers?