turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Got error when converting Llama2 70b #156

Closed GolemXlV closed 9 months ago

GolemXlV commented 9 months ago

Hi, when trying to convert Llama 2 70B with a custom calibration dataset on an RTX 3090 (24 GB), the conversion fails with an error:

$ python convert.py \
    -i /models/llama2-70b-chat-hf \
    -o /models/temp/exl2 \
    -c /data/reddit_finance_43_250k.parquet \
    -cf /models/output/llama2-70b-chat-hf-exl2/2.5bpw/ \
    -b 2.5
 -- Resuming job
 !! Note: Overriding options with settings from existing job
 -- Input: /models/llama2-70b-chat-hf
 -- Output: /models/temp/exl2
 -- Calibration dataset: /data/reddit_finance_43_250k.parquet, 100 / 16 (0) rows, 2048 tokens per sample
 -- Target bits per weight: 2.5 (decoder), 6 (head)
 -- Max shard size: 8192 MB
 -- Full model will be compiled to: ./models/output/llama2-70b-chat-hf-exl2/2.5bpw/
 -- Quantizing...
 -- Layer: model.layers.66 (Attention)
Traceback (most recent call last):
  File "/exllama2/convert.py", line 280, in <module>
    quant(job, save_job, model)
  File "/exllamav2/conversion/quantize.py", line 506, in quant
    with safe_open(in_name, framework = "pt", device = "cpu" if page_rows else "cuda:0") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooSmall

I tried re-downloading the model, but it didn't help =( The model folder structure (sizes in bytes):

614 ./models/llama2-70b-chat-hf/config.json
188 ./models/llama2-70b-chat-hf/generation_config.json
9852591960 ./models/llama2-70b-chat-hf/model-00001-of-00015.safetensors
9798099016 ./models/llama2-70b-chat-hf/model-00002-of-00015.safetensors
9965870512 ./models/llama2-70b-chat-hf/model-00003-of-00015.safetensors
9798066064 ./models/llama2-70b-chat-hf/model-00004-of-00015.safetensors
9798099064 ./models/llama2-70b-chat-hf/model-00005-of-00015.safetensors
9798099056 ./models/llama2-70b-chat-hf/model-00006-of-00015.safetensors
9965870512 ./models/llama2-70b-chat-hf/model-00007-of-00015.safetensors
9798066064 ./models/llama2-70b-chat-hf/model-00008-of-00015.safetensors
9798099064 ./models/llama2-70b-chat-hf/model-00009-of-00015.safetensors
9798099056 ./models/llama2-70b-chat-hf/model-00010-of-00015.safetensors
9965870512 ./models/llama2-70b-chat-hf/model-00011-of-00015.safetensors
9798066064 ./models/llama2-70b-chat-hf/model-00012-of-00015.safetensors
9798099064 ./models/llama2-70b-chat-hf/model-00013-of-00015.safetensors
9496124816 ./models/llama2-70b-chat-hf/model-00014-of-00015.safetensors
524288128 ./models/llama2-70b-chat-hf/model-00015-of-00015.safetensors
1618 ./models/llama2-70b-chat-hf/tokenizer_config.json
1842767 ./models/llama2-70b-chat-hf/tokenizer.json
499723 ./models/llama2-70b-chat-hf/tokenizer.model
turboderp commented 9 months ago

The setup looks okay, and the fact that it failed on layer 66 suggests it's quantizing alright. It does look like the job was corrupted, though, since it's failing to load the hidden state checkpoint. My guess would be that there's an input_states.safetensors file in the work directory that has a size of zero, or something along those lines. Hard to say why this happened, though. Maybe you're low on disk space or system memory?
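A quick way to check, as a minimal sketch (assuming the work directory is the -o path from the command above):

# /models/temp/exl2 is the -o work directory from the original command
$ find /models/temp/exl2 -maxdepth 1 -name '*.safetensors' -size 0 -print

Any file this prints is empty and will fail to deserialize with HeaderTooSmall, since a safetensors file begins with an 8-byte header-length field that an empty file can't provide.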

Sadly, if that is the case, I'm not sure there's a way to recover the job. It looks like you might need to start over. You should be able to save the measurement.json file from the work directory, though. Pass it to the quantizer with -m along with an empty work directory (-o) and you can at least skip the measurement step.
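For example, reusing the paths from the original command (a sketch only; the fresh work directory and the saved measurement.json location below are placeholders to adjust):

# /models/temp/exl2_fresh and /models/backup/measurement.json are placeholders:
# use any empty work directory and wherever you copied measurement.json to.
$ python convert.py \
    -i /models/llama2-70b-chat-hf \
    -o /models/temp/exl2_fresh \
    -c /data/reddit_finance_43_250k.parquet \
    -cf /models/output/llama2-70b-chat-hf-exl2/2.5bpw/ \
    -b 2.5 \
    -m /models/backup/measurement.json

With -m supplied, the quantizer skips straight to the quantization pass instead of redoing the costly measurement pass.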

GolemXlV commented 9 months ago

Yeah, you're right, thanks. input_states.safetensors in the work directory is indeed zero bytes:

1638488 models/temp/exl2/cal_data.safetensors
0 models/temp/exl2/input_states.safetensors
8860825 models/temp/exl2/job.json
8221517 models/temp/exl2/measurement.json
36864 models/temp/exl2/out_tensor

I added measurement.json and the job finally finished successfully. I'm not sure what really happened, but it looks like the problem was on my side.

Awesome work, by the way!