https://github.com/ggerganov/llama.cpp/pull/252 changed the model format, and we're not compatible with it yet. Thanks for spotting this - we'll need to expedite the fix.
In the meantime, you can re-quantize the model with a version of llama.cpp
that predates that, or find a quantized model floating around the internet from before then.
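If you're not sure which format a particular .bin file uses, here is a small standalone sketch (not part of llama-rs) that just inspects the leading 4-byte magic number; the two constants are the same ones checked by the loader code later in this thread (0x67676d6c for the old unversioned files, 0x67676d66 for the versioned format introduced by that PR):

use std::{fs::File, io::Read};

// Standalone check of which ggml file format a model uses, based on the
// leading 4-byte magic number (read little-endian).
fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).expect("usage: check-magic <model.bin>");
    let mut magic = [0u8; 4];
    File::open(&path)?.read_exact(&mut magic)?;
    match u32::from_le_bytes(magic) {
        0x6767_6d6c => println!("{path}: old unversioned format ('ggml')"),
        0x6767_6d66 => println!("{path}: versioned format ('ggmf')"),
        other => println!("{path}: unknown magic {other:#x}"),
    }
    Ok(())
}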
Hmm... thanks, I will try to re-quantize the model with a previous version!
Got it! I tried with the previous Alpaca version that I had!
Great! We'll leave this issue open as a reminder that we'll need to update to handle the new format.
A small change to the code is sufficient to handle the new versioned file format:
#[error("file is pre-versioned, generate another please! at {path:?}")]
PreVersioned { path: PathBuf },
#[error("invalid magic number for {path:?}")]
InvalidMagic { path: PathBuf },
#[error("invalid version number for {path:?}")]
InvalidVersion { path: PathBuf },
...
// Verify magic
{
    let magic = read_i32(&mut reader)?;
    // 0x67676d6c spells "ggml": the old, unversioned format.
    if magic == 0x67676d6c {
        return Err(LoadError::PreVersioned {
            path: main_path.to_owned(),
        });
    }
    // 0x67676d66 spells "ggmf": the versioned format introduced by llama.cpp#252.
    if magic != 0x67676d66 {
        return Err(LoadError::InvalidMagic {
            path: main_path.to_owned(),
        });
    }
}
// Verify the version
{
    let format_version = read_i32(&mut reader)?;
    if format_version != 1 {
        return Err(LoadError::InvalidVersion {
            path: main_path.to_owned(),
        });
    }
}
...
// Load vocabulary
let mut vocab = Vocabulary::default();
for i in 0..hparams.n_vocab {
    let len = read_i32(&mut reader)?;
    if let Ok(word) = read_string(&mut reader, len as usize) {
        vocab.mapping.push(word);
    } else {
        load_progress_callback(LoadProgress::BadToken {
            index: i.try_into()?,
        });
        vocab.mapping.push("�".to_string());
    }
    // The score is stored as a 4-byte float, so reinterpret the raw bits
    // instead of numerically casting the integer read.
    let score = f32::from_bits(read_i32(&mut reader)? as u32);
    vocab.score.push(score);
}
It works without issues, but I don't know if it's sufficient; nothing panicked and it did the inference.
I think the change in the binary format was just the inclusion of the version number and the per-token score in the vocabulary section, but I am not sure.
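To make that concrete, here is a rough sketch (not actual llama-rs code) of the versioned layout as implied by the loader snippet above; the names are mine, and the hyperparameter and tensor sections are assumed unchanged:

// Illustrative only; reconstructed from the reads shown above.
struct GgmfHeader {
    magic: i32,   // 0x67676d66 ("ggmf"); older files used 0x67676d6c ("ggml")
    version: i32, // currently 1
}

// Repeated n_vocab times after the hyperparameters.
struct GgmfVocabEntry {
    len: i32,       // byte length of the token string
    token: Vec<u8>, // `len` UTF-8 bytes
    score: f32,     // new in the versioned format
}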
After fixing this bug and adding score: Vec<f32> to the Vocabulary struct (roughly as sketched below), the 7B model works, but 65B does not; it crashes with an allocation error in the ggml library.
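For reference, a minimal sketch of what the updated struct could look like, assuming the existing mapping: Vec<String> field used by the loader above; only score is new:

// Illustrative only; the field set is assumed from the loader code in this thread.
#[derive(Default)]
pub struct Vocabulary {
    /// Token string for each token id.
    pub mapping: Vec<String>,
    /// Per-token score read from the versioned (ggmf) vocabulary section.
    pub score: Vec<f32>,
}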
Related pull request: #61
@mguinhos thanks for the reference; I hadn't even seen the issue. Feel free to modify the PR. I was just running llama-rs for the first time and ran into this issue, so I figured it'd be best to share the small fixes.
Can probably close this issue now?
I'm using the current main branch of llama-rs. I got the 7B model, used the Python script to convert it, then quantized it with the latest commit of llama.cpp, and I'm getting this error:
thread 'main' panicked at 'Could not load model: InvalidMagic { path: "LLaMA/7B/ggml-model-q4_0.bin" }', llama-cli/src/main.rs:206:6
llama.cpp works with the same model:
❯ ./main -m LLaMA/7B/ggml-model-q4_0.bin -p "test"
main: seed = 1680698608
llama_model_load: loading model from 'LLaMA/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from '/home/wojtek/special_downloads/LLaMA/7B/ggml-model-q4_0.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 12 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0
test_suite = True
# The suite of tests to see running.
test_suite( 'TestSuite' ) [end of text]
llama_print_timings: load time = 1165.30 ms
llama_print_timings: sample time = 14.24 ms / 27 runs ( 0.53 ms per run)
llama_print_timings: prompt eval time = 763.14 ms / 2 tokens ( 381.57 ms per token)
llama_print_timings: eval time = 4288.87 ms / 26 runs ( 164.96 ms per run)
llama_print_timings: total time = 5468.84 ms
EDIT: I managed to get it to work by reconverting and requantizing the model with llama.cpp commit 5cb63e2, from before the format change. Since I'm using the current main of llama-rs, it should work with the newer format, but I'm getting the InvalidMagic error above.
The model successfully runs on llama.cpp but not in llama-rs.
Command: