The problem happened with the code below. It turns out the file didn't include the `general.quantization_version` metadata. When llama.cpp reads a file without that key, it assumes version 2 (grep for the line `gguf_set_val_u32(ctx_out, "general.quantization_version", GGML_QNT_VERSION);`), so this model works with llama.cpp but fails with rustformers/llm.
```rust
let model = llm::load(
    path,
    llm::TokenizerSource::Embedded,
    parameters,
    llm::load_progress_callback_stdout,
)
.unwrap_or_else(|err| panic!("Failed to load model: {err}"));
```
```
thread '<unnamed>' panicked at llm/inference/src/llms/local/llama2.rs:45:35:
Failed to load model: quantization version was missing, despite model containing quantized tensors
```
My solution was to just get rid of that whole check block.
I'm not sure how you want to handle this, since it does remove a check.
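If deleting the check outright is too permissive, one alternative is to mirror llama.cpp's reader behavior and fall back to version 2 when the key is absent. This is only a sketch: the function name and the flat metadata map are hypothetical, not the actual rustformers/llm API.

```rust
use std::collections::HashMap;

// llama.cpp writes GGML_QNT_VERSION (currently 2) into this key, and
// assumes 2 when reading a file that lacks it.
const DEFAULT_QNT_VERSION: u32 = 2;

// Hypothetical helper: look up the quantization version in the model's
// metadata, defaulting instead of panicking when the key is missing.
fn quantization_version(metadata: &HashMap<String, u32>) -> u32 {
    metadata
        .get("general.quantization_version")
        .copied()
        .unwrap_or(DEFAULT_QNT_VERSION)
}

fn main() {
    // A file like the one in this report: no version key at all.
    let metadata: HashMap<String, u32> = HashMap::new();
    println!("assumed version: {}", quantization_version(&metadata));
}
```

This keeps a sanity check in place for files that do carry an unexpected version, while accepting files that llama.cpp itself would load without complaint.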