The problem happened with the code below. It turns out the file didn't include the `general.quantization_version` metadata. When llama.cpp reads a file without that key, it assumes version 2 (grep for the line `gguf_set_val_u32(ctx_out, "general.quantization_version", GGML_QNT_VERSION);`), so this model works with llama.cpp but fails with rustformers/llm.
```rust
let model = llm::load(
    path,
    llm::TokenizerSource::Embedded,
    parameters,
    llm::load_progress_callback_stdout,
)
.unwrap_or_else(|err| panic!("Failed to load model: {err}"));
```
```
thread '<unnamed>' panicked at llm/inference/src/llms/local/llama2.rs:45:35:
Failed to load model: quantization version was missing, despite model containing quantized tensors
```
My solution was to just get rid of that whole check block.
I'm not sure how you want to handle this, since it does remove a check.
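If deleting the check outright is too permissive, one alternative is to mirror llama.cpp's reader behavior and fall back to version 2 when the key is absent. This is only a sketch: the function name and the flat metadata map are hypothetical, not the actual rustformers/llm API.

```rust
use std::collections::HashMap;

// llama.cpp writes GGML_QNT_VERSION (currently 2) into this key, and
// assumes 2 when reading a file that lacks it.
const DEFAULT_QNT_VERSION: u32 = 2;

// Hypothetical helper: look up the quantization version in the model's
// metadata, defaulting instead of panicking when the key is missing.
fn quantization_version(metadata: &HashMap<String, u32>) -> u32 {
    metadata
        .get("general.quantization_version")
        .copied()
        .unwrap_or(DEFAULT_QNT_VERSION)
}

fn main() {
    // A file like the one in this report: no version key at all.
    let metadata: HashMap<String, u32> = HashMap::new();
    println!("assumed version: {}", quantization_version(&metadata));
}
```

This keeps a sanity check in place for files that do carry an unexpected version, while accepting files that llama.cpp itself would load without complaint.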