https://github.com/ggerganov/llama.cpp/pull/252 changed the model format, and we're not compatible with it yet. Thanks for spotting this - we'll need to expedite the fix.
In the meantime, you can re-quantize the model with a version of llama.cpp
that predates that, or find a quantized model floating around the internet from before then.
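If you're not sure which format a particular .bin file uses, here is a small standalone sketch (not part of llama-rs) that just inspects the leading 4-byte magic number; the two constants are the same ones checked by the loader code later in this thread (0x67676d6c for the old unversioned files, 0x67676d66 for the versioned format introduced by that PR):

use std::{fs::File, io::Read};

// Standalone check of which ggml file format a model uses, based on the
// leading 4-byte magic number (read little-endian).
fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).expect("usage: check-magic <model.bin>");
    let mut magic = [0u8; 4];
    File::open(&path)?.read_exact(&mut magic)?;
    match u32::from_le_bytes(magic) {
        0x6767_6d6c => println!("{path}: old unversioned format ('ggml')"),
        0x6767_6d66 => println!("{path}: versioned format ('ggmf')"),
        other => println!("{path}: unknown magic {other:#x}"),
    }
    Ok(())
}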
Hmm... thanks, I will try to re-quantize the model with a previous version!
Got it! I tried with the previous Alpaca version that I had!
Great! We'll leave this issue open as a reminder that we'll need to update to handle the new format.
A small change to the code is sufficient to handle the new versioned file format:
#[error("file is pre-versioned, generate another please! at {path:?}")]
PreVersioned { path: PathBuf },
#[error("invalid magic number for {path:?}")]
InvalidMagic { path: PathBuf },
#[error("invalid version number for {path:?}")]
InvalidVersion { path: PathBuf },
...
// Verify magic
{
    let magic = read_i32(&mut reader)?;
    // 0x67676d6c spells "ggml": the old, unversioned format.
    if magic == 0x67676d6c {
        return Err(LoadError::PreVersioned {
            path: main_path.to_owned(),
        });
    }
    // 0x67676d66 spells "ggmf": the versioned format introduced by llama.cpp#252.
    if magic != 0x67676d66 {
        return Err(LoadError::InvalidMagic {
            path: main_path.to_owned(),
        });
    }
}
// Verify the version
{
    let format_version = read_i32(&mut reader)?;
    if format_version != 1 {
        return Err(LoadError::InvalidVersion {
            path: main_path.to_owned(),
        });
    }
}
...
// Load vocabulary
let mut vocab = Vocabulary::default();
for i in 0..hparams.n_vocab {
    let len = read_i32(&mut reader)?;
    if let Ok(word) = read_string(&mut reader, len as usize) {
        vocab.mapping.push(word);
    } else {
        load_progress_callback(LoadProgress::BadToken {
            index: i.try_into()?,
        });
        vocab.mapping.push("�".to_string());
    }
    // The score is stored as a 4-byte float, so reinterpret the raw bits
    // instead of numerically casting the integer read.
    let score = f32::from_bits(read_i32(&mut reader)? as u32);
    vocab.score.push(score);
}
It works without issues, but I don't know if it's sufficient; nothing panicked and it did the inference.
I think the change in the binary format was just the inclusion of the version number and the per-token score in the vocabulary section, but I am not sure.
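To make that concrete, here is a rough sketch (not actual llama-rs code) of the versioned layout as implied by the loader snippet above; the names are mine, and the hyperparameter and tensor sections are assumed unchanged:

// Illustrative only; reconstructed from the reads shown above.
struct GgmfHeader {
    magic: i32,   // 0x67676d66 ("ggmf"); older files used 0x67676d6c ("ggml")
    version: i32, // currently 1
}

// Repeated n_vocab times after the hyperparameters.
struct GgmfVocabEntry {
    len: i32,       // byte length of the token string
    token: Vec<u8>, // `len` UTF-8 bytes
    score: f32,     // new in the versioned format
}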
After fixing this bug and adding score: Vec<f32> to the Vocabulary struct (roughly as sketched below), the 7B model works, but 65B does not; it crashes with an allocation error in the ggml library.
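For reference, a minimal sketch of what the updated struct could look like, assuming the existing mapping: Vec<String> field used by the loader above; only score is new:

// Illustrative only; the field set is assumed from the loader code in this thread.
#[derive(Default)]
pub struct Vocabulary {
    /// Token string for each token id.
    pub mapping: Vec<String>,
    /// Per-token score read from the versioned (ggmf) vocabulary section.
    pub score: Vec<f32>,
}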
Related pull request: #61
@mguinhos thanks for the reference; I hadn't even seen the issue. Feel free to modify the PR. I was just running llama-rs for the first time and ran into this issue, so I figured it'd be best to share the small fixes.
Can probably close this issue now?
I'm using the current main branch of llama-rs. I got the 7B model, used the Python script to convert it, then quantized it with the latest commit of llama.cpp, and I'm getting this error:
thread 'main' panicked at 'Could not load model: InvalidMagic { path: "LLaMA/7B/ggml-model-q4_0.bin" }', llama-cli/src/main.rs:206:6
llama.cpp works with the same model:
❯ ./main -m LLaMA/7B/ggml-model-q4_0.bin -p "test"
main: seed = 1680698608
llama_model_load: loading model from 'LLaMA/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from '/home/wojtek/special_downloads/LLaMA/7B/ggml-model-q4_0.bin'
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 12 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0
test_suite = True
# The suite of tests to see running.
test_suite( 'TestSuite' ) [end of text]
llama_print_timings: load time = 1165.30 ms
llama_print_timings: sample time = 14.24 ms / 27 runs ( 0.53 ms per run)
llama_print_timings: prompt eval time = 763.14 ms / 2 tokens ( 381.57 ms per token)
llama_print_timings: eval time = 4288.87 ms / 26 runs ( 164.96 ms per run)
llama_print_timings: total time = 5468.84 ms
EDIT: I managed to get it to work by reconverting and requantizing the model with llama.cpp commit 5cb63e2, from before the format change. Since I'm using the current main of llama-rs, it should work with the newer format, but I'm getting the InvalidMagic error above.
The model successfully runs on llama.cpp but not in llama-rs.
Command: