Using metal and `n_gpu_layers` produces no tokens

jasonw247 commented 9 months ago

I'm running the example script with a few different models:

use llama_cpp_rs::{
    options::{ModelOptions, PredictOptions},
    LLama,
};

pub fn llama_predict() -> Result<String, anyhow::Error> {

    // metal seems to give really bad results
    let model_options = ModelOptions {
          //n_gpu_layers: 1,
        ..Default::default()
    };

    // let model_options = ModelOptions::default();

    let llama = LLama::new(
        "models/mistral-7b-instruct-v0.1.Q4_0.gguf".into(),
        &model_options,
    )
    .unwrap();

    let predict_options = PredictOptions {
        //top_k: 20,
        // top_p: 0.1,
        // f16_kv: true,

        token_callback: Some(Box::new(|token| {
            println!("token: {}", token);
            true
        })),
        ..Default::default()
    };

    // TODO: get this working on master. Metal support is flakey.
    let response = llama
        .predict(
            "what are the national animals of india".into(),
             predict_options,
        )
        .unwrap();
    println!("Response: {}", response);
    Ok(response)
}

#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn test_llama_cpp_rs() -> Result<(), anyhow::Error> {
        let response = llama_predict()?;
        println!("Response: {}", response);
        assert!(!response.is_empty());
        Ok(())
    }
}

When not using metal (not using n_gpu_layers) the models generate tokens ex:

token: ind
token: ian
token:  national
token:  animal
token:  is
token:  t
token: iger
token: 
Response: indian national animal is tiger
Response: indian national animal is tiger

When I use n_gpu_layers it does not generate tokens, ex:

llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =   64.00 MiB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 76.07 MiB
llama_new_context_with_model: max tensor size =   102.54 MiB
count 0
token:
token:
token:
token:
...
Response:
Response:

Is this a known behavior?

pixelspark commented 8 months ago

Is llama.cpp actually using Metal? I tried this and noticed (only after enabling some debug logging) that in fact the file ggml-metal.metal could not be found (it needs to be placed in the current working directory). After this the basic example works just fine for me (and actually uses the GPU) with a Mixtral GGUF model.

jasonw247 commented 8 months ago

I copied over the necessary metal files, otherwise I would get an error. After copying the files I encountered the no generated tokens issue.

shaqq commented 7 months ago

Is llama.cpp actually using Metal? I tried this and noticed (only after enabling some debug logging) that in fact the file ggml-metal.metal could not be found (it needs to be placed in the current working directory). After this the basic example works just fine for me (and actually uses the GPU) with a Mixtral GGUF model.

AFAIK it does: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#metal-build

shaqq commented 7 months ago

llama-cpp-python requires the user to specify CMAKE_ARGS when during pip install: https://llama-cpp-python.readthedocs.io/en/latest/install/macos/

Do users need to do something similar during cargo install for this crate?

mikecvet commented 7 months ago

Reading through here, it seems like llama.cpp needs to be built with specific flags in order for metal support to work: https://github.com/ggerganov/llama.cpp/pull/1642

mdrokz / rust-llama.cpp

Using metal and `n_gpu_layers` produces no tokens #30