rustformers / llm

[Unmaintained, see README] An ecosystem of Rust libraries for working with large language models
https://docs.rs/llm/latest/llm/
Apache License 2.0

Why is the feed_prompt process so slow? #439

Open zackshen opened 8 months ago

zackshen commented 8 months ago

LLM is indeed a fantastic library and very easy to use. However, after using it for a few days, I noticed that feed_prompt is always very slow. It consumes a significant amount of CPU and doesn't use the GPU (the hardware acceleration documentation says that feed_prompt currently doesn't use GPU resources). As a result, if I add some context during the conversation, feed_prompt takes a long time to complete, which hurts the user experience. I used TheBloke/Llama-2-7B-Chat-GGML/llama-2-7b-chat.ggmlv3.q2_K.bin for testing.

Using the same model and prompt, I tested with llama.cpp, and its time to first token is much shorter. I'm not sure how the feed_prompt process differs between llm and llama.cpp. Judging by the CPU and GPU history, llama.cpp seems to be fully utilizing the GPU for inference.

Can you please help me identify what's wrong?

Model:

  1. TheBloke/Llama-2-7B-Chat-GGML/llama-2-7b-chat.ggmlv3.q2_K.bin

System:

  1. Apple 2020 M1 16GB
  2. macOS 13.6.1 (22G313)

llama.cpp command:

./main -m {{MODEL_PATH}}  -p "[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

[/INST]

[INST] What is the largest animal in the world ? [/INST]
"

llama.cpp Result:

llama_print_timings:        load time =     473.17 ms
llama_print_timings:      sample time =      49.00 ms /   144 runs   (    0.34 ms per token,  2938.90 tokens per second)
llama_print_timings: prompt eval time =    1460.21 ms /   155 tokens (    9.42 ms per token,   106.15 tokens per second)
llama_print_timings:        eval time =   11099.90 ms /   143 runs   (   77.62 ms per token,    12.88 tokens per second)
llama_print_timings:       total time =   12666.70 ms

llm sample code:

use std::convert::Infallible;
use std::path::PathBuf;
use std::sync::{Arc, Mutex};
use std::time::Instant;

const DEFAULT_PROMPT: &str = r#"[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

[/INST]

[INST] What is the largest animal in the world ? [/INST]
"#;

    let model_path = PathBuf::from(MODEL_FILE);
    let model = llm::load_dynamic(
        Some(llm::ModelArchitecture::Llama),
        &model_path,
        llm::TokenizerSource::Embedded,
        llm::ModelParameters {
            prefer_mmap: true,
            use_gpu: true,
            ..Default::default()
        },
        llm::load_progress_callback_stdout,
    )
    .unwrap();

    let session_config = InferenceSessionConfig {
        n_batch: 512,
        ..Default::default()
    };
    let mut session = model.start_session(session_config);
    let mut rng = rand::thread_rng();
    let mut output_request = llm::OutputRequest::default();
    let sampler = Arc::new(Mutex::new(
        SamplerChain::<u32, f32>::new()
            + SampleTemperature::new(0.2)
            + SampleTopK::new(40, 40)
            + SampleTopP::new(0.95, 40)
            + SampleRandDistrib::new(),
    ));
    let params = llm::InferenceParameters { sampler };
    let ts = Instant::now();
    let mut first_token_time: Option<f32> = None;
    let ret = session
        .infer::<Infallible>(
            model.as_ref(),
            &mut rng,
            &llm::InferenceRequest {
                prompt: llm::Prompt::Text(DEFAULT_PROMPT),
                parameters: &params,
                play_back_previous_tokens: false,
                maximum_token_count: Some(1500),
            },
            &mut output_request,
            llm::conversation_inference_callback("[INST]", |t| {
                if first_token_time.is_none() {
                    first_token_time = Some(ts.elapsed().as_secs_f32());
                }
                print_token(t)
            }),
        )
        .unwrap();
    println!("{stats:#?}", stats = ret,);
    println!("first time to token: {first_token_time:?}");
    println!("token count {:?}", ret.prompt_tokens + ret.predict_tokens);
    println!(
        "prompt token speed {:?}/s",
        ret.prompt_tokens as f32 / ret.feed_prompt_duration.as_secs_f32()
    );
    println!(
        "predict token speed {:?}/s",
        ret.predict_tokens as f32 / ret.predict_duration.as_secs_f32()
    );
    println!(
        "summary speed {:?}/s",
        (ret.predict_tokens + ret.prompt_tokens) as f32
            / (ret.predict_duration.as_secs_f32() + ret.feed_prompt_duration.as_secs_f32())
    );

llm sample code result:

InferenceStats {
    feed_prompt_duration: 10.74704s,
    prompt_tokens: 155,
    predict_duration: 28.863045s,
    predict_tokens: 397,
}
first time to token: Some(11.22408)
token count 552
prompt token speed 14.422576/s
predict token speed 13.754613/s
summary speed 13.935845/s
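
To make the gap between the two runs concrete, here is a small sketch that recomputes throughput from the timings reported above. The helper name `tokens_per_second` is only illustrative and is not part of llm or llama.cpp; the numbers are copied from the two result listings.

```rust
/// Tokens per second for a phase that processed `tokens` tokens in `millis` milliseconds.
/// Illustrative helper, not part of llm or llama.cpp.
fn tokens_per_second(tokens: u32, millis: f64) -> f64 {
    f64::from(tokens) / (millis / 1000.0)
}

fn main() {
    // Prompt phase: the same 155-token prompt in both runs.
    let llama_cpp_prompt = tokens_per_second(155, 1460.21);
    let llm_prompt = tokens_per_second(155, 10_747.04);

    // Generation phase: 143 tokens (llama.cpp) and 397 tokens (llm).
    let llama_cpp_predict = tokens_per_second(143, 11_099.90);
    let llm_predict = tokens_per_second(397, 28_863.045);

    println!("prompt:  llama.cpp {llama_cpp_prompt:.2} tok/s, llm {llm_prompt:.2} tok/s");
    println!("predict: llama.cpp {llama_cpp_predict:.2} tok/s, llm {llm_predict:.2} tok/s");
    println!("prompt slowdown: {:.1}x", llama_cpp_prompt / llm_prompt);
}
```

Generation speed is about the same in both runs (roughly 13 tokens per second), while prompt processing is about 7 times slower in llm, so the difference appears to be confined to feed_prompt.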

philpax commented 8 months ago

Hey there! Thanks for reporting this and providing lots of detail :)

The issue here is that the version of GGML we use doesn’t support a specific operation required for feeding more than one token at a time with Metal (i.e. this works fine with CUDA, not Metal). See also #403.
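
As a rough illustration of what "feeding more than one token at a time" means, the sketch below splits a tokenized prompt into batches of at most `n_batch` tokens and counts how many evaluation calls result. It is not llm's actual implementation: `prompt_batches` is a hypothetical helper, the token ids are placeholders, and it assumes one evaluation call per batch.

```rust
/// Splits a tokenized prompt into batches of at most `n_batch` tokens.
/// Hypothetical helper for illustration; `n_batch` must be non-zero.
fn prompt_batches(tokens: &[u32], n_batch: usize) -> Vec<&[u32]> {
    tokens.chunks(n_batch).collect()
}

fn main() {
    // Placeholder token ids standing in for the 155-token prompt from the report.
    let prompt: Vec<u32> = (0..155).collect();

    for n_batch in [512, 8, 1] {
        let batches = prompt_batches(&prompt, n_batch);
        println!("n_batch = {n_batch:>3}: {} evaluation call(s)", batches.len());
    }
}
```

With `n_batch: 512`, as in the sample code, the whole prompt fits in a single batch, which is the case that depends on the multi-token operation mentioned above.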

This has been fixed in upstream GGML/llama.cpp, but we haven’t integrated that fix yet. The work has started in #428 and that should hopefully be finished within the next week (I’m out of town but I hope to get back to it soon).

Hope that helps clarify the state of affairs!

zackshen commented 8 months ago

I'm very happy to hear this news and looking forward to the merged version. Thank you for your work.

Can I wait until after the release to close this issue?

zackshen commented 7 months ago

Hello @philpax, has there been any recent movement on this?

philpax commented 7 months ago

I started working on it, but realised that it would end up being quite a large task. Still working on it, but it'll take some time.

zackshen commented 7 months ago

thanks