intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

No acceleration in example code on CPU #9893

Closed forcekeng closed 7 months ago

forcekeng commented 8 months ago

I tried the code in python/llm/example/CPU/HF-Transformers-AutoModels/Model/mistral/generate.py, but there is no acceleration on my CPU. I tried it on both Windows and Windows WSL. However, both llm-cli and the original llama.cpp run the model about 10x faster on my CPU. Does anyone know how to make the given code run faster?

Models tested: llama_7b, llama_7b_chat, mistral_7b_instruct_v0.2. CPU: 12th Gen Intel(R) Core(TM) i7-12700KF. Memory: 32 GB.

Test code:

import torch
import time
import argparse

from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Load the model, convert it to 4-bit, and save the low-bit checkpoint
model_path = "../model/Misral_7B_Instruct_v0.2/"
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True,
                                             use_safetensors=True)
save_dir = "../model/Misral_7B_Instruct_v0.2_4bit"
model.save_low_bit(save_dir)
del model

# Reload the tokenizer and the saved low-bit model for inference
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.load_low_bit(save_dir)

prompt = """<s>[INST] You are a helpful assistant and know everthing. [/INST]
ok, I know everything in the world, ask me anything you are interesting.</s>
[INST] {text} [/INST]"""

# inference
with torch.inference_mode():
    while True:
        input_text = input()
        if input_text == "stop":
            break
        input_text = prompt.format(text=input_text)
        input_ids = tokenizer.encode(input_text, return_tensors="pt")
        st = time.time()
        output = model.generate(input_ids, max_new_tokens=256)
        end = time.time()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Output', '-'*20)
        print(output_str)
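
For a fairer comparison with llama.cpp, which reports per-token latency rather than total wall time, it may also help to print throughput. A minimal sketch of hypothetical lines that could be added at the end of the loop above, reusing the output, input_ids, st and end variables:

        # Report new tokens per second in addition to total wall-clock time
        num_new_tokens = output.shape[1] - input_ids.shape[1]
        print(f'Throughput: {num_new_tokens / (end - st):.2f} tokens/s')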

The output:

Can you tell me how to improve my oral English, my oral English is not good at all.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Inference time: 35.75278043746948 s
-------------------- Output --------------------
[INST] You are a helpful assistant and know everthing. [/INST]
ok, I know everything in the world, ask me anything you are interesting. 
[INST] Can you tell me how to improve my oral English, my oral English is not good at all. [/INST] Of course! Here are some tips to help improve your oral English:

1. Practice regularly: The more you practice speaking English, the more comfortable and confident you will become. Try to find opportunities to speak English every day, such as with native speakers, language exchange partners, or even by speaking to yourself.
2. Listen to English media: Listening to English music, podcasts, or watching English movies and TV shows can help you improve your listening skills and expand your vocabulary.
3. Learn grammar rules: Understanding English grammar rules can help you construct sentences more accurately and fluently.
4. Use flashcards: Flashcards can help you memorize new vocabulary words and phrases.
5. Speak slowly and clearly: Speaking slowly and clearly can help you improve your pronunciation and make yourself more easily understood.
6. Repeat after native speakers: Repeating after native speakers can help you improve your pronunciation and intonation.
7. Use a dictionary: Using a dictionary can help you look up new words and phrases and learn their meanings and usage.
8. Practice in real-life situations: Practicing English in real-life situations, such as ordering food at a
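
Note: the attention_mask / pad_token_id warning in the log above is a standard Transformers message and is unrelated to speed. A minimal sketch of one way to silence it, assuming the same tokenizer and model objects as in the script above:

# Pass the attention mask and an explicit pad_token_id to generate()
# so Transformers does not fall back to defaults and print the warning.
inputs = tokenizer(input_text, return_tensors="pt")
output = model.generate(inputs.input_ids,
                        attention_mask=inputs.attention_mask,
                        pad_token_id=tokenizer.eos_token_id,
                        max_new_tokens=256)
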
Zhengjin-Wang commented 7 months ago

Hi, we tested the same code and the Mistral-7B-Instruct-v0.2 model on an i9-13900K with 64 GB memory; the inference time was 17.881616592407227 s. For Llama-2-7b it was 19.197434425354004 s.

Zhengjin-Wang commented 7 months ago

We also tested llm-cli with Llama-2-7b, using the same prompt and max output tokens on the same i9 machine; the time cost was 14.29773 s.

Zhengjin-Wang commented 7 months ago

Maybe you can check your bigdl-llm version, or try it in a Docker container and check whether there are any differences in environment variables between the container and your environment. I used the image intelanalytics/bigdl-llm-cpu:2.5.0-SNAPSHOT for testing.
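
If it helps, a quick sketch for comparing thread-related environment variables inside and outside the container (the variables listed are common OpenMP/Intel ones, not an exhaustive set):

import os

# Print thread/affinity settings that often differ between a tuned container
# and a bare host; run this in both environments and compare the output.
for var in ("OMP_NUM_THREADS", "KMP_AFFINITY", "KMP_BLOCKTIME", "LD_PRELOAD"):
    print(var, "=", os.environ.get(var, "<unset>"))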

forcekeng commented 7 months ago

> We also tested llm-cli with Llama-2-7b, using the same prompt and max output tokens on the same i9 machine; the time cost was 14.29773 s.

OK, thanks for your reply. I'll check it and compare the Python code with llm-cli and llama.cpp.