intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

warmup(Qwen1.5-1.8B-Chat) very slow #11641

Closed · makejiang closed this 1 month ago

makejiang commented 1 month ago

Hi ipex-llm team, the first-time warmup with Qwen1.5-1.8B-Chat took about 20 minutes on my machine. The second time was fine.

Is this normal?

Log: [screenshot]

Code snippet

import torch
import gc
from pathlib import Path

from ipex_llm.transformers import AutoModel, AutoModelForCausalLM
from transformers import AutoTokenizer, LlamaTokenizer

import intel_extension_for_pytorch as ipex

from configs.model_config import MODEL_ROOT_PATH

llm_warmup_prompts = ["what is ai?"]

print(">> NOTE: The one-time warmup may take several minutes. Please be patient until it finishes warm-up...")
print("-"*15, " Start warming-up LLM chatglm3-6b on MTL iGPU ", "-"*15)

model_path = f'{MODEL_ROOT_PATH}/Qwen1.5-1.8B-Chat' # "Qwen2-1.5B" #"chatglm3-6b" #
print(f"model:{model_path}")

# Load a pre-quantized low-bit checkpoint if the path indicates one;
# otherwise quantize the model to 4-bit at load time.
if model_path.endswith("int4") or model_path.endswith("4bit"):
    model = AutoModelForCausalLM.load_low_bit(model_path,
                                              optimize_model=True,
                                              trust_remote_code=True,
                                              use_cache=True)
else:
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=True,
                                                 trust_remote_code=True,
                                                 use_cache=True)

model = model.half().to('xpu')

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
#tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    for prompt in llm_warmup_prompts:

        # messages = [
        #     {"role": "system", "content": "You are a helpful assistant."},
        #     {"role": "user", "content": prompt}
        #     ]
        # text = tokenizer.apply_chat_template(
        #     messages,
        #     tokenize=False,
        #     add_generation_prompt=True
        #     )
        # input_ids = tokenizer([text], return_tensors="pt").to("xpu").input_ids

        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        output = model.generate(input_ids, max_new_tokens=32)

print("-"*15, " Warming-up of LLM chatglm3-6b on MTL iGPU is completed (1/4) ", "-"*15)

model.to('cpu')
torch.xpu.synchronize()
torch.xpu.empty_cache()
del model
gc.collect()

Environment:

(dev-zone) sit-sku5-9@MTL-TEST C:\Users\SIT-SKU5-9\source\os.linux.ubuntu.cloud.baseos.devzone\_test>env-check.bat
Python 3.11.9
-----------------------------------------------------------------
transformers=4.41.2
-----------------------------------------------------------------
torch=2.1.0a0+cxx11.abi
-----------------------------------------------------------------
Name: ipex-llm
Version: 2.1.0b20240717
Summary: Large Language Model Develop Toolkit
Home-page: https://github.com/intel-analytics/ipex-llm
Author: BigDL Authors
Author-email: bigdl-user-group@googlegroups.com
License: Apache License, Version 2.0
Location: C:\Users\SIT-SKU5-9\.miniconda_dev_zone\envs\dev-zone\Lib\site-packages
Requires:
Required-by:
-----------------------------------------------------------------
C:\Users\SIT-SKU5-9\.miniconda_dev_zone\envs\dev-zone\Lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: 'Could not find module 'C:\Users\SIT-SKU5-9\.miniconda_dev_zone\envs\dev-zone\Lib\site-packages\torchvision\image.pyd' (or one of its dependencies). Try using the full path with constructor syntax.'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
ipex=2.1.10+xpu
-----------------------------------------------------------------
Total Memory: 31.615 GB

Chip 0 Memory: 4 GB | Speed: 7467 MHz
Chip 1 Memory: 4 GB | Speed: 7467 MHz
Chip 2 Memory: 4 GB | Speed: 7467 MHz
Chip 3 Memory: 4 GB | Speed: 7467 MHz
Chip 4 Memory: 4 GB | Speed: 7467 MHz
Chip 5 Memory: 4 GB | Speed: 7467 MHz
Chip 6 Memory: 4 GB | Speed: 7467 MHz
Chip 7 Memory: 4 GB | Speed: 7467 MHz
-----------------------------------------------------------------
CPU Manufacturer: GenuineIntel
CPU MaxClockSpeed: 1400
CPU Name: Intel(R) Core(TM) Ultra 7 155H
CPU NumberOfCores: 16
CPU NumberOfLogicalProcessors: 22
-----------------------------------------------------------------
GPU 0: Intel(R) Arc(TM) Graphics         Driver Version:  31.0.101.5382
-----------------------------------------------------------------
-----------------------------------------------------------------
System Information

Host Name:                 MTL-TEST
OS Name:                   Microsoft Windows 11 Pro
OS Version:                10.0.22631 N/A Build 22631
OS Manufacturer:           Microsoft Corporation
OS Configuration:          Standalone Workstation
OS Build Type:             Multiprocessor Free
Registered Owner:          SIT-SKU5-9
Registered Organization:   N/A
Product ID:                00330-80000-00000-AA767
Original Install Date:     3/12/2024, 1:53:59 AM
System Boot Time:          7/19/2024, 1:29:17 AM
System Manufacturer:       LENOVO
System Model:              INVALID
System Type:               x64-based PC
Processor(s):              1 Processor(s) Installed.
                           [01]: Intel64 Family 6 Model 170 Stepping 4 GenuineIntel ~1400 Mhz
BIOS Version:              LENOVO MECN55WW, 12/21/2023
Windows Directory:         C:\Windows
System Directory:          C:\Windows\system32
Boot Device:               \Device\HarddiskVolume1
System Locale:             en-us;English (United States)
Input Locale:              en-us;English (United States)
Time Zone:                 (UTC+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Total Physical Memory:     32,373 MB
Available Physical Memory: 21,335 MB
Virtual Memory: Max Size:  37,237 MB
Virtual Memory: Available: 21,882 MB
Virtual Memory: In Use:    15,355 MB
Page File Location(s):     C:\pagefile.sys
Domain:                    WORKGROUP
Logon Server:              \\MTL-TEST
Hotfix(s):                 4 Hotfix(s) Installed.
                           [01]: KB5039895
                           [02]: KB5027397
                           [03]: KB5040442
                           [04]: KB5039338
Network Card(s):           2 NIC(s) Installed.
                           [01]: Intel(R) Wi-Fi 6E AX211 160MHz
                                 Connection Name: Wi-Fi
                                 Status:          Media disconnected
                           [02]: ASIX USB to Gigabit Ethernet Family Adapter
                                 Connection Name: Ethernet
                                 DHCP Enabled:    Yes
                                 DHCP Server:     10.239.27.228
                                 IP address(es)
                                 [01]: 10.239.146.211
                                 [02]: fe80::7ae4:dbd7:8710:577f
Hyper-V Requirements:      A hypervisor has been detected. Features required for Hyper-V will not be displayed.
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Arc(TM) Graphics                                               |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | UUID: 00000000-0000-0200-0000-00087d558086                                           |
|           | PCI BDF Address: 0000:00:02.0                                                        |
+-----------+--------------------------------------------------------------------------------------+
ACupofAir commented 1 month ago

I used your code to try to reproduce the issue in the same environment and found no timing anomaly; the warmup took about 2 seconds. (P.S.: I commented out the `import intel_extension_for_pytorch as ipex` line.)

Here is the result: [screenshot]

Here is the test code:

import torch
import gc
from pathlib import Path

from ipex_llm.transformers import AutoModel, AutoModelForCausalLM
from transformers import AutoTokenizer, LlamaTokenizer

# import intel_extension_for_pytorch as ipex

# from configs.model_config import MODEL_ROOT_PATH
import time

llm_warmup_prompts = ["what is ai?"]

print(">> NOTE: The one-time warmup may take several minutes. Please be patient until it finishes warm-up...")
print("-"*15, " Start warming-up LLM chatglm3-6b on MTL iGPU ", "-"*15)

# model_path = f'{MODEL_ROOT_PATH}/Qwen1.5-1.8B-Chat' # "Qwen2-1.5B" #"chatglm3-6b" #
model_path = '../model_tmp/Qwen1.5-1.8B-Chat'
print(f"model:{model_path}")

if model_path.endswith("int4") or model_path.endswith("4bit"):
    model = AutoModelForCausalLM.load_low_bit(model_path,
                                              optimize_model=True,
                                              trust_remote_code=True,
                                              use_cache=True)
else:
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=True,
                                                 trust_remote_code=True,
                                                 use_cache=True)

model = model.half().to('xpu')

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
#tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)

start_time = time.time()
with torch.inference_mode():
    for prompt in llm_warmup_prompts:

        # messages = [
        #     {"role": "system", "content": "You are a helpful assistant."},
        #     {"role": "user", "content": prompt}
        #     ]
        # text = tokenizer.apply_chat_template(
        #     messages,
        #     tokenize=False,
        #     add_generation_prompt=True
        #     )
        # input_ids = tokenizer([text], return_tensors="pt").to("xpu").input_ids

        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        output = model.generate(input_ids, max_new_tokens=32)

end_time = time.time()
delta_time = end_time - start_time
print(f'use {delta_time:.4f}s')
print("-"*15, " Warming-up of LLM chatglm3-6b on MTL iGPU is completed (1/4) ", "-"*15)

model.to('cpu')
torch.xpu.synchronize()
torch.xpu.empty_cache()
del model
gc.collect()

Here are the machine's hardware parameters: [screenshot]

makejiang commented 1 month ago

Yes, just as I said, it only happened the first time. As I understand it, the GPU kernels have to be compiled the first time a new model is used; I think that caused the delay, and the software stack cached the compilation result for subsequent runs. So to reproduce the issue we may need a clean system. Does anyone know how to clear the cache? Thanks.
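
For reference, a minimal sketch of forcing a cold warmup by deleting the SYCL persistent kernel cache. The default locations below are an assumption based on the DPC++ runtime's documented defaults, and the directory differs if SYCL_CACHE_DIR is set:

import os
import shutil

# Hypothetical helper: delete the SYCL persistent kernel cache so the
# next run recompiles all GPU kernels from scratch (cold warmup).
# Assumed default locations (used only when SYCL_CACHE_DIR is unset):
#   Windows: %AppData%\libsycl_cache
#   Linux:   $XDG_CACHE_HOME/libsycl_cache or ~/.cache/libsycl_cache
cache_dir = os.environ.get("SYCL_CACHE_DIR")
if cache_dir is None:
    if os.name == "nt":
        cache_dir = os.path.join(os.environ["APPDATA"], "libsycl_cache")
    else:
        base = os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
        cache_dir = os.path.join(base, "libsycl_cache")

if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)
    print(f"Removed kernel cache: {cache_dir}")
else:
    print(f"No kernel cache found at {cache_dir}")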

jason-dai commented 1 month ago

> Yes, just as I said, it only happened the first time. As I understand it, the GPU kernels have to be compiled the first time a new model is used; I think that caused the delay, and the software stack cached the compilation result for subsequent runs. So to reproduce the issue we may need a clean system. Does anyone know how to clear the cache? Thanks.

See https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Overview/FAQ/faq.md#the-first-time-to-run-model-on-meteor-lakes-igpuintel-core-ultra-integrated-gpu-will-takes-5-10-minutes
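
Per that FAQ, the first run on a Meteor Lake iGPU triggers GPU kernel compilation, which can take several minutes. A minimal sketch of keeping the compiled kernels across runs, assuming the DPC++ runtime's SYCL_CACHE_PERSISTENT variable is honored by this stack:

import os

# Persist compiled GPU kernels to disk so later runs skip JIT
# compilation. Must be set before the XPU runtime initializes,
# i.e., before importing torch / ipex_llm in this process.
os.environ["SYCL_CACHE_PERSISTENT"] = "1"

(Deleting the cache directory, as sketched above, would undo this and force recompilation on the next run.)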

makejiang commented 1 month ago

Thanks @ACupofAir @jason-dai for your support. It seems there is still no good solution at the moment.