intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

warmup(Qwen1.5-1.8B-Chat) very slow #11641

Closed · makejiang closed this 1 month ago

makejiang commented 1 month ago

Hi ipex-llm team, the first-time warmup with Qwen1.5-1.8B-Chat took about 20 minutes on my machine. The second time was fine.

Is this normal?

Log: [screenshot]

Code snippet

import torch
import gc
from pathlib import Path

from ipex_llm.transformers import AutoModel, AutoModelForCausalLM
from transformers import AutoTokenizer, LlamaTokenizer

import intel_extension_for_pytorch as ipex

from configs.model_config import MODEL_ROOT_PATH

llm_warmup_prompts = ["what is ai?"]

print(">> NOTE: The one-time warmup may take several minutes. Please be patient until it finishes warm-up...")
print("-"*15, " Start warming-up LLM chatglm3-6b on MTL iGPU ", "-"*15)

model_path = f'{MODEL_ROOT_PATH}/Qwen1.5-1.8B-Chat' # "Qwen2-1.5B" #"chatglm3-6b" #
print(f"model:{model_path}")

# Load a pre-quantized low-bit checkpoint if the path indicates one;
# otherwise quantize the model to 4-bit at load time.
if model_path.endswith("int4") or model_path.endswith("4bit"):
    model = AutoModelForCausalLM.load_low_bit(model_path,
                                              optimize_model=True,
                                              trust_remote_code=True,
                                              use_cache=True)
else:
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=True,
                                                 trust_remote_code=True,
                                                 use_cache=True)

model = model.half().to('xpu')

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
#tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    for prompt in llm_warmup_prompts:

        # messages = [
        #     {"role": "system", "content": "You are a helpful assistant."},
        #     {"role": "user", "content": prompt}
        #     ]
        # text = tokenizer.apply_chat_template(
        #     messages,
        #     tokenize=False,
        #     add_generation_prompt=True
        #     )
        # input_ids = tokenizer([text], return_tensors="pt").to("xpu").input_ids

        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        output = model.generate(input_ids, max_new_tokens=32)

print("-"*15, " Warming-up of LLM chatglm3-6b on MTL iGPU is completed (1/4) ", "-"*15)

model.to('cpu')
torch.xpu.synchronize()
torch.xpu.empty_cache()
del model
gc.collect()

Environment:

(dev-zone) sit-sku5-9@MTL-TEST C:\Users\SIT-SKU5-9\source\os.linux.ubuntu.cloud.baseos.devzone\_test>env-check.bat
Python 3.11.9
-----------------------------------------------------------------
transformers=4.41.2
-----------------------------------------------------------------
torch=2.1.0a0+cxx11.abi
-----------------------------------------------------------------
Name: ipex-llm
Version: 2.1.0b20240717
Summary: Large Language Model Develop Toolkit
Home-page: https://github.com/intel-analytics/ipex-llm
Author: BigDL Authors
Author-email: bigdl-user-group@googlegroups.com
License: Apache License, Version 2.0
Location: C:\Users\SIT-SKU5-9\.miniconda_dev_zone\envs\dev-zone\Lib\site-packages
Requires:
Required-by:
-----------------------------------------------------------------
C:\Users\SIT-SKU5-9\.miniconda_dev_zone\envs\dev-zone\Lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: 'Could not find module 'C:\Users\SIT-SKU5-9\.miniconda_dev_zone\envs\dev-zone\Lib\site-packages\torchvision\image.pyd' (or one of its dependencies). Try using the full path with constructor syntax.'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
ipex=2.1.10+xpu
-----------------------------------------------------------------
Total Memory: 31.615 GB

Chip 0 Memory: 4 GB | Speed: 7467 MHz
Chip 1 Memory: 4 GB | Speed: 7467 MHz
Chip 2 Memory: 4 GB | Speed: 7467 MHz
Chip 3 Memory: 4 GB | Speed: 7467 MHz
Chip 4 Memory: 4 GB | Speed: 7467 MHz
Chip 5 Memory: 4 GB | Speed: 7467 MHz
Chip 6 Memory: 4 GB | Speed: 7467 MHz
Chip 7 Memory: 4 GB | Speed: 7467 MHz
-----------------------------------------------------------------
CPU Manufacturer: GenuineIntel
CPU MaxClockSpeed: 1400
CPU Name: Intel(R) Core(TM) Ultra 7 155H
CPU NumberOfCores: 16
CPU NumberOfLogicalProcessors: 22
-----------------------------------------------------------------
GPU 0: Intel(R) Arc(TM) Graphics         Driver Version:  31.0.101.5382
-----------------------------------------------------------------
-----------------------------------------------------------------
System Information

Host Name:                 MTL-TEST
OS Name:                   Microsoft Windows 11 Pro
OS Version:                10.0.22631 N/A Build 22631
OS Manufacturer:           Microsoft Corporation
OS Configuration:          Standalone Workstation
OS Build Type:             Multiprocessor Free
Registered Owner:          SIT-SKU5-9
Registered Organization:   N/A
Product ID:                00330-80000-00000-AA767
Original Install Date:     3/12/2024, 1:53:59 AM
System Boot Time:          7/19/2024, 1:29:17 AM
System Manufacturer:       LENOVO
System Model:              INVALID
System Type:               x64-based PC
Processor(s):              1 Processor(s) Installed.
                           [01]: Intel64 Family 6 Model 170 Stepping 4 GenuineIntel ~1400 Mhz
BIOS Version:              LENOVO MECN55WW, 12/21/2023
Windows Directory:         C:\Windows
System Directory:          C:\Windows\system32
Boot Device:               \Device\HarddiskVolume1
System Locale:             en-us;English (United States)
Input Locale:              en-us;English (United States)
Time Zone:                 (UTC+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Total Physical Memory:     32,373 MB
Available Physical Memory: 21,335 MB
Virtual Memory: Max Size:  37,237 MB
Virtual Memory: Available: 21,882 MB
Virtual Memory: In Use:    15,355 MB
Page File Location(s):     C:\pagefile.sys
Domain:                    WORKGROUP
Logon Server:              \\MTL-TEST
Hotfix(s):                 4 Hotfix(s) Installed.
                           [01]: KB5039895
                           [02]: KB5027397
                           [03]: KB5040442
                           [04]: KB5039338
Network Card(s):           2 NIC(s) Installed.
                           [01]: Intel(R) Wi-Fi 6E AX211 160MHz
                                 Connection Name: Wi-Fi
                                 Status:          Media disconnected
                           [02]: ASIX USB to Gigabit Ethernet Family Adapter
                                 Connection Name: Ethernet
                                 DHCP Enabled:    Yes
                                 DHCP Server:     10.239.27.228
                                 IP address(es)
                                 [01]: 10.239.146.211
                                 [02]: fe80::7ae4:dbd7:8710:577f
Hyper-V Requirements:      A hypervisor has been detected. Features required for Hyper-V will not be displayed.
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Arc(TM) Graphics                                               |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | UUID: 00000000-0000-0200-0000-00087d558086                                           |
|           | PCI BDF Address: 0000:00:02.0                                                        |
+-----------+--------------------------------------------------------------------------------------+
ACupofAir commented 1 month ago

I used your code to try to reproduce the issue in the same environment and found no timing anomaly; the warmup took about 2 seconds. (P.S.: I commented out the `import intel_extension_for_pytorch as ipex` line.)

Here is the result: [screenshot]

Here is the test code:

import torch
import gc
from pathlib import Path

from ipex_llm.transformers import AutoModel, AutoModelForCausalLM
from transformers import AutoTokenizer, LlamaTokenizer

# import intel_extension_for_pytorch as ipex

# from configs.model_config import MODEL_ROOT_PATH
import time

llm_warmup_prompts = ["what is ai?"]

print(">> NOTE: The one-time warmup may take several minutes. Please be patient until it finishes warm-up...")
print("-"*15, " Start warming-up LLM chatglm3-6b on MTL iGPU ", "-"*15)

# model_path = f'{MODEL_ROOT_PATH}/Qwen1.5-1.8B-Chat' # "Qwen2-1.5B" #"chatglm3-6b" #
model_path = '../model_tmp/Qwen1.5-1.8B-Chat'
print(f"model:{model_path}")

if model_path.endswith("int4") or model_path.endswith("4bit"):
    model = AutoModelForCausalLM.load_low_bit(model_path,
                                              optimize_model=True,
                                              trust_remote_code=True,
                                              use_cache=True)
else:
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=True,
                                                 trust_remote_code=True,
                                                 use_cache=True)

model = model.half().to('xpu')

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
#tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)

start_time = time.time()
with torch.inference_mode():
    for prompt in llm_warmup_prompts:

        # messages = [
        #     {"role": "system", "content": "You are a helpful assistant."},
        #     {"role": "user", "content": prompt}
        #     ]
        # text = tokenizer.apply_chat_template(
        #     messages,
        #     tokenize=False,
        #     add_generation_prompt=True
        #     )
        # input_ids = tokenizer([text], return_tensors="pt").to("xpu").input_ids

        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        output = model.generate(input_ids, max_new_tokens=32)

end_time = time.time()
delta_time = end_time - start_time
print(f'use {delta_time:.4f}s')
print("-"*15, " Warming-up of LLM chatglm3-6b on MTL iGPU is completed (1/4) ", "-"*15)

model.to('cpu')
torch.xpu.synchronize()
torch.xpu.empty_cache()
del model
gc.collect()

Here are the machine's hardware parameters: [screenshot]

makejiang commented 1 month ago

Yes, just as I said, it only happened the first time. As I understand it, the GPU kernels have to be compiled the first time a new model is used; I think that caused the delay, and the software stack cached the compilation result for subsequent runs. So to reproduce the issue we may need a clean system. Does anyone know how to clear the cache? Thanks.
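
For reference, a minimal sketch of forcing a cold warmup by deleting the SYCL persistent kernel cache. The default locations below are an assumption based on the DPC++ runtime's documented defaults, and the directory differs if SYCL_CACHE_DIR is set:

import os
import shutil

# Hypothetical helper: delete the SYCL persistent kernel cache so the
# next run recompiles all GPU kernels from scratch (cold warmup).
# Assumed default locations (used only when SYCL_CACHE_DIR is unset):
#   Windows: %AppData%\libsycl_cache
#   Linux:   $XDG_CACHE_HOME/libsycl_cache or ~/.cache/libsycl_cache
cache_dir = os.environ.get("SYCL_CACHE_DIR")
if cache_dir is None:
    if os.name == "nt":
        cache_dir = os.path.join(os.environ["APPDATA"], "libsycl_cache")
    else:
        base = os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
        cache_dir = os.path.join(base, "libsycl_cache")

if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)
    print(f"Removed kernel cache: {cache_dir}")
else:
    print(f"No kernel cache found at {cache_dir}")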

jason-dai commented 1 month ago

> Yes, just as I said, it only happened the first time. As I understand it, the GPU kernels have to be compiled the first time a new model is used; I think that caused the delay, and the software stack cached the compilation result for subsequent runs. So to reproduce the issue we may need a clean system. Does anyone know how to clear the cache? Thanks.

See https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Overview/FAQ/faq.md#the-first-time-to-run-model-on-meteor-lakes-igpuintel-core-ultra-integrated-gpu-will-takes-5-10-minutes
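
Per that FAQ, the first run on a Meteor Lake iGPU triggers GPU kernel compilation, which can take several minutes. A minimal sketch of keeping the compiled kernels across runs, assuming the DPC++ runtime's SYCL_CACHE_PERSISTENT variable is honored by this stack:

import os

# Persist compiled GPU kernels to disk so later runs skip JIT
# compilation. Must be set before the XPU runtime initializes,
# i.e., before importing torch / ipex_llm in this process.
os.environ["SYCL_CACHE_PERSISTENT"] = "1"

(Deleting the cache directory, as sketched above, would undo this and force recompilation on the next run.)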

makejiang commented 1 month ago

Thanks @ACupofAir @jason-dai for your support. It seems there is still no good solution at the moment.