intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Inference hangs on LNL iGPU with large input prompts. #12158

Open · aahouzi opened this issue 1 week ago

aahouzi commented 1 week ago

Type of issue

Inference hangs on the Lunar Lake (LNL) iGPU when the input prompt is large (the '1000-512' in/out pair in the all-in-one benchmark config below). The hang occurs in the generation call:

  output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_len, min_new_tokens=out_len, num_beams=num_beams)

config.yaml:
repo_id:
  - 'arcee-ai/arcee-lite'
  # - 'arcee-ai/Llama-3.1-SuperNova-Lite'
local_model_hub: 'C:\Users\Intel\.cache\huggingface\hub\models--arcee-ai--arcee-lite\snapshots\c5cb9c38be16b64757f785f0df36dca87f76d5e2'
warm_up: 1 # must be set >=2 when running the "pipeline_parallel_gpu" test_api
num_trials: 1
num_beams: 1 # default to greedy search
low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
batch_size: 1 # default to 1
in_out_pairs:
    # - '6-128'
    # - '6-256'
    - '1000-512'
test_api:
  # - "transformer_int4_fp16_gpu"             # on Intel GPU, transformer-like API, (qtype=int4), (dtype=fp16)
  # - "transformer_int4_fp16_gpu_win"       # on Intel GPU for Windows, transformer-like API, (qtype=int4), (dtype=fp16)
  # - "transformer_int4_gpu"                # on Intel GPU, transformer-like API, (qtype=int4), (dtype=fp32)
  - "transformer_int4_gpu_win"            # on Intel GPU for Windows, transformer-like API, (qtype=int4), (dtype=fp32)
  # - "transformer_int4_loadlowbit_gpu_win" # on Intel GPU for Windows, transformer-like API, (qtype=int4), use load_low_bit API. Please make sure you have used the save.py to save the converted low bit model
  # - "transformer_int4_fp16_loadlowbit_gpu_win" # on Intel GPU for Windows, transformer-like API, (qtype=int4), (dtype=fp16), use load_low_bit API. Please make sure you have used the save.py to save the converted low bit model
  # - "bigdl_fp16_gpu"                      # on Intel GPU, use ipex-llm transformers API, (dtype=fp16), (qtype=fp16)
  # - "optimize_model_gpu"                  # on Intel GPU, can optimize any pytorch models include transformer model
  # - "deepspeed_optimize_model_gpu"        # on Intel GPU, deepspeed autotp inference
  # - "pipeline_parallel_gpu"               # on Intel GPU, pipeline parallel inference
  # - "speculative_gpu"                     # on Intel GPU, inference with self-speculative decoding
  # - "transformer_int4"                    # on Intel CPU, transformer-like API, (qtype=int4)
  # - "native_int4"                         # on Intel CPU
  # - "optimize_model"                      # on Intel CPU, can optimize any pytorch models include transformer model
  # - "pytorch_autocast_bf16"               # on Intel CPU
  # - "transformer_autocast_bf16"           # on Intel CPU
  # - "bigdl_ipex_bf16"                     # on Intel CPU, (qtype=bf16)
  # - "bigdl_ipex_int4"                     # on Intel CPU, (qtype=int4)
  # - "bigdl_ipex_int8"                     # on Intel CPU, (qtype=int8)
  # - "speculative_cpu"                     # on Intel CPU, inference with self-speculative decoding
  # - "deepspeed_transformer_int4_cpu"      # on Intel CPU, deepspeed autotp inference
  # - "transformers_int4_npu_win"           # on Intel NPU for Windows,  transformer-like API, (qtype=int4)
  # - "transformers_int4_loadlowbit_npu_win" # on Intel NPU for Windows, transformer-like API, (qtype=int4), use load_low_bit API. Please make sure you have used the save_npu.py to save the converted low bit model
cpu_embedding: False # whether to put the embedding layer on CPU
streaming: False # whether to output in a streaming way (currently only available for gpu win related test_api)
optimize_model: False # whether to apply further optimization on NPU (currently only available for the transformers_int4_npu_win test_api)
use_fp16_torch_dtype: True # whether to use fp16 for non-linear layers (currently only available for the "pipeline_parallel_gpu" test_api)
task: 'continuation' # task can be 'continuation', 'QA' or 'summarize'
transpose_value_cache: True # whether to apply the transposed v_cache optimization on NPU (currently only available for the transformers_int4_npu_win test_api)
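
For context, the transformer_int4_gpu_win test exercises the ipex-llm transformers-style API on the Intel GPU. Below is a minimal sketch of roughly what that path does, assuming the documented ipex-llm API; the actual all-in-one benchmark harness adds warm-up runs and timing, and the prompt here is only a placeholder:

  import torch
  from transformers import AutoTokenizer
  from ipex_llm.transformers import AutoModelForCausalLM

  # Local snapshot path taken from local_model_hub in the config above.
  model_path = r"C:\Users\Intel\.cache\huggingface\hub\models--arcee-ai--arcee-lite\snapshots\c5cb9c38be16b64757f785f0df36dca87f76d5e2"

  # Load with sym_int4 weight quantization (low_bit: 'sym_int4') and move to the Intel GPU.
  model = AutoModelForCausalLM.from_pretrained(
      model_path,
      load_in_low_bit="sym_int4",
      optimize_model=True,
      trust_remote_code=True,
      use_cache=True,
  ).to("xpu")
  tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

  prompt = "..."  # placeholder; the failing case uses a ~1000-token prompt
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("xpu")

  with torch.inference_mode():
      # Greedy search with 512 new tokens, matching the '1000-512' in_out_pair.
      output_ids = model.generate(
          input_ids,
          do_sample=False,
          max_new_tokens=512,
          min_new_tokens=512,
          num_beams=1,
      )
      torch.xpu.synchronize()
  print(tokenizer.decode(output_ids[0], skip_special_tokens=True))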

GPU Driver version

32.0.101.5737

What operating system are you seeing the problem on?

Windows 11

Oscilloscope98 commented 7 hours ago

Hi @aahouzi ,

We recently updated ipex-llm with Lunar Lake (LNL) support. You could refer to here for how to install ipex-llm for the LNL iGPU.

Besides, for the all-in-one benchmark, you could also try the transformer_int4_fp16_gpu_win test_api.

Please let us know if you run into any further problems :)
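
For reference, per the comments in the config above, the suggested transformer_int4_fp16_gpu_win test differs from transformer_int4_gpu_win mainly in running the non-quantized layers in fp16 rather than fp32. A minimal, hedged sketch of that variant (reusing the same local snapshot path; the benchmark harness itself may differ):

  from ipex_llm.transformers import AutoModelForCausalLM

  model_path = r"C:\Users\Intel\.cache\huggingface\hub\models--arcee-ai--arcee-lite\snapshots\c5cb9c38be16b64757f785f0df36dca87f76d5e2"

  # Same sym_int4 weight quantization, but cast the remaining layers to fp16
  # with .half() before moving the model to the XPU.
  model = AutoModelForCausalLM.from_pretrained(
      model_path,
      load_in_low_bit="sym_int4",
      optimize_model=True,
      trust_remote_code=True,
      use_cache=True,
  ).half().to("xpu")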