intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Arc NF4 OOM using the latest code #9095

Open cyita opened 11 months ago

cyita commented 11 months ago

bigdl-llm 2.4.0b20231006 generates outputs normally, so I am not sure whether this issue was introduced by PR #9066.

ENV

bigdl-llm: built from the main branch (2023-10-07, 5:31 PM)

Name: bigdl-core-xe
Version: 2.4.0b20231006
Summary: UNKNOWN
Home-page: UNKNOWN
Author: 
Author-email: 
License: UNKNOWN
Location: /opt/anaconda3/envs/yina-0911/lib/python3.9/site-packages
Requires: 
Required-by:

Name: transformers
Version: 4.31.0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /opt/anaconda3/envs/yina-0911/lib/python3.9/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: peft

Error message

Model: llama2-7b Input: "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun"
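
For reference, the failing run is roughly equivalent to the minimal sketch below (the model path is a placeholder, max_new_tokens=32 is inferred from the "31 tokens in all" log line, and the real llama_benchmark.py wraps generate with the timing hooks from benchmark_util.py):

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 -- registers the "xpu" device
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer

model_path = "meta-llama/Llama-2-7b-hf"  # placeholder path
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit="nf4")
model = model.to("xpu")

tokenizer = LlamaTokenizer.from_pretrained(model_path)
prompt = ("Once upon a time, there existed a little girl who liked to have "
          "adventures. She wanted to go to places and meet new people, and have fun")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    output = model.generate(input_ids, do_sample=False, max_new_tokens=32)

The run then fails as follows: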

=========First token cost 0.9864 s=========
=========Rest tokens cost average 0.0193 s (31 tokens in all)=========
Traceback (most recent call last):
  File "/home/arda/yina/llm.cpp/bigdl-core-xe/yina-test/llama_benchmark.py", line 83, in <module>
    output = llama_model.generate(input_ids, do_sample=False, max_new_tokens=max_new_tokens)
  File "/opt/anaconda3/envs/yina-0911/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/arda/yina/llm.cpp/bigdl-core-xe/yina-test/benchmark_util.py", line 1564, in generate
    return self.greedy_search(
  File "/home/arda/yina/llm.cpp/bigdl-core-xe/yina-test/benchmark_util.py", line 2382, in greedy_search
    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
  File "/home/arda/yina/llm.cpp/bigdl-core-xe/yina-test/benchmark_util.py", line 528, in prepare_inputs_for_generation
    return self.model.prepare_inputs_for_generation(*args, **kwargs)
  File "/opt/anaconda3/envs/yina-0911/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 861, in prepare_inputs_for_generation
    position_ids = attention_mask.long().cumsum(-1) - 1
RuntimeError: Allocation is out of device memory on current platform.
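
For context, the failing line only builds position ids from the attention mask, so the tensor it allocates is tiny (batch size x sequence length); the OOM therefore suggests device memory was already exhausted by earlier steps rather than by this allocation itself. A CPU-only sketch of the transformers 4.31 logic:

import torch

attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])  # toy mask: 4 real tokens, 2 padding
position_ids = attention_mask.long().cumsum(-1) - 1  # the line that raises the OOM on XPU
position_ids.masked_fill_(attention_mask == 0, 1)    # next line in modeling_llama.py
print(position_ids)                                  # tensor([[0, 1, 2, 3, 1, 1]])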
hkvision commented 11 months ago

Does this issue happen only for llama2-7b? https://github.com/analytics-zoo/nano/issues/543#issuecomment-1751667183 When I test NF3, I also encounter a similar issue when running llama2-7b, but 13b and ChatGLM work well.

cyita commented 11 months ago

Does this issue happen only for llama2-7b? analytics-zoo/nano#543 (comment) When I test NF3, I also encounter a similar issue when running llama2-7b, but 13b and ChatGLM work well.

NF4 llama2-13b works well.

cyita commented 11 months ago

@yangw1234 Please take a look.

yangw1234 commented 11 months ago

How about this: https://github.com/intel-analytics/llm.cpp/pull/112

cyita commented 11 months ago

How about this: intel-analytics/llm.cpp#112

This error still exists when using NF4.

hkvision commented 11 months ago

I can reproduce this as well:

=========First token cost xxxx s=========
=========Rest tokens cost average xxxx s (31 tokens in all)=========
Traceback (most recent call last):
  File "/home/arda/kai/BigDL/python/llm/example/gpu/hf-transformers-models/llama2/./generate.py", line 88, in <module>
    output = model.generate(input_ids,
  File "/home/arda/anaconda3/envs/kai-llm-pip/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/arda/kai/BigDL/python/llm/example/gpu/hf-transformers-models/llama2/benchmark_util.py", line 1564, in generate
    return self.greedy_search(
  File "/home/arda/kai/BigDL/python/llm/example/gpu/hf-transformers-models/llama2/benchmark_util.py", line 2382, in greedy_search
    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
  File "/home/arda/kai/BigDL/python/llm/example/gpu/hf-transformers-models/llama2/benchmark_util.py", line 528, in prepare_inputs_for_generation
    return self.model.prepare_inputs_for_generation(*args, **kwargs)
  File "/home/arda/anaconda3/envs/kai-llm-pip/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 861, in prepare_inputs_for_generation
    position_ids = attention_mask.long().cumsum(-1) - 1
RuntimeError: Allocation is out of device memory on current platform.

The warm-up step finishes normally, while the second generate call hits the OOM (see the sketch below). @yangw1234 I'm using 2.4.0b20231011.

NF3 and INT4 work normally, and llama2-13b is fine with NF4.

Will look into it with @cyita
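
To narrow this down, one could probe device memory between the two calls. A hypothetical debugging aid, assuming IPEX exposes the CUDA-style torch.xpu memory counters and that model and input_ids are prepared as in the sketch in the issue description:

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401

def report_xpu_memory(tag):
    # Hypothetical helper: print current and peak XPU allocations.
    torch.xpu.synchronize()
    print(f"{tag}: allocated={torch.xpu.memory_allocated() / 1e9:.2f} GB, "
          f"peak={torch.xpu.max_memory_allocated() / 1e9:.2f} GB")

with torch.inference_mode():
    model.generate(input_ids, do_sample=False, max_new_tokens=32)   # warm-up: succeeds
    report_xpu_memory("after warm-up")
    model.generate(input_ids, do_sample=False, max_new_tokens=32)   # second call: OOM on arc02/09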

yangw1234 commented 11 months ago

still cannot reproduce on arc-04. :worried:

hkvision commented 11 months ago

It seems the issue is caused by export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1. After unsetting this variable, llama2-7b runs successfully on arc02/09, which previously failed. Strangely, though, arc-04 works even with this environment variable set...

With fused RoPE disabled, arc02/09 also work well even with export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 (see the sketch below for toggling the variable).
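
A minimal sketch of toggling the variable from Python (assumption: the SYCL runtime reads it at initialization, so it must be set before the first XPU call; exporting it in the shell before launching the script is the safer route):

import os

os.environ["SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS"] = "1"       # reproduces the OOM on arc02/09
# os.environ.pop("SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS", None) # workaround: leave it unset

import torch                                # import only after the environment is set (assumption)
import intel_extension_for_pytorch as ipex  # noqa: F401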

hkvision commented 11 months ago

Checked the oneAPI version, glibc version, and pip packages; there is no difference between arc02 and arc-04.