intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Arc NF4 OOM using the latest code #9095

Open cyita opened 11 months ago

cyita commented 11 months ago

bigdl-llm 2.4.0b20231006 generates outputs normally, so I am not sure whether this issue was introduced by PR #9066.

ENV

bigdl-llm: built from the main branch (2023-10-07, 5:31 PM)

Name: bigdl-core-xe
Version: 2.4.0b20231006
Summary: UNKNOWN
Home-page: UNKNOWN
Author: 
Author-email: 
License: UNKNOWN
Location: /opt/anaconda3/envs/yina-0911/lib/python3.9/site-packages
Requires: 
Required-by:

Name: transformers
Version: 4.31.0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /opt/anaconda3/envs/yina-0911/lib/python3.9/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: peft

Error message

Model: llama2-7b Input: "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun"
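
For reference, the failing run is roughly equivalent to the minimal sketch below (the model path is a placeholder, max_new_tokens=32 is inferred from the "31 tokens in all" log line, and the real llama_benchmark.py wraps generate with the timing hooks from benchmark_util.py):

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 -- registers the "xpu" device
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer

model_path = "meta-llama/Llama-2-7b-hf"  # placeholder path
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit="nf4")
model = model.to("xpu")

tokenizer = LlamaTokenizer.from_pretrained(model_path)
prompt = ("Once upon a time, there existed a little girl who liked to have "
          "adventures. She wanted to go to places and meet new people, and have fun")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    output = model.generate(input_ids, do_sample=False, max_new_tokens=32)

The run then fails as follows: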

=========First token cost 0.9864 s=========
=========Rest tokens cost average 0.0193 s (31 tokens in all)=========
Traceback (most recent call last):
  File "/home/arda/yina/llm.cpp/bigdl-core-xe/yina-test/llama_benchmark.py", line 83, in <module>
    output = llama_model.generate(input_ids, do_sample=False, max_new_tokens=max_new_tokens)
  File "/opt/anaconda3/envs/yina-0911/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/arda/yina/llm.cpp/bigdl-core-xe/yina-test/benchmark_util.py", line 1564, in generate
    return self.greedy_search(
  File "/home/arda/yina/llm.cpp/bigdl-core-xe/yina-test/benchmark_util.py", line 2382, in greedy_search
    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
  File "/home/arda/yina/llm.cpp/bigdl-core-xe/yina-test/benchmark_util.py", line 528, in prepare_inputs_for_generation
    return self.model.prepare_inputs_for_generation(*args, **kwargs)
  File "/opt/anaconda3/envs/yina-0911/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 861, in prepare_inputs_for_generation
    position_ids = attention_mask.long().cumsum(-1) - 1
RuntimeError: Allocation is out of device memory on current platform.
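
For context, the failing line only builds position ids from the attention mask, so the tensor it allocates is tiny (batch size x sequence length); the OOM therefore suggests device memory was already exhausted by earlier steps rather than by this allocation itself. A CPU-only sketch of the transformers 4.31 logic:

import torch

attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])  # toy mask: 4 real tokens, 2 padding
position_ids = attention_mask.long().cumsum(-1) - 1  # the line that raises the OOM on XPU
position_ids.masked_fill_(attention_mask == 0, 1)    # next line in modeling_llama.py
print(position_ids)                                  # tensor([[0, 1, 2, 3, 1, 1]])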
hkvision commented 11 months ago

Does this issue happen only for llama2-7b? https://github.com/analytics-zoo/nano/issues/543#issuecomment-1751667183 When I test NF3, I also encounter a similar issue when running llama2-7b, but 13b and ChatGLM work well.

cyita commented 11 months ago

Does this issue happen only for llama2-7b? analytics-zoo/nano#543 (comment) When I test NF3, I also encounter a similar issue when running llama2-7b, but 13b and ChatGLM work well.

NF4 llama2-13b works well.

cyita commented 11 months ago

@yangw1234 Please take a look.

yangw1234 commented 11 months ago

How about this: https://github.com/intel-analytics/llm.cpp/pull/112

cyita commented 11 months ago

How about this: intel-analytics/llm.cpp#112

This error still exists when using NF4.

hkvision commented 11 months ago

I can reproduce this as well:

=========First token cost xxxx s=========
=========Rest tokens cost average xxxx s (31 tokens in all)=========
Traceback (most recent call last):
  File "/home/arda/kai/BigDL/python/llm/example/gpu/hf-transformers-models/llama2/./generate.py", line 88, in <module>
    output = model.generate(input_ids,
  File "/home/arda/anaconda3/envs/kai-llm-pip/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/arda/kai/BigDL/python/llm/example/gpu/hf-transformers-models/llama2/benchmark_util.py", line 1564, in generate
    return self.greedy_search(
  File "/home/arda/kai/BigDL/python/llm/example/gpu/hf-transformers-models/llama2/benchmark_util.py", line 2382, in greedy_search
    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
  File "/home/arda/kai/BigDL/python/llm/example/gpu/hf-transformers-models/llama2/benchmark_util.py", line 528, in prepare_inputs_for_generation
    return self.model.prepare_inputs_for_generation(*args, **kwargs)
  File "/home/arda/anaconda3/envs/kai-llm-pip/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 861, in prepare_inputs_for_generation
    position_ids = attention_mask.long().cumsum(-1) - 1
RuntimeError: Allocation is out of device memory on current platform.

The warm-up step finishes normally, while the second generate call hits the OOM (see the sketch below). @yangw1234 I'm using 2.4.0b20231011.

NF3 and INT4 work normally, and llama2-13b is fine with NF4.

Will look into it with @cyita
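
To narrow this down, one could probe device memory between the two calls. A hypothetical debugging aid, assuming IPEX exposes the CUDA-style torch.xpu memory counters and that model and input_ids are prepared as in the sketch in the issue description:

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401

def report_xpu_memory(tag):
    # Hypothetical helper: print current and peak XPU allocations.
    torch.xpu.synchronize()
    print(f"{tag}: allocated={torch.xpu.memory_allocated() / 1e9:.2f} GB, "
          f"peak={torch.xpu.max_memory_allocated() / 1e9:.2f} GB")

with torch.inference_mode():
    model.generate(input_ids, do_sample=False, max_new_tokens=32)   # warm-up: succeeds
    report_xpu_memory("after warm-up")
    model.generate(input_ids, do_sample=False, max_new_tokens=32)   # second call: OOM on arc02/09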

yangw1234 commented 11 months ago

still cannot reproduce on arc-04. :worried:

hkvision commented 11 months ago

It seems the issue is caused by export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1. After unsetting this variable, llama2-7b runs successfully on arc02/09, which previously failed. Strangely, though, arc-04 works even with this environment variable set...

With fused RoPE disabled, arc02/09 also work well even with export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 (see the sketch below for toggling the variable).
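
A minimal sketch of toggling the variable from Python (assumption: the SYCL runtime reads it at initialization, so it must be set before the first XPU call; exporting it in the shell before launching the script is the safer route):

import os

os.environ["SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS"] = "1"       # reproduces the OOM on arc02/09
# os.environ.pop("SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS", None) # workaround: leave it unset

import torch                                # import only after the environment is set (assumption)
import intel_extension_for_pytorch as ipex  # noqa: F401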

hkvision commented 11 months ago

Checked the oneAPI version, glibc version, and pip packages; there is no difference between arc02 and arc-04.