cyita opened this issue 11 months ago
Does this issue happen only for llama2-7b? https://github.com/analytics-zoo/nano/issues/543#issuecomment-1751667183 When I test NF3, I also encounter a similar issue when running llama2-7b, but 13b and chatglm work well.
NF4 llama2-13b works well.
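For anyone trying to reproduce across formats, here is a minimal loading sketch, assuming the bigdl-llm transformers API of these nightlies (the model path is a placeholder):

```python
# Minimal repro sketch, not the benchmark script itself.
# Swap "nf4" for "nf3" / "sym_int4" to compare the formats discussed above.
import torch
import intel_extension_for_pytorch as ipex  # required before .to("xpu")
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder path

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_low_bit="nf4",
                                             trust_remote_code=True)
model = model.to("xpu")
tokenizer = LlamaTokenizer.from_pretrained(model_path)
```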
@yangw1234 Please take a look.
How about this: https://github.com/intel-analytics/llm.cpp/pull/112
This error still exists when using NF4.
I can reproduce this as well:
```
=========First token cost xxxx s=========
=========Rest tokens cost average xxxx s (31 tokens in all)=========
Traceback (most recent call last):
  File "/home/arda/kai/BigDL/python/llm/example/gpu/hf-transformers-models/llama2/./generate.py", line 88, in <module>
    output = model.generate(input_ids,
  File "/home/arda/anaconda3/envs/kai-llm-pip/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/arda/kai/BigDL/python/llm/example/gpu/hf-transformers-models/llama2/benchmark_util.py", line 1564, in generate
    return self.greedy_search(
  File "/home/arda/kai/BigDL/python/llm/example/gpu/hf-transformers-models/llama2/benchmark_util.py", line 2382, in greedy_search
    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
  File "/home/arda/kai/BigDL/python/llm/example/gpu/hf-transformers-models/llama2/benchmark_util.py", line 528, in prepare_inputs_for_generation
    return self.model.prepare_inputs_for_generation(*args, **kwargs)
  File "/home/arda/anaconda3/envs/kai-llm-pip/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 861, in prepare_inputs_for_generation
    position_ids = attention_mask.long().cumsum(-1) - 1
RuntimeError: Allocation is out of device memory on current platform.
```
The warm-up step finishes normally, while the second generate call gets OOM. @yangw1234 I'm using 2.4.0b20231011.
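The failing pattern looks roughly like this (a sketch assuming a model/tokenizer loaded on "xpu" as in the earlier sketch; token counts are illustrative):

```python
import torch

prompt = "Once upon a time, there existed a little girl"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to("xpu")

with torch.inference_mode():
    model.generate(input_ids, max_new_tokens=32)  # warm-up: completes fine
    torch.xpu.synchronize()
    # second call raises:
    # RuntimeError: Allocation is out of device memory on current platform.
    model.generate(input_ids, max_new_tokens=32)
```

Notably, the allocation that fails (`attention_mask.long().cumsum(-1) - 1`) is tiny, which suggests device memory is already exhausted or fragmented before that line runs.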
NF3 and INT4 work normally, and llama2-13b is fine with NF4.
Will look into it with @cyita
Still cannot reproduce on arc-04. :worried:
Seems the issue is caused by `export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1`. After unsetting this variable, llama2-7b runs successfully on arc02/09, which previously failed. But strangely, arc-04 works even with this environment variable set...
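A sketch of the workaround: the variable is read when the SYCL runtime initializes, so it has to be cleared before any "xpu" work starts, either in the shell before launch or at the very top of the script:

```python
import os
# Hypothetical in-script workaround; unsetting the variable in the shell
# before launching Python is the safer route.
os.environ.pop("SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS", None)

import torch
import intel_extension_for_pytorch as ipex  # SYCL/Level Zero init happens from here on
```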
With fuse rope disabled, arc02/09 works well even with `export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1`. Checked the oneAPI version, glibc version, and pip packages: no difference between 02 and 04.
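For reference, "fuse rope" replaces the stock rotary-embedding application; when it is disabled, the fallback is essentially the eager PyTorch version, roughly as in transformers' modeling_llama (shapes simplified here):

```python
import torch

def rotate_half(x):
    # Rotate half the hidden dims of the input.
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2:]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
    # q, k: [bs, n_heads, seq_len, head_dim]; cos/sin: [max_seq_len, head_dim]
    cos = cos[position_ids].unsqueeze(1)  # -> [bs, 1, seq_len, head_dim]
    sin = sin[position_ids].unsqueeze(1)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

The fused kernel does the same math in a single device pass, which may explain why toggling it changes the allocation behavior here.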
The bigdl-llm 2.4.0b20231006 build generates outputs normally. Not sure if this issue is caused by PR #9066.

ENV
bigdl-llm: the main branch (2023.10.7 5:31 PM)

Error message
Model: llama2-7b
Input: "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun"
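When bisecting between the two nightlies (2.4.0b20231006 works, 2.4.0b20231011 fails), it may help to first confirm which build is actually active in the environment, e.g.:

```python
# Quick check of the installed bigdl-llm build.
from importlib.metadata import version
print(version("bigdl-llm"))
```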