Unable to run chatglm3 speculative with IPEX220 #10233
```
Traceback (most recent call last):
  File "/speculative/chatglm3/./speculative.py", line 62, in <module>
    model = model.to('cpu')
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2460, in to
    return super().to(*args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 990 more times]
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 824, in _apply
    with torch.no_grad():
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 81, in __enter__
    torch.set_grad_enabled(False)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 184, in __init__
    self.prev = torch.is_grad_enabled()
RecursionError: maximum recursion depth exceeded while calling a Python object
```
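For what it's worth, a RecursionError with hundreds of repeated `_apply` frames usually means the module tree is deeper than Python's default recursion limit (1000) or contains a cycle, i.e. a module reachable from itself. The snippet below is purely illustrative and not a claim about what BigDL's speculative wrapper actually does: a module registered inside itself reproduces exactly this failure signature.

```python
import torch

# Purely illustrative: a module registered as its own child makes Module._apply
# recurse until Python's recursion limit is exceeded, producing the same
# repeated `module._apply(fn)` frames seen in the traceback above.
class Cyclic(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)
        self.self_ref = self  # cycle: registers the module inside itself

try:
    Cyclic().to('cpu')
except RecursionError as err:
    print("RecursionError:", err)
```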
Then, after deleting the line `model = model.to('cpu')`, I got another error:
```
Traceback (most recent call last):
  File "/speculative/chatglm3/./speculative.py", line 72, in <module>
    output = model.generate(input_ids,
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/bigdl/llm/transformers/speculative.py", line 65, in generate
    return self.speculative_generate(inputs=inputs,
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/bigdl/llm/transformers/speculative.py", line 635, in speculative_generate
    output = self.trace_graph(input_ids=drafted_input_ids,
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/models.py", line 1520, in forward
    outputs = self.optimized_model(*args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
RuntimeError: 0 INTERNAL ASSERT FAILED at "../torch/csrc/jit/ir/alias_analysis.cpp":615, please report a bug to PyTorch. We don't have an op for aten::unsqueeze but it isn't a special case. Argument types: Tensor, int,
Candidates:
  aten::unsqueeze(Tensor(a) self, int dim) -> Tensor(a)
```
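Some background that may help triage (my reading, not something the traceback states outright): `self.trace_graph` is a jit-traced module, and stock transformers models only produce trace-friendly tuple outputs when `config.torchscript=True`; otherwise they return `ModelOutput` dicts, which strict `torch.jit.trace` rejects. This is consistent with the `torchscript=True` suggestion later in the thread. A minimal, self-contained illustration of that difference:

```python
import torch

class ReturnsDict(torch.nn.Module):
    """Mimics a Hugging Face model returning a ModelOutput-style dict."""
    def forward(self, x):
        return {"logits": x.unsqueeze(0)}

class ReturnsTuple(torch.nn.Module):
    """Mimics the same model with config.torchscript=True (tuple outputs)."""
    def forward(self, x):
        return (x.unsqueeze(0),)

x = torch.randn(3)
try:
    torch.jit.trace(ReturnsDict(), x)    # strict tracing rejects dict outputs
except RuntimeError as err:
    print("dict output fails to trace:", str(err)[:80])

traced = torch.jit.trace(ReturnsTuple(), x)  # tuple outputs trace cleanly
print(traced(x)[0].shape)                    # torch.Size([1, 3])
```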
- Docker image: intelanalytics/bigdl-llm-cpu:2.5.0-SNAPSHOT
- conda env: bigdl-speculative-py39
- model: chatglm3-6b
- IPEX version: IPEX 2.2.0+cpu
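A quick way to re-check this pairing from inside the container, using only the standard library (the distribution names below are the usual pip names, which is an assumption about this image):

```python
# Print installed versions of the packages involved in this issue.
from importlib.metadata import PackageNotFoundError, version

for dist in ("torch", "intel-extension-for-pytorch", "bigdl-llm", "transformers"):
    try:
        print(dist, version(dist))
    except PackageNotFoundError:
        print(dist, "not installed")
```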
With `export BIGDL_OPT_IPEX=true`, the output of chatglm3-6b speculative decoding is the following:
[collapsed output]
Add `torchscript=True` in https://github.com/intel-analytics/BigDL/blob/2a1ded79e77068db2e91b4d4b080f4c2b13a472c/python/llm/example/CPU/Speculative-Decoding/chatglm3/speculative.py#L54
[collapsed output]
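A sketch of what that change could look like, assuming the linked line loads the model via `AutoModel.from_pretrained` (the surrounding kwargs are my guesses at the example's existing arguments, not verified against the file): in transformers, `torchscript=True` forces tuple outputs (`use_return_dict` becomes false), which is what a jit-trace path needs.

```python
import torch
from bigdl.llm.transformers import AutoModel

# Hypothetical reconstruction of speculative.py line 54 with the suggested fix.
# Only torchscript=True is the actual suggestion; everything else is assumed.
model = AutoModel.from_pretrained(
    "THUDM/chatglm3-6b",           # assumed model id/path
    optimize_model=True,
    torch_dtype=torch.bfloat16,    # BF16 speculative decoding on CPU (assumption)
    load_in_low_bit="bf16",
    speculative=True,
    trust_remote_code=True,
    torchscript=True,              # the suggested change
)
```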
Then, after deleting the line `model = model.to('cpu')`, I got another error:
[collapsed output]
Please take a look, @xiangyuT.