intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Unable to run chatglm3 speculative decoding with IPEX 2.2.0 #10233

Closed · Jasonzzt closed this issue 6 months ago

Jasonzzt commented 6 months ago

Environment:

- Docker image: intelanalytics/bigdl-llm-cpu:2.5.0-SNAPSHOT
- Conda env: bigdl-speculative-py39
- Model: chatglm3-6b
- IPEX version: 2.2.0+cpu

With export BIGDL_OPT_IPEX=true set, chatglm3-6b speculative decoding fails with the following output:
Traceback (most recent call last):
  File "/speculative/chatglm3/./speculative.py", line 54, in 
    model = AutoModel.from_pretrained(model_path,
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/bigdl/llm/transformers/model.py", line 300, in from_pretrained
    model = cls.load_convert(q_k, optimize_model, *args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/bigdl/llm/transformers/model.py", line 419, in load_convert
    model = ggml_convert_low_bit(model, qtype, optimize_model,
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/bigdl/llm/transformers/convert.py", line 558, in ggml_convert_low_bit
    model = _optimize_ipex(model)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/bigdl/llm/transformers/convert.py", line 660, in _optimize_ipex
    return _ipex_jit(model)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/bigdl/llm/transformers/convert_ipex.py", line 117, in _ipex_jit
    trace_model = torch.jit.trace(
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/intel_extension_for_pytorch/jit/_trace.py", line 69, in wrapper
    traced = f(*args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/jit/_trace.py", line 806, in trace
    return trace_module(
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/jit/_trace.py", line 1062, in trace_module
    module._c._create_method_from_trace_with_dict(
RuntimeError: Tracer cannot infer type of CausalLMOutputWithPast(loss=None, logits=tensor([[[-3.9844, -3.9844, -1.1719,  ..., -3.9844, -3.9844, -3.9844],
         [-3.9844, -3.9844, -1.1719,  ..., -3.9844, -3.9844, -3.9844],
         [-3.9844, -3.9844, -1.1641,  ..., -3.9844, -3.9844, -3.9844],
         ...,
         [-3.9844, -3.9844, -1.2031,  ..., -3.9844, -3.9844, -3.9844],
         [-3.9844, -3.9844, -1.2109,  ..., -3.9844, -3.9844, -3.9844],
         [-3.9688, -3.9688, -1.2031,  ..., -3.9688, -3.9688, -3.9688]]],
       dtype=torch.bfloat16), past_key_values=((tensor([[[[         1535270912],
          [               8135],

I then added torchscript=True to the from_pretrained call in https://github.com/intel-analytics/BigDL/blob/2a1ded79e77068db2e91b4d4b080f4c2b13a472c/python/llm/example/CPU/Speculative-Decoding/chatglm3/speculative.py#L54 (sketched below), which produced a different error.
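A minimal sketch of that change, assuming the load arguments from the linked example (model_path is a placeholder for the chatglm3-6b checkpoint):

```python
import torch
from bigdl.llm.transformers import AutoModel

model_path = "THUDM/chatglm3-6b"  # placeholder: local checkpoint or HF repo id

# Arguments below are assumed to mirror the linked example; only
# torchscript=True is new. It makes forward() return plain tuples instead of
# CausalLMOutputWithPast, whose type torch.jit.trace cannot infer.
model = AutoModel.from_pretrained(model_path,
                                  optimize_model=True,
                                  torch_dtype=torch.bfloat16,
                                  load_in_low_bit="bf16",
                                  speculative=True,
                                  trust_remote_code=True,
                                  use_cache=True,
                                  torchscript=True)  # the added argument
```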

The resulting error:
Traceback (most recent call last):
  File "/speculative/chatglm3/./speculative.py", line 62, in 
    model = model.to('cpu')
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2460, in to
    return super().to(*args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 990 more times]
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 824, in _apply
    with torch.no_grad():
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 81, in __enter__
    torch.set_grad_enabled(False)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 184, in __init__
    self.prev = torch.is_grad_enabled()
RecursionError: maximum recursion depth exceeded while calling a Python object

I then deleted the line model = model.to('cpu') and got yet another error:
Traceback (most recent call last):
  File "/speculative/chatglm3/./speculative.py", line 72, in 
    output = model.generate(input_ids,
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/bigdl/llm/transformers/speculative.py", line 65, in generate
    return self.speculative_generate(inputs=inputs,
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/bigdl/llm/transformers/speculative.py", line 635, in speculative_generate
    output = self.trace_graph(input_ids=drafted_input_ids,
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/intel_extension_for_pytorch/transformers/models/reference/models.py", line 1520, in forward
    outputs = self.optimized_model(*args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/bigdl-speculative-py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
RuntimeError: 0 INTERNAL ASSERT FAILED at "../torch/csrc/jit/ir/alias_analysis.cpp":615, please report a bug to PyTorch. We don't have an op for aten::unsqueeze but it isn't a special case.  Argument types: Tensor, int, 

Candidates:
        aten::unsqueeze(Tensor(a) self, int dim) -> Tensor(a)

Please take a look, @xiangyuT.

xiangyuT commented 6 months ago

Speculative decoding with BIGDL_OPT_IPEX=true requires an attention_mask to be passed to the generate() method; this should be fixed by #10236.
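For reference, a minimal caller-side sketch of that requirement, assuming a standard Hugging Face tokenizer and the model loaded as in the example above (prompt is a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
prompt = "What is AI?"  # placeholder input text
inputs = tokenizer(prompt, return_tensors="pt")

# Pass attention_mask explicitly; with BIGDL_OPT_IPEX=true the speculative
# generate() path needs it rather than inferring one from input_ids.
output = model.generate(inputs.input_ids,
                        attention_mask=inputs.attention_mask,
                        max_new_tokens=128,
                        do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```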

Jasonzzt commented 6 months ago

Issue closed