intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

AssertionError: Speculative decoding not yet supported for XPU backend #12463

Open HiddenPeak opened 3 days ago

HiddenPeak commented 3 days ago
#!/bin/bash
model="/llm/models/Qwen2.5-32B-Instruct"
served_model_name="Qwen2.5-32B-FP8"

export CCL_WORKER_COUNT=4
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0

source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --block-size 8 \
  --gpu-memory-utilization 0.85 \
  --device xpu \
  --dtype auto \
  --enforce-eager \
  --use-v2-block-manager \
  --speculative-model "/llm/models/Qwen2.5-0.5B-Instruct"  \
  --speculative-draft-tensor-parallel-size 1 \
  --num-speculative-tokens 5 \
  --load-in-low-bit sym_int8 \
  --max-model-len 2048 \
  --max-num-batched-tokens 4000 \
  --max-num-seqs 12 \
  --tensor-parallel-size 4 \
  --disable-async-output-proc \
  --distributed-executor-backend ray

When I set up speculative decoding via the ipex-llm vLLM Docker container, it shows me this:

INFO 11-28 21:37:01 llm_engine.py:226] Initializing an LLM engine (v0.6.2+ipexllm) with config: model='/llm/models/Qwen2.5-32B-Instruct', speculative_config=SpeculativeConfig(draft_model='/llm/models/Qwen2.5-0.5B-Instruct', num_spec_tokens=5), tokenizer='/llm/models/Qwen2.5-32B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=xpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2.5-32B-FP8, use_v2_block_manager=True, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
Process SpawnProcess-49:
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 145, in run_mp_engine
    engine = IPEXLLMMQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 133, in from_engine_args
    return super().from_engine_args(engine_args, usage_context, ipc_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
    return cls(
           ^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
                  ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 325, in __init__
    self.model_executor = executor_class(
                          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/vllm/executor/xpu_executor.py", line 38, in __init__
    assert (not speculative_config
            ^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Speculative decoding not yet supported for XPU backend
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 574, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.11/dist-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.11/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 541, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 195, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start
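
The failure comes from a hard guard in the installed vLLM build's XPU executor (vllm/executor/xpu_executor.py, line 38 in the traceback above), which rejects any speculative_config before the engine is constructed. A quick sanity check from inside the container, assuming the same install paths as in the traceback, is:

# Show the installed vLLM build and the guard that raises the assertion
# (paths taken from the traceback above; adjust if your container differs)
pip show vllm | grep -i '^version'
sed -n '30,45p' /usr/local/lib/python3.11/dist-packages/vllm/executor/xpu_executor.py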
gc-fu commented 3 days ago

Hi, this feature is not supported yet on XPU; we will see if we can support it.

HumerousGorgon commented 2 days ago

+1 to this; I was just thinking about it earlier today. I went to set it up and realised that it's not supported on the XPU backend. It would massively speed up model performance. Thanks!

HiddenPeak commented 1 day ago

This feature would be very useful for my applications. ipex-llm serving Qwen 32B in int8 is too slow to use on 4 Arc A770 cards. I hope I can follow your updates, test this, and use it.

HumerousGorgon commented 1 day ago

> This feature would be very useful for my applications. ipex-llm serving Qwen 32B in int8 is too slow to use on 4 Arc A770 cards. I hope I can follow your updates, test this, and use it.

What is your metric for 'too slow'? I run Qwen 32B on 2 Arc A770s and get around 22 t/s on text generation and very, very fast inference speeds (hundreds of tokens per second) with a 10240 context window.
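
For anyone comparing numbers like these, one rough way to measure generation throughput against the OpenAI-compatible endpoint started by the launch script above is a wall-clock check with curl and jq. This is a minimal sketch: the port and served model name are assumed to match the script at the top of this issue, and wall-clock time includes scheduling and network overhead, so it understates pure decode speed.

#!/bin/bash
# Rough tokens/s check against the vLLM OpenAI-compatible completions endpoint
ENDPOINT="http://localhost:8000/v1/completions"
MODEL="Qwen2.5-32B-FP8"

start=$(date +%s.%N)
resp=$(curl -s "$ENDPOINT" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL\", \"prompt\": \"Write a short story about a robot.\", \"max_tokens\": 256}")
end=$(date +%s.%N)

# The usage field of the (non-streaming) response reports the generated token count
tokens=$(echo "$resp" | jq '.usage.completion_tokens')
elapsed=$(echo "$end - $start" | bc)
echo "Generated $tokens tokens in ${elapsed}s ($(echo "scale=1; $tokens / $elapsed" | bc) tokens/s)"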

HiddenPeak commented 1 day ago

4 Arc A770 cards, 2.4 tokens/s, PLX 8756 x16 PCIe 3.0, Qwen 32B int8.

My target is Qwen 72B in int4, but I cannot run it with the ipex-llm serving Docker image.

The conversion to int4 stops every time... 😭

HiddenPeak commented 1 day ago

>> This feature would be very useful for my applications. ipex-llm serving Qwen 32B in int8 is too slow to use on 4 Arc A770 cards. I hope I can follow your updates, test this, and use it.
>
> What is your metric for 'too slow'? I run Qwen 32B on 2 Arc A770s and get around 22 t/s on text generation and very, very fast inference speeds (hundreds of tokens per second) with a 10240 context window.

Yes, 2 Arc A770 cards are very fast.

HumerousGorgon commented 1 day ago

> 4 Arc A770 cards, 2.4 tokens/s, PLX 8756 x16 PCIe 3.0, Qwen 32B int8.
>
> My target is Qwen 72B in int4, but I cannot run it with the ipex-llm serving Docker image.
>
> The conversion to int4 stops every time... 😭

Honestly, I would attempt to get the AWQ variant of it, then use the load-in-low-bit type asym_int4. Been working very well for me :)
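
For anyone wanting to try that, a minimal variation of the launch script at the top of this issue might look like the sketch below. The AWQ checkpoint path and served model name are assumptions (point them at whatever AWQ model you actually have locally); only flags already used in the original script are kept, with the low-bit type switched to asym_int4 and the speculative-decoding flags dropped.

#!/bin/bash
# Assumed local AWQ checkpoint; replace with your own path
model="/llm/models/Qwen2.5-72B-Instruct-AWQ"
served_model_name="Qwen2.5-72B-AWQ"

source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --device xpu \
  --dtype auto \
  --enforce-eager \
  --load-in-low-bit asym_int4 \
  --max-model-len 2048 \
  --max-num-seqs 12 \
  --tensor-parallel-size 4 \
  --distributed-executor-backend ray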