linuxliker opened this issue 1 month ago
Hi @linuxliker, I have checked the document you mentioned but was unable to reproduce the error on our end; the vLLM CPU service starts normally on our machine. To help us diagnose the problem further, could you please provide the Docker image ID you are using? It would also be very helpful if you could run the env-check.sh script to gather the environment information and share the results with us.
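Something like the following should collect both pieces of information. This is only a sketch: the container name and the script location inside the image are assumptions, so adjust them to your setup.

```bash
# Sketch only: container name and env-check.sh path are assumptions, adjust as needed.
sudo docker images intelanalytics/ipex-llm-serving-cpu     # shows the IMAGE ID of the tag you pulled
sudo docker exec -it ipex-llm-serving-cpu-container bash   # enter the running container
bash ./env-check.sh                                        # run the bundled environment check and paste its output here
```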
Hi @xiangyuT:

docker image ls
REPOSITORY                            TAG      IMAGE ID       CREATED        SIZE
intelanalytics/ipex-llm-serving-cpu   latest   fa18fa759577   46 hours ago   7.54GB
Result of running env-check.sh inside the Docker container:
Operating System: Ubuntu 22.04.4 LTS
xpu-smi is not installed. Please install xpu-smi according to README.md
It appears that your environment is configured correctly. However, I noticed that you are running a GPTQ-Int4 model, which might not be supported by the vLLM CPU backend. You can find more details in the vLLM documentation.
To use int4 quantization with vLLM CPU, you can utilize IPEX-LLM by setting --load-in-low-bit to sym_int4 in the start-vllm-service.sh script.
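For reference, here is a minimal sketch of what the launch command in start-vllm-service.sh could look like with that change, reconstructed from the Namespace(...) args dump in your log. The exact contents of the shipped script and the non-GPTQ model path are assumptions.

```bash
# Sketch reconstructed from the args in the log; not the verbatim shipped script.
# Key change: --load-in-low-bit sym_int4, and --model pointing at the original (non-GPTQ) weights,
# since the GPTQ-Int4 checkpoint is what triggers the CUDA-only GPTQ/Marlin compatibility check.
python -m ipex_llm.vllm.cpu.entrypoints.openai.api_server \
  --served-model-name Qwen2-72B \
  --model /llm/models/Qwen/Qwen2-72B-Instruct \
  --port 8000 \
  --trust-remote-code \
  --device cpu \
  --dtype bfloat16 \
  --enforce-eager \
  --load-in-low-bit sym_int4 \
  --max-model-len 4096 \
  --max-num-batched-tokens 10240 \
  --max-num-seqs 12
```

With --load-in-low-bit, the weights are quantized by IPEX-LLM at load time, so a pre-quantized GPTQ checkpoint is not needed.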
Additionally, there is a recent bug fix for IPEX-LLM vLLM CPU, so I recommend updating your Docker image to the latest version. Please note that it is a daily-updated image, so the SHA256 hash (e.g., 9e838920c9ab for the 0726 version) might change; you can verify the latest push time on the Docker Hub page.
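For example, a minimal sketch for refreshing the image and checking which build you have locally (compare the digest with whatever is currently shown on the registry page):

```bash
# Refresh the image and check which build is installed locally.
sudo docker pull intelanalytics/ipex-llm-serving-cpu:latest
sudo docker image ls --digests intelanalytics/ipex-llm-serving-cpu   # compare digest / IMAGE ID with the latest push
```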
I followed the doc: https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_cpu_docker_quickstart.md. I am running on an Alibaba Cloud (Aliyun) ecs.c8i.24xlarge ECS instance (https://help.aliyun.com/zh/ecs/use-cases/deploy-qwen-72b-chat-on-an-8th-generation-intel-instance).
I get this error; how can I fix it?
./start-vllm-service.sh
WARNING 07-24 06:34:00 ray_utils.py:46] Failed to import Ray with ModuleNotFoundError("No module named 'ray'"). For distributed inference, please install Ray with `pip install ray`.
2024-07-24 06:34:01,407 - INFO - vLLM API server version 0.4.2
2024-07-24 06:34:01,407 - INFO - args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/llm/models/Qwen/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=10240, max_num_seqs=12, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, fully_sharded_loras=False, device='cpu', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, served_model_name=['Qwen2-72B'], engine_use_ray=False, disable_log_requests=False, max_log_len=None, load_in_low_bit='bf16')
WARNING 07-24 06:34:01 config.py:1086] Casting torch.float16 to torch.bfloat16.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/cpu/entrypoints/openai/api_server.py", line 177, in <module>
    engine = IPEXLLMAsyncLLMEngine.from_engine_args(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/cpu/engine/engine.py", line 44, in from_engine_args
    engine_config = engine_args.create_engine_config()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/engine/arg_utils.py", line 520, in create_engine_config
    model_config = ModelConfig(
                   ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/config.py", line 131, in __init__
    self._verify_quantization()
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/config.py", line 170, in _verify_quantization
    elif GPTQMarlinConfig.is_marlin_compatible(quant_cfg):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/model_executor/layers/quantization/gptq_marlin.py", line 144, in is_marlin_compatible
    major, minor = torch.cuda.get_device_capability()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
    prop = get_device_properties(device)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
    _lazy_init()  # will define _get_device_properties
    ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 284, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled