linuxliker opened this issue 1 month ago
Hi @linuxliker, I have checked the document you mentioned but was unable to reproduce the error on our end; the vLLM CPU service starts normally on our machine. To help us diagnose the problem further, could you please provide the Docker image ID you are using? It would also be very helpful if you could run the env-check.sh script to gather the environment information and share the results with us.
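Something like the following should collect both pieces of information. This is only a sketch: the container name and the script location inside the image are assumptions, so adjust them to your setup.

```bash
# Sketch only: container name and env-check.sh path are assumptions, adjust as needed.
sudo docker images intelanalytics/ipex-llm-serving-cpu     # shows the IMAGE ID of the tag you pulled
sudo docker exec -it ipex-llm-serving-cpu-container bash   # enter the running container
bash ./env-check.sh                                        # run the bundled environment check and paste its output here
```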
Hi @xiangyuT:

docker image ls
REPOSITORY                            TAG      IMAGE ID       CREATED        SIZE
intelanalytics/ipex-llm-serving-cpu   latest   fa18fa759577   46 hours ago   7.54GB
Result of running env-check.sh inside the Docker container:
Operating System: Ubuntu 22.04.4 LTS
xpu-smi is not installed. Please install xpu-smi according to README.md
It appears that your environment is configured correctly. However, I noticed that you are running a GPTQ-Int4 model, which might not be supported by the vLLM CPU backend. You can find more details in the vLLM documentation.
To use int4 quantization with vLLM CPU, you can utilize IPEX-LLM by setting --load-in-low-bit to sym_int4 in the start-vllm-service.sh script.
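For reference, here is a minimal sketch of what the launch command in start-vllm-service.sh could look like with that change, reconstructed from the Namespace(...) args dump in your log. The exact contents of the shipped script and the non-GPTQ model path are assumptions.

```bash
# Sketch reconstructed from the args in the log; not the verbatim shipped script.
# Key change: --load-in-low-bit sym_int4, and --model pointing at the original (non-GPTQ) weights,
# since the GPTQ-Int4 checkpoint is what triggers the CUDA-only GPTQ/Marlin compatibility check.
python -m ipex_llm.vllm.cpu.entrypoints.openai.api_server \
  --served-model-name Qwen2-72B \
  --model /llm/models/Qwen/Qwen2-72B-Instruct \
  --port 8000 \
  --trust-remote-code \
  --device cpu \
  --dtype bfloat16 \
  --enforce-eager \
  --load-in-low-bit sym_int4 \
  --max-model-len 4096 \
  --max-num-batched-tokens 10240 \
  --max-num-seqs 12
```

With --load-in-low-bit, the weights are quantized by IPEX-LLM at load time, so a pre-quantized GPTQ checkpoint is not needed.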
Additionally, there is a recent bug fix for IPEX-LLM vLLM CPU, so I recommend updating your Docker image to the latest version. Please note that it is a daily-updated image, so the SHA256 hash (e.g., 9e838920c9ab for the 0726 version) might change; you can verify the latest push time on the Docker Hub page.
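For example, a minimal sketch for refreshing the image and checking which build you have locally (compare the digest with whatever is currently shown on the registry page):

```bash
# Refresh the image and check which build is installed locally.
sudo docker pull intelanalytics/ipex-llm-serving-cpu:latest
sudo docker image ls --digests intelanalytics/ipex-llm-serving-cpu   # compare digest / IMAGE ID with the latest push
```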
I followed the doc: https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_cpu_docker_quickstart.md. I am running on an Alibaba Cloud (Aliyun) ecs.c8i.24xlarge ECS instance (https://help.aliyun.com/zh/ecs/use-cases/deploy-qwen-72b-chat-on-an-8th-generation-intel-instance).
I get this error; how can I fix it?
./start-vllm-service.sh
WARNING 07-24 06:34:00 ray_utils.py:46] Failed to import Ray with ModuleNotFoundError("No module named 'ray'"). For distributed inference, please install Ray with `pip install ray`.
2024-07-24 06:34:01,407 - INFO - vLLM API server version 0.4.2
2024-07-24 06:34:01,407 - INFO - args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/llm/models/Qwen/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=10240, max_num_seqs=12, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, fully_sharded_loras=False, device='cpu', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, served_model_name=['Qwen2-72B'], engine_use_ray=False, disable_log_requests=False, max_log_len=None, load_in_low_bit='bf16')
WARNING 07-24 06:34:01 config.py:1086] Casting torch.float16 to torch.bfloat16.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/cpu/entrypoints/openai/api_server.py", line 177, in <module>
    engine = IPEXLLMAsyncLLMEngine.from_engine_args(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/cpu/engine/engine.py", line 44, in from_engine_args
    engine_config = engine_args.create_engine_config()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/engine/arg_utils.py", line 520, in create_engine_config
    model_config = ModelConfig(
                   ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/config.py", line 131, in __init__
    self._verify_quantization()
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/config.py", line 170, in _verify_quantization
    elif GPTQMarlinConfig.is_marlin_compatible(quant_cfg):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm-0.4.2+cpu-py3.11-linux-x86_64.egg/vllm/model_executor/layers/quantization/gptq_marlin.py", line 144, in is_marlin_compatible
    major, minor = torch.cuda.get_device_capability()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
    prop = get_device_properties(device)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
    _lazy_init()  # will define _get_device_properties
    ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 284, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled