intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Failure to load the LLM model in vLLM on 8 Arc GPUs #11789

Open oldmikeyang opened 1 month ago

oldmikeyang commented 1 month ago

With the ipex-llm Docker container intelanalytics/ipex-llm-serving-vllm-xpu-experiment:2.1.0b2, the model loads successfully on 4 Arc GPUs. When the model is loaded across 8 Arc GPUs, however, startup fails with the error shown in the log below.
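
For context, the launch appears to correspond to roughly the following invocation. This is a sketch reconstructed from the args Namespace printed in the log; the actual contents of start-vllm-service.sh are not shown in this issue, and the dash-style flag spellings are assumed from vLLM's usual CLI conventions (the entrypoint module path is taken from the traceback below):

```bash
# Sketch of the launch, inferred from the logged Namespace; not the actual script.
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name Qwen1.5-7B-Chat \
  --model /llm/models/Qwen/Qwen1.5-7B-Chat \
  --device xpu \
  --dtype float16 \
  --load-in-low-bit fp6 \
  --max-model-len 4096 \
  --max-num-batched-tokens 10240 \
  --max-num-seqs 12 \
  --gpu-memory-utilization 0.75 \
  --enforce-eager \
  --trust-remote-code \
  --tensor-parallel-size 8   # fails with 8; the same setup works with 4
```

The full startup log follows: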

root@GPU-Xeon4410Y-ARC770:/llm# bash start-vllm-service.sh
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''
If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2024-08-14 11:07:55,600 - INFO - intel_extension_for_pytorch auto imported
INFO 08-14 11:07:56 api_server.py:258] vLLM API server version 0.3.3
INFO 08-14 11:07:56 api_server.py:259] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='Qwen1.5-7B-Chat', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], load_in_low_bit='fp6', model='/llm/models/Qwen/Qwen1.5-7B-Chat', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='float16', kv_cache_dtype='auto', max_model_len=4096, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, seed=0, swap_space=4, gpu_memory_utilization=0.75, max_num_batched_tokens=10240, max_num_seqs=12, max_paddings=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='xpu', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 08-14 11:07:56 config.py:710] Casting torch.bfloat16 to torch.float16.
INFO 08-14 11:07:56 config.py:523] Custom all-reduce kernels are temporarily disabled due to stability issues. We will re-enable them once the issues are resolved.
2024-08-14 11:07:58,897 INFO worker.py:1788 -- Started a local Ray instance.
INFO 08-14 11:07:59 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='/llm/models/Qwen/Qwen1.5-7B-Chat', tokenizer='/llm/models/Qwen/Qwen1.5-7B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=8, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, seed=0, max_num_batched_tokens=10240, max_num_seqs=12, max_model_len=4096)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(RayWorkerVllm pid=32282) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '' [same torchvision warning as above]
(RayWorkerVllm pid=32282)   warn(
(RayWorkerVllm pid=32483) 2024-08-14 11:08:17,825 - INFO - intel_extension_for_pytorch auto imported
INFO 08-14 11:08:18 attention.py:71] flash_attn is not found. Using xformers backend.
(RayWorkerVllm pid=32094) INFO 08-14 11:08:18 attention.py:71] flash_attn is not found. Using xformers backend.
2024-08-14 11:08:19,069 - INFO - Converting the current model to fp6 format......
2024-08-14 11:08:19,069 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
[2024-08-14 11:08:20,124] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to xpu (auto detect)
(RayWorkerVllm pid=32483) 2024-08-14 11:08:20,271 - INFO - Converting the current model to fp6 format......
(RayWorkerVllm pid=32483) 2024-08-14 11:08:20,272 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(RayWorkerVllm pid=32094) [same torchvision warning as above] [repeated 6x across cluster]
(Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerVllm pid=32094)   warn( [repeated 6x across cluster]
2024-08-14 11:08:21,272 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(RayWorkerVllm pid=32483) [2024-08-14 11:08:21,256] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to xpu (auto detect)
INFO 08-14 11:08:21 model_convert.py:249] Loading model weights took 1.0264 GB
(RayWorkerVllm pid=32349) 2024-08-14 11:08:18,290 - INFO - intel_extension_for_pytorch auto imported [repeated 6x across cluster]
(RayWorkerVllm pid=32551) 2024-08-14 11:08:20,708 - INFO - Converting the current model to fp6 format...... [repeated 6x across cluster]
(RayWorkerVllm pid=32483) 2024-08-14 11:08:25,761 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations [repeated 7x across cluster]
(RayWorkerVllm pid=32483) INFO 08-14 11:08:26 model_convert.py:249] Loading model weights took 1.0264 GB
(RayWorkerVllm pid=32551) INFO 08-14 11:08:18 attention.py:71] flash_attn is not found. Using xformers backend. [repeated 6x across cluster]
(RayWorkerVllm pid=32551) [2024-08-14 11:08:21,778] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to xpu (auto detect) [repeated 6x across cluster]
2024:08:14-11:08:27:(28904) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2024:08:14-11:08:27:(28904) |CCL_WARN| fallback to 'sockets' mode of ze exchange mechanism, to use CCL_ZE_IPC_EXHANGE=drmfd, set CCL_LOCAL_RANK/SIZE explicitly or use process launcher
(RayWorkerVllm pid=32094) 2024:08:14-11:08:28:(32094) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
(RayWorkerVllm pid=32094) 2024:08:14-11:08:28:(32094) |CCL_WARN| fallback to 'sockets' mode of ze exchange mechanism, to use CCL_ZE_IPC_EXHANGE=drmfd, set CCL_LOCAL_RANK/SIZE explicitly or use process launcher
2024:08:14-11:08:29:(33884) |CCL_WARN| no membind support for NUMA node 1, skip thread membind
2024:08:14-11:08:29:(33896) |CCL_WARN| no membind support for NUMA node 1, skip thread membind
(RayWorkerVllm pid=32094) 2024:08:14-11:08:29:(33886) |CCL_WARN| no membind support for NUMA node 1, skip thread membind
(RayWorkerVllm pid=32094) 2024:08:14-11:08:29:(33892) |CCL_WARN| no membind support for NUMA node 1, skip thread membind
2024:08:14-11:08:30:(28904) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
[the topology recognition warning above repeats dozens of times, from the driver process (pid 28904) at 11:08:30 and 11:08:32 and from RayWorkerVllm pid=32094 at 11:08:30; identical repetitions omitted]
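
(Aside: the oneCCL warnings above name their own tuning knobs. Below is a minimal sketch of the environment settings they point to, assuming the variables behave as the messages describe; whether any of them bears on the crash that follows is not established in this issue.)

```bash
# Settings suggested by the CCL_WARN messages themselves (a sketch, not a verified fix).
# Give each rank an explicit local rank/size so oneCCL can use the drmfd
# ze exchange mechanism instead of falling back to 'sockets':
export CCL_ZE_IPC_EXCHANGE=drmfd   # the warning prints "CCL_ZE_IPC_EXHANGE"; assuming oneCCL's documented spelling
export CCL_LOCAL_RANK=0            # per-process local rank, normally set by a process launcher
export CCL_LOCAL_SIZE=8            # one rank per Arc GPU in this setup

# If the cards really are XeLink-connected and the PCIe detection is wrong,
# topology recognition can be disabled, as the warning says:
export CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0
```

The log then continues into the crash: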
(RayWorkerVllm pid=32162) INFO 08-14 11:08:27 model_convert.py:249] Loading model weights took 1.0264 GB [repeated 6x across cluster]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 267, in <module>
    engine = IPEXLLMAsyncLLMEngine.from_engine_args(engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 57, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 30, in __init__
    super().__init__(*args, **kwargs)
  File "/llm/vllm/vllm/engine/async_llm_engine.py", line 309, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/llm/vllm/vllm/engine/async_llm_engine.py", line 409, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/llm/vllm/vllm/engine/llm_engine.py", line 106, in __init__
    self.model_executor = executor_class(model_config, cache_config,
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 77, in __init__
    self._init_cache()
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 249, in _init_cache
    num_blocks = self._run_workers(
                 ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 347, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/llm/vllm/vllm/worker/worker.py", line 136, in profile_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/llm/vllm/vllm/worker/model_runner.py", line 645, in profile_run
    self.execute_model(seqs, kv_caches)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/llm/vllm/vllm/worker/model_runner.py", line 581, in execute_model
    hidden_states = model_executable(
                    ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/llm/vllm/vllm/model_executor/models/qwen2.py", line 316, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/llm/vllm/vllm/model_executor/models/qwen2.py", line 257, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/llm/vllm/vllm/model_executor/models/qwen2.py", line 208, in forward
    hidden_states, residual = self.input_layernorm(
                              ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/llm/vllm/vllm/model_executor/layers/layernorm.py", line 52, in forward
    ops.fused_add_rms_norm(
TypeError: fused_add_rms_norm(): incompatible function arguments. The following argument types are supported:

  1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: float) -> None

Invoked with: (tensor([[[-0.0239,  0.0522,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0522,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0522,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     ...,
     [-0.0240,  0.0522,  0.0044,  ..., -0.0462,  0.1114,  0.0284],
     [-0.0240,  0.0522,  0.0044,  ..., -0.0462,  0.1114,  0.0284],
     [-0.0240,  0.0522,  0.0044,  ..., -0.0462,  0.1114,  0.0284]],

    [[-0.0240,  0.0522,  0.0044,  ..., -0.0462,  0.1114,  0.0284],
     [-0.0240,  0.0522,  0.0044,  ..., -0.0462,  0.1114,  0.0284],
     [-0.0240,  0.0522,  0.0044,  ..., -0.0462,  0.1114,  0.0284],
     ...,
     [-0.0239,  0.0521,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0521,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0521,  0.0044,  ..., -0.0462,  0.1113,  0.0284]],

    [[-0.0239,  0.0521,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0521,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0521,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     ...,
     [-0.0239,  0.0522,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0522,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0522,  0.0044,  ..., -0.0462,  0.1113,  0.0284]],

    ...,

    [[-0.0239,  0.0521,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0521,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0521,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     ...,
     [-0.0239,  0.0521,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0521,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0521,  0.0044,  ..., -0.0462,  0.1113,  0.0284]],

    [[-0.0239,  0.0521,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0521,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0521,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     ...,
     [-0.0239,  0.0522,  0.0045,  ..., -0.0461,  0.1113,  0.0284],
     [-0.0239,  0.0522,  0.0045,  ..., -0.0461,  0.1113,  0.0284],
     [-0.0239,  0.0522,  0.0045,  ..., -0.0461,  0.1113,  0.0284]],

    [[-0.0239,  0.0522,  0.0045,  ..., -0.0461,  0.1113,  0.0284],
     [-0.0239,  0.0522,  0.0045,  ..., -0.0461,  0.1113,  0.0284],
     [-0.0239,  0.0522,  0.0045,  ..., -0.0461,  0.1113,  0.0284],
     ...,
     [-0.0239,  0.0522,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0522,  0.0044,  ..., -0.0462,  0.1113,  0.0284],
     [-0.0239,  0.0522,  0.0044,  ..., -0.0462,  0.1113,  0.0284]]],
   device='xpu:0', dtype=torch.float16), None), tensor([[[-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0165],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0165],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0165],
     ...,
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166]],

    [[-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     ...,
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166]],

    [[-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     ...,
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0165],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0165],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0165]],

    ...,

    [[-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     ...,
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166]],

    [[-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0166],
     ...,
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0165],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0165],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0165]],

    [[-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0165],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0165],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0165],
     ...,
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0165],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0165],
     [-0.0142, -0.0132,  0.0210,  ...,  0.0883,  0.0250,  0.0165]]],
   device='xpu:0', dtype=torch.float16), tensor([0.1367, 0.0952, 0.1030,  ..., 0.1338, 0.0845, 0.0928], device='xpu:0',
   dtype=torch.float16), 1e-06

(RayWorkerVllm pid=32551) 2024:08:14-11:08:28:(32551) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL [repeated 6x across cluster]
(RayWorkerVllm pid=32551) 2024:08:14-11:08:28:(32551) |CCL_WARN| fallback to 'sockets' mode of ze exchange mechanism, to use CCL_ZE_IPC_EXHANGE=drmfd, set CCL_LOCAL_RANK/SIZE explicitly or use process launcher [repeated 6x across cluster]
(RayWorkerVllm pid=32551) 2024:08:14-11:08:29:(33894) |CCL_WARN| no membind support for NUMA node 0, skip thread membind [repeated 12x across cluster]
(RayWorkerVllm pid=32551) 2024:08:14-11:08:32:(32551) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices [repeated 728x across cluster]
(RayWorkerVllm pid=32162) 2024-08-14 11:08:27,278 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations [repeated 6x across cluster]
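For anyone parsing the TypeError above: the only accepted overload is (Tensor, Tensor, Tensor, float), but the "Invoked with" dump shows the first positional argument arriving as the tuple (tensor, None). In other words, the previous layer's (hidden_states, residual) return value reached the fused kernel without being unpacked, and per the report above this only bites with tensor-parallel-size 8 (4 ARCs load fine). A minimal pure-PyTorch sketch of the same failure mode follows; fused_add_rms_norm_ref is a hypothetical stand-in for the XPU kernel, not ipex-llm's actual code:

import torch

def fused_add_rms_norm_ref(x: torch.Tensor, residual: torch.Tensor,
                           weight: torch.Tensor, eps: float) -> None:
    # Stand-in semantics: in-place residual add followed by RMSNorm,
    # mirroring the (Tensor, Tensor, Tensor, float) signature above.
    residual.add_(x)
    variance = residual.float().pow(2).mean(-1, keepdim=True)
    x.copy_((residual.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight)

hidden = torch.randn(2, 4, 8).half()    # arbitrary illustrative shapes
residual = torch.randn(2, 4, 8).half()
weight = torch.ones(8, dtype=torch.float16)

# What the traceback shows: a (hidden_states, None) tuple was passed through
# unmodified, so argument 0 is a tuple instead of a Tensor.
bad_first_arg = (hidden, None)
try:
    fused_add_rms_norm_ref(bad_first_arg, residual, weight, 1e-6)  # type: ignore[arg-type]
except TypeError as e:
    print(f"same class of failure as in the log: {e}")

# Unpacking before the call restores the expected signature:
hidden_states, _ = bad_first_arg
fused_add_rms_norm_ref(hidden_states, residual, weight, 1e-6)
print(hidden_states.shape)  # torch.Size([2, 4, 8])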

gc-fu commented 4 weeks ago

Hi, I am currently investigating this issue. I will update this issue once I have a fix.

gc-fu commented 4 weeks ago

Hi, this should have been fixed by PR: https://github.com/intel-analytics/ipex-llm/pull/11817

You can upgrade ipex-llm tomorrow, once the nightly build includes the fix, and see whether it works.
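(For reference, and as an assumption on my part rather than something stated in this thread: upgrading to the nightly wheel is usually done with pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ ; check the ipex-llm installation docs for the exact command for your environment.)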

oldmikeyang commented 4 weeks ago

With the latest IPEX-LLM, the following error occurs during inference:

INFO 08-16 10:12:59 metrics.py:217] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 08-16 10:12:59 async_llm_engine.py:494] Received request cmpl-a50bf7e6bc264357815b2c77018ec28e-0: prompt: 'San Francisco is a', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=128, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [23729, 12879, 374, 264], lora_request: None.
INFO 08-16 10:13:09 metrics.py:217] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 08-16 10:13:19 metrics.py:217] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
ERROR 08-16 10:13:19 async_llm_engine.py:41] Engine background task failed
ERROR 08-16 10:13:19 async_llm_engine.py:41] Traceback (most recent call last):
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 36, in _raise_exception_on_finish
ERROR 08-16 10:13:19 async_llm_engine.py:41]     task.result()
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 467, in run_engine_loop
ERROR 08-16 10:13:19 async_llm_engine.py:41]     has_requests_in_progress = await asyncio.wait_for(
ERROR 08-16 10:13:19 async_llm_engine.py:41]                                ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/usr/lib/python3.11/asyncio/tasks.py", line 489, in wait_for
ERROR 08-16 10:13:19 async_llm_engine.py:41]     return fut.result()
ERROR 08-16 10:13:19 async_llm_engine.py:41]            ^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 441, in engine_step
ERROR 08-16 10:13:19 async_llm_engine.py:41]     request_outputs = await self.engine.step_async()
ERROR 08-16 10:13:19 async_llm_engine.py:41]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 211, in step_async
ERROR 08-16 10:13:19 async_llm_engine.py:41]     output = await self.model_executor.execute_model_async(
ERROR 08-16 10:13:19 async_llm_engine.py:41]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 443, in execute_model_async
ERROR 08-16 10:13:19 async_llm_engine.py:41]     all_outputs = await self._run_workers_async(
ERROR 08-16 10:13:19 async_llm_engine.py:41]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 433, in _run_workers_async
ERROR 08-16 10:13:19 async_llm_engine.py:41]     all_outputs = await asyncio.gather(*coros)
ERROR 08-16 10:13:19 async_llm_engine.py:41]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/usr/lib/python3.11/asyncio/tasks.py", line 694, in _wrap_awaitable
ERROR 08-16 10:13:19 async_llm_engine.py:41]     return (yield from awaitable.__await__())
ERROR 08-16 10:13:19 async_llm_engine.py:41]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41] ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerVllm.execute_method() (pid=195136, ip=10.240.108.91, actor_id=b933b7411289683bf7fc97c201000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x77b08879b6d0>)
ERROR 08-16 10:13:19 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/vllm-ipex-forked/vllm/engine/ray_utils.py", line 37, in execute_method
ERROR 08-16 10:13:19 async_llm_engine.py:41]     return executor(*args, **kwargs)
ERROR 08-16 10:13:19 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 08-16 10:13:19 async_llm_engine.py:41]     return func(*args, **kwargs)
ERROR 08-16 10:13:19 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/vllm-ipex-forked/vllm/worker/worker.py", line 236, in execute_model
ERROR 08-16 10:13:19 async_llm_engine.py:41]     output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 08-16 10:13:19 async_llm_engine.py:41]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 08-16 10:13:19 async_llm_engine.py:41]     return func(*args, **kwargs)
ERROR 08-16 10:13:19 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/vllm-ipex-forked/vllm/worker/model_runner.py", line 581, in execute_model
ERROR 08-16 10:13:19 async_llm_engine.py:41]     hidden_states = model_executable(
ERROR 08-16 10:13:19 async_llm_engine.py:41]                     ^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ERROR 08-16 10:13:19 async_llm_engine.py:41]     return self._call_impl(*args, **kwargs)
ERROR 08-16 10:13:19 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ERROR 08-16 10:13:19 async_llm_engine.py:41]     return forward_call(*args, **kwargs)
ERROR 08-16 10:13:19 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/vllm-ipex-forked/vllm/model_executor/models/qwen2.py", line 316, in forward
ERROR 08-16 10:13:19 async_llm_engine.py:41]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 08-16 10:13:19 async_llm_engine.py:41]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ERROR 08-16 10:13:19 async_llm_engine.py:41]     return self._call_impl(*args, **kwargs)
ERROR 08-16 10:13:19 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ERROR 08-16 10:13:19 async_llm_engine.py:41]     return forward_call(*args, **kwargs)
ERROR 08-16 10:13:19 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/vllm-ipex-forked/vllm/model_executor/models/qwen2.py", line 253, in forward
ERROR 08-16 10:13:19 async_llm_engine.py:41]     hidden_states = self.embed_tokens(input_ids)
ERROR 08-16 10:13:19 async_llm_engine.py:41]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ERROR 08-16 10:13:19 async_llm_engine.py:41]     return self._call_impl(*args, **kwargs)
ERROR 08-16 10:13:19 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ERROR 08-16 10:13:19 async_llm_engine.py:41]     return forward_call(*args, **kwargs)
ERROR 08-16 10:13:19 async_llm_engine.py:41]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41]   File "/home/llm/vllm-ipex-forked/vllm/model_executor/layers/vocab_parallel_embedding.py", line 107, in forward
ERROR 08-16 10:13:19 async_llm_engine.py:41]     output_parallel[input_mask, :] = 0.0
ERROR 08-16 10:13:19 async_llm_engine.py:41]     ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
ERROR 08-16 10:13:19 async_llm_engine.py:41] RuntimeError: Allocation is out of device memory on current platform.
2024-08-16 10:13:19,835 - ERROR - Exception in callback functools.partial(<function _raise_exception_on_finish at 0x701317444040>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.xpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7013133c7310>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x701317444040>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.xpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7013133c7310>>)>
Traceback (most recent call last):
  File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 36, in _raise_exception_on_finish
    task.result()
  File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 467, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
                               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/tasks.py", line 489, in wait_for
    return fut.result()
           ^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 441, in engine_step
    request_outputs = await self.engine.step_async()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 211, in step_async
    output = await self.model_executor.execute_model_async(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 443, in execute_model_async
    all_outputs = await self._run_workers_async(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 433, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/tasks.py", line 694, in _wrap_awaitable
    return (yield from awaitable.__await__())
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerVllm.execute_method() (pid=195136, ip=10.240.108.91, actor_id=b933b7411289683bf7fc97c201000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x77b08879b6d0>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/engine/ray_utils.py", line 37, in execute_method
    return executor(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/worker/worker.py", line 236, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/worker/model_runner.py", line 581, in execute_model
    hidden_states = model_executable(
                    ^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/model_executor/models/qwen2.py", line 316, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/model_executor/models/qwen2.py", line 253, in forward
    hidden_states = self.embed_tokens(input_ids)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/model_executor/layers/vocab_parallel_embedding.py", line 107, in forward
    output_parallel[input_mask, :] = 0.0


RuntimeError: Allocation is out of device memory on current platform.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 43, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 08-16 10:13:19 async_llm_engine.py:152] Aborted request cmpl-a50bf7e6bc264357815b2c77018ec28e-0.
INFO:     127.0.0.1:44858 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/starlette/routing.py", line 754, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/starlette/routing.py", line 774, in app
    await route.handle(scope, receive, send)
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/starlette/routing.py", line 295, in handle
    await self.app(scope, receive, send)
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/starlette/routing.py", line 74, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 213, in create_completion
    generator = await openai_serving_completion.create_completion(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/entrypoints/openai/serving_completion.py", line 179, in create_completion
    async for i, res in result_generator:
  File "/home/llm/vllm-ipex-forked/vllm/entrypoints/openai/serving_completion.py", line 82, in consumer
    raise item
  File "/home/llm/vllm-ipex-forked/vllm/entrypoints/openai/serving_completion.py", line 67, in producer
    async for item in iterator:
  File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 625, in generate
    raise e
  File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 619, in generate
    async for request_output in stream:
  File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 75, in __anext__
    raise result
  File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 36, in _raise_exception_on_finish
    task.result()
  File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 467, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
                               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/tasks.py", line 489, in wait_for
    return fut.result()
           ^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 441, in engine_step
    request_outputs = await self.engine.step_async()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/engine/async_llm_engine.py", line 211, in step_async
    output = await self.model_executor.execute_model_async(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 443, in execute_model_async
    all_outputs = await self._run_workers_async(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 433, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/tasks.py", line 694, in _wrap_awaitable
    return (yield from awaitable.__await__())
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerVllm.execute_method() (pid=195136, ip=10.240.108.91, actor_id=b933b7411289683bf7fc97c201000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x77b08879b6d0>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/engine/ray_utils.py", line 37, in execute_method
    return executor(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/worker/worker.py", line 236, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/worker/model_runner.py", line 581, in execute_model
    hidden_states = model_executable(
                    ^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/model_executor/models/qwen2.py", line 316, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/model_executor/models/qwen2.py", line 253, in forward
    hidden_states = self.embed_tokens(input_ids)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llm/vllm-ipex-forked/vllm/model_executor/layers/vocab_parallel_embedding.py", line 107, in forward
    output_parallel[input_mask, :] = 0.0
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
RuntimeError: Allocation is out of device memory on current platform.
INFO 08-16 10:13:29 metrics.py:217] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 08-16 10:13:39 metrics.py:217] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
2024:08:16-10:13:39:(196608) |CCL_ERROR| worker.cpp:353 ccl_worker_func: worker 6 caught internal exception: oneCCL: ze_call.cpp:43 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
[2024-08-16 10:13:39,930 E 191503 196608] logging.cc:108: Unhandled exception: N3ccl2v19exceptionE. what(): oneCCL: ze_call.cpp:43 do_call: EXCEPTION: ze error at zeCommandQueueExecuteCommandLists, code: ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
[2024-08-16 10:13:39,938 E 191503 196608] logging.cc:115: Stack trace:
 /home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/ray/_raylet.so(+0x10b7bea) [0x7013082b7bea] ray::operator<<()
/home/llm/venv/ipex-llm-0816/lib/python3.11/site-packages/ray/_raylet.so(+0x10bae72) [0x7013082bae72] ray::TerminateHandler()
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x70128c4ae20c]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x70128c4ae277]
/opt/intel/1ccl-wks/lib/libccl.so.1(+0x4c26e9) [0x6fe1a54c26e9] ccl_worker_func()
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x70131d494ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x70131d526850]

*** SIGABRT received at time=1723774419 on cpu 41 ***
PC: @     0x70131d4969fc  (unknown)  pthread_kill
    @     0x70131d442520  (unknown)  (unknown)
[2024-08-16 10:13:39,938 E 191503 196608] logging.cc:440: *** SIGABRT received at time=1723774419 on cpu 41 ***
[2024-08-16 10:13:39,938 E 191503 196608] logging.cc:440: PC: @     0x70131d4969fc  (unknown)  pthread_kill
[2024-08-16 10:13:39,939 E 191503 196608] logging.cc:440:     @     0x70131d442520  (unknown)  (unknown)
Fatal Python error: Aborted

Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, sentencepiece._sentencepiece, PIL._imaging, PIL._imagingft, markupsafe._speedups, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, pyarrow.lib, pyarrow._json, httptools.parser.parser, httptools.parser.url_parser, websockets.speedups (total: 49)

LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: spr [Intel(R) Xeon(R) Silver 4410Y]
Registry and code: 13 MB
Command: python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server --served-model-name Qwen2-72B-Instruct --port 8000 --model /home/llm/local_models/Qwen/Qwen2-72B-Instruct --trust-remote-code --gpu-memory-utilization 0.90 --device xpu --dtype float16 --enforce-eager --load-in-low-bit fp8 --max-model-len 6656 --max-num-batched-tokens 6656 --tensor-parallel-size 8
Uptime: 3880.324215 s
start_vllm_arc.sh: line 28: 191503 Aborted                 (core dumped) python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server --served-model-name $served_model_name --port 8000 --model $model --trust-remote-code --gpu-memory-utilization 0.90 --device xpu --dtype float16 --enforce-eager --load-in-low-bit fp8 --max-model-len 6656 --max-num-batched-tokens 6656 --tensor-parallel-size 8

(ipex-llm-0816) llm@GPU-Xeon4410Y-ARC770:~/ipex-llm-0816/python/llm/scripts$ bash env-check.sh
-----------------------------------------------------------------
PYTHON_VERSION=3.11.9
-----------------------------------------------------------------
Transformers is not installed.
-----------------------------------------------------------------
PyTorch is not installed.
-----------------------------------------------------------------
ipex-llm Version: 2.1.0b20240815
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             48
On-line CPU(s) list:                0-47
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Silver 4410Y
CPU family:                         6
Model:                              143
Thread(s) per core:                 2
Core(s) per socket:                 12
Socket(s):                          2
Stepping:                           8
CPU max MHz:                        3900.0000
CPU min MHz:                        800.0000
BogoMIPS:                           4000.00
-----------------------------------------------------------------
Total CPU Memory: 755.542 GB
-----------------------------------------------------------------
Operating System:
Ubuntu 22.04.4 LTS \n \l

-----------------------------------------------------------------
Linux GPU-Xeon4410Y-ARC770 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
    Version: 1.2.27.20240626
    Build ID: 7f002d24

Service:
    Version: 1.2.27.20240626
    Build ID: 7f002d24
    Level Zero Version: 1.16.0
-----------------------------------------------------------------
  Driver UUID                                     32342e31-332e-3239-3133-382e37000000
  Driver Version                                  24.13.29138.7
  Driver UUID                                     32342e31-332e-3239-3133-382e37000000
  Driver Version                                  24.13.29138.7
  Driver UUID                                     32342e31-332e-3239-3133-382e37000000
  Driver Version                                  24.13.29138.7
  Driver UUID                                     32342e31-332e-3239-3133-382e37000000
  Driver Version                                  24.13.29138.7
  Driver UUID                                     32342e31-332e-3239-3133-382e37000000
  Driver Version                                  24.13.29138.7
  Driver UUID                                     32342e31-332e-3239-3133-382e37000000
  Driver Version                                  24.13.29138.7
  Driver UUID                                     32342e31-332e-3239-3133-382e37000000
  Driver Version                                  24.13.29138.7
  Driver UUID                                     32342e31-332e-3239-3133-382e37000000
  Driver Version                                  24.13.29138.7
-----------------------------------------------------------------
Driver related package version:
ii  intel-fw-gpu                                   2024.17.5-329~22.04                     all          Firmware package for Intel integrated and discrete GPUs
ii  intel-i915-dkms                                1.24.3.23.240419.26+i30-1               all          Out of tree i915 driver.
ii  intel-level-zero-gpu                           1.3.29138.7                             amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero-dev                                 1.16.15-881~22.04                       amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
env-check.sh: line 167: sycl-ls: command not found
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0019-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:19:00.0                                                        |
|           | DRM Device: /dev/dri/card1                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 1         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-002c-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:2c:00.0                                                        |
|           | DRM Device: /dev/dri/card2                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 2         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0052-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:52:00.0                                                        |
|           | DRM Device: /dev/dri/card3                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 3         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0065-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:65:00.0                                                        |
|           | DRM Device: /dev/dri/card4                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 4         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-009b-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:9b:00.0                                                        |
|           | DRM Device: /dev/dri/card5                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 5         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-00ad-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:ad:00.0                                                        |
|           | DRM Device: /dev/dri/card6                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 6         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-00d1-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:d1:00.0                                                        |
|           | DRM Device: /dev/dri/card7                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 7         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-00e3-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:e3:00.0                                                        |
|           | DRM Device: /dev/dri/card8                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=16M
GPU1 Memory size=16G
GPU2 Memory size=16G
GPU3 Memory size=16G
GPU4 Memory size=16G
GPU5 Memory size=16G
GPU6 Memory size=16G
GPU7 Memory size=16G
GPU8 Memory size=16G
-----------------------------------------------------------------
03:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52) (prog-if 00 [VGA controller])
        DeviceName: Onboard VGA
        Subsystem: ASPEED Technology, Inc. ASPEED Graphics Family
        Flags: medium devsel, IRQ 16, NUMA node 0
        Memory at 94000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 95000000 (32-bit, non-prefetchable) [size=256K]
        I/O ports at 2000 [size=128]
        Capabilities: <access denied>
        Kernel driver in use: ast
        Kernel modules: ast
--
19:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Shenzhen Gunnir Technology Development Co., Ltd Device 1334
        Flags: bus master, fast devsel, latency 0, IRQ 130, NUMA node 0
        Memory at 9e000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 5f800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at 9f000000 [disabled] [size=2M]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915
--
2c:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Shenzhen Gunnir Technology Development Co., Ltd Device 1334
        Flags: bus master, fast devsel, latency 0, IRQ 133, NUMA node 0
        Memory at a8000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 6f800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at a9000000 [disabled] [size=2M]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915
--
52:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Shenzhen Gunnir Technology Development Co., Ltd Device 1334
        Flags: bus master, fast devsel, latency 0, IRQ 136, NUMA node 0
        Memory at bc000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 8f800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at bd000000 [disabled] [size=2M]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915
--
65:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Shenzhen Gunnir Technology Development Co., Ltd Device 1334
        Flags: bus master, fast devsel, latency 0, IRQ 139, NUMA node 0
        Memory at c6000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 9f800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at c7000000 [disabled] [size=2M]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915
--
9b:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Shenzhen Gunnir Technology Development Co., Ltd Device 1334
        Flags: bus master, fast devsel, latency 0, IRQ 142, NUMA node 1
        Memory at d8000000 (64-bit, non-prefetchable) [size=16M]
        Memory at cf800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at d9000000 [disabled] [size=2M]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915
--
ad:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Shenzhen Gunnir Technology Development Co., Ltd Device 1334
        Flags: bus master, fast devsel, latency 0, IRQ 145, NUMA node 1
        Memory at e0000000 (64-bit, non-prefetchable) [size=16M]
        Memory at df800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at e1000000 [disabled] [size=2M]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915
--
d1:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Shenzhen Gunnir Technology Development Co., Ltd Device 1334
        Flags: bus master, fast devsel, latency 0, IRQ 148, NUMA node 1
        Memory at f1000000 (64-bit, non-prefetchable) [size=16M]
        Memory at ff800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at f2000000 [disabled] [size=2M]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915
--
e3:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Intel Corporation Device 1020
        Flags: bus master, fast devsel, latency 0, IRQ 151, NUMA node 1
        Memory at f9000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 10f800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at fa000000 [disabled] [size=2M]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915
-----------------------------------------------------------------
gc-fu commented 4 weeks ago

Hi, this problem is due to running out of GPU memory. You can reduce --gpu-memory-utilization, or reduce --max-num-batched-tokens.

Using the following command should fix the problem:

#!/bin/bash
model="/home/llm/local_models/Qwen/Qwen2-72B-Instruct"
served_model_name="Qwen2-72B-Instruct"
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export TORCH_LLM_ALLREDUCE=0
export CCL_DG2_ALLREDUCE=1
# Tensor parallel related arguments:
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
source /opt/intel/1ccl-wks/setvars.sh
# Compared with the failing command above, this lowers --gpu-memory-utilization
# from 0.90 to 0.85 and --max-model-len / --max-num-batched-tokens from 6656 to 4000.
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit fp8 \
  --max-model-len 4000 \
  --max-num-batched-tokens 4000 \
  --tensor-parallel-size 8
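
Once the restarted server is up, a quick way to confirm the fix is to replay the completion request visible in the earlier log. A minimal sketch, assuming the server listens on localhost:8000 and was launched with the script above (adjust the model name and port if yours differ):

import json
import urllib.request

payload = {
    "model": "Qwen2-72B-Instruct",   # must match --served-model-name
    "prompt": "San Francisco is a",  # same prompt as in the failing log
    "max_tokens": 128,
    "temperature": 0,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
print(body["choices"][0]["text"])

If the engine is healthy, this returns a completion instead of the 500 / AsyncEngineDeadError seen above.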