Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
ERROR message:

```
2024-08-21 10:23:49,136 - INFO - intel_extension_for_pytorch auto imported
INFO 08-21 10:23:50 api_server.py:258] vLLM API server version 0.3.3
INFO 08-21 10:23:50 api_server.py:259] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='Qwen1.5-14B-Chat', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], load_in_low_bit='fp8', model='/llm/models/Qwen1.5-14B-Chat', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='float16', kv_cache_dtype='auto', max_model_len=2048, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, seed=0, swap_space=4, gpu_memory_utilization=0.95, max_num_batched_tokens=4000, max_num_seqs=256, max_paddings=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='xpu', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 08-21 10:23:50 config.py:710] Casting torch.bfloat16 to torch.float16.
INFO 08-21 10:23:50 config.py:523] Custom all-reduce kernels are temporarily disabled due to stability issues. We will re-enable them once the issues are resolved.
2024-08-21 10:23:51,166 WARNING services.py:2017 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67067904 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2024-08-21 10:23:52,227 INFO worker.py:1781 -- Started a local Ray instance.
INFO 08-21 10:23:53 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='/llm/models/Qwen1.5-14B-Chat', tokenizer='/llm/models/Qwen1.5-14B-Chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, seed=0, max_num_batched_tokens=4000, max_num_seqs=256, max_model_len=2048)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(RayWorkerVllm pid=7746) /usr/local/lib/python3.11/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
(RayWorkerVllm pid=7746) warnings.warn(
(RayWorkerVllm pid=7746) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
(RayWorkerVllm pid=7746) warn(
(RayWorkerVllm pid=7746) 2024-08-21 10:23:58,055 - INFO - intel_extension_for_pytorch auto imported
INFO 08-21 10:23:58 attention.py:71] flash_attn is not found. Using xformers backend.
(RayWorkerVllm pid=7746) INFO 08-21 10:23:58 attention.py:71] flash_attn is not found. Using xformers backend.
2024-08-21 10:23:59,240 - INFO - Converting the current model to fp8_e5m2 format......
2024-08-21 10:23:59,240 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
[2024-08-21 10:23:59,547] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to xpu (auto detect)
2024-08-21 10:24:02,683 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
INFO 08-21 10:24:04 model_convert.py:249] Loading model weights took 7.4476 GB
(RayWorkerVllm pid=7746) 2024-08-21 10:24:05,822 - INFO - Converting the current model to fp8_e5m2 format......
(RayWorkerVllm pid=7746) 2024-08-21 10:24:05,822 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(RayWorkerVllm pid=7746) [2024-08-21 10:24:06,124] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to xpu (auto detect)
(RayWorkerVllm pid=7746) 2024-08-21 10:24:21,767 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2024:08:21-10:24:25:( 4456) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2024:08:21-10:24:25:( 4456) |CCL_WARN| fallback to 'sockets' mode of ze exchange mechanism, to use CCL_ZE_IPC_EXHANGE=drmfd, set CCL_LOCAL_RANK/SIZE explicitly or use process launcher
(RayWorkerVllm pid=7746) INFO 08-21 10:24:25 model_convert.py:249] Loading model weights took 7.4476 GB
(RayWorkerVllm pid=7746) 2024:08:21-10:24:26:( 7746) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
(RayWorkerVllm pid=7746) 2024:08:21-10:24:26:( 7746) |CCL_WARN| fallback to 'sockets' mode of ze exchange mechanism, to use CCL_ZE_IPC_EXHANGE=drmfd, set CCL_LOCAL_RANK/SIZE explicitly or use process launcher
2024:08:21-10:24:26:( 4456) |CCL_ERROR| atl_ofi_helper.cpp:867 atl_ofi_get_prov_list: fi_getinfo error: ret -61, providers 0
2024:08:21-10:24:26:( 4456) |CCL_ERROR| atl_ofi_helper.cpp:907 atl_ofi_get_prov_list: can't create providers for name shm
2024:08:21-10:24:26:( 4456) |CCL_ERROR| atl_ofi_helper.cpp:1243 atl_ofi_open_nw_provs: atl_ofi_get_prov_list(ctx, prov_name, base_hints, &prov_list)
fails with status: 1
2024:08:21-10:24:26:( 4456) |CCL_ERROR| atl_ofi_helper.cpp:1384 atl_ofi_open_nw_provs: can not open network providers
2024:08:21-10:24:26:( 4456) |CCL_ERROR| atl_ofi.cpp:1036 open_providers: atl_ofi_open_nw_provs failed with status: 1
2024:08:21-10:24:26:( 4456) |CCL_ERROR| atl_ofi.cpp:175 init: open_providers(prov_env, coord, attr, base_hints, open_nw_provs, fi_version, pmi, true )
fails with status: 1
2024:08:21-10:24:26:( 4456) |CCL_ERROR| atl_ofi.cpp:243 init: can't find suitable provider
2024:08:21-10:24:26:( 4456) |CCL_ERROR| atl_ofi_comm.cpp:278 init_transport: condition transport->init(nullptr, nullptr, &attr, nullptr, pmi) == ATL_STATUS_SUCCESS failed
failed to initialize ATL
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 267, in
engine = IPEXLLMAsyncLLMEngine.from_engine_args(engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 57, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 30, in init
super().init(args, kwargs)
File "/llm/vllm/vllm/engine/async_llm_engine.py", line 309, in init
self.engine = self._init_engine(*args, *kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/llm/vllm/vllm/engine/async_llm_engine.py", line 409, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/llm/vllm/vllm/engine/llm_engine.py", line 106, in init
self.model_executor = executor_class(model_config, cache_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 77, in init
self._init_cache()
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 249, in _init_cache
num_blocks = self._run_workers(
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/ipex_llm_gpu_executor.py", line 347, in _run_workers
driver_worker_output = getattr(self.driver_worker,
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/llm/vllm/vllm/worker/worker.py", line 136, in profile_num_available_blocks
self.model_runner.profile_run()
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/llm/vllm/vllm/worker/model_runner.py", line 645, in profile_run
self.execute_model(seqs, kv_caches)
File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/llm/vllm/vllm/worker/model_runner.py", line 569, in execute_model
lora_mapping) = self.prepare_input_tensors(seq_group_metadata_list)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/llm/vllm/vllm/worker/model_runner.py", line 538, in prepare_input_tensors
broadcast_tensor_dict(metadata_dict, src=0)
File "/llm/vllm/vllm/model_executor/parallel_utils/communication_op.py", line 175, in broadcast_tensor_dict
torch.distributed.broadcast_object_list([metadata_list],
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py", line 2603, in broadcast_object_list
broadcast(object_sizes_tensor, src=src, group=group)
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
work = default_pg.broadcast([tensor], opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: oneCCL: atl_ofi_comm.cpp:278 init_transport: EXCEPTION: failed to initialize ATL
(RayWorkerVllm pid=7746) 2024:08:21-10:24:26:( 7746) |CCL_ERROR| atl_ofi_helper.cpp:867 atl_ofi_get_prov_list: fi_getinfo error: ret -61, providers 0
(RayWorkerVllm pid=7746) 2024:08:21-10:24:26:( 7746) |CCL_ERROR| atl_ofi_helper.cpp:907 atl_ofi_get_prov_list: can't create providers for name shm
(RayWorkerVllm pid=7746) 2024:08:21-10:24:26:( 7746) |CCL_ERROR| atl_ofi_helper.cpp:1243 atl_ofi_open_nw_provs: atl_ofi_get_prov_list(ctx, prov_name, base_hints, &prov_list)
(RayWorkerVllm pid=7746) fails with status: 1
(RayWorkerVllm pid=7746) 2024:08:21-10:24:26:( 7746) |CCL_ERROR| atl_ofi_helper.cpp:1384 atl_ofi_open_nw_provs: can not open network providers
(RayWorkerVllm pid=7746) 2024:08:21-10:24:26:( 7746) |CCL_ERROR| atl_ofi.cpp:1036 open_providers: atl_ofi_open_nw_provs failed with status: 1
(RayWorkerVllm pid=7746) 2024:08:21-10:24:26:( 7746) |CCL_ERROR| atl_ofi.cpp:175 init: open_providers(prov_env, coord, attr, base_hints, open_nw_provs, fi_version, pmi, true )
(RayWorkerVllm pid=7746) fails with status: 1
(RayWorkerVllm pid=7746) 2024:08:21-10:24:26:( 7746) |CCL_ERROR| atl_ofi.cpp:243 init: can't find suitable provider
(RayWorkerVllm pid=7746) 2024:08:21-10:24:26:( 7746) |CCL_ERROR| atl_ofi_comm.cpp:278 init_transport: condition transport->init(nullptr, nullptr, &attr, nullptr, pmi) == ATL_STATUS_SUCCESS failed
(RayWorkerVllm pid=7746) failed to initialize ATL
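```

Before the CCL errors appear, Ray warns that `/dev/shm` holds only ~64 MB (67067904 bytes), so its object store falls back to `/tmp`; the same tiny `/dev/shm` can also starve the libfabric shared-memory provider that oneCCL fails to open further down. The warning itself names the fix for Docker. Below is a minimal sketch of the launch command, assuming the image from this issue; the `--device` mapping and mount paths are illustrative, not taken from the original report:

```bash
# Hypothetical 'docker run' sketch: the image tag and the /llm/models path
# come from this issue; --device and -v values are assumptions for your host.
# --shm-size follows the Ray warning's suggestion (more than 30% of RAM).
docker run -it \
  --device=/dev/dri \
  --shm-size=10.24gb \
  -v /path/to/models:/llm/models \
  -p 8000:8000 \
  intelanalytics/ipex-llm-serving-xpu:2.1.0b
```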
HW platform: Xeon W + 4x Arc workstation
Docker image: intelanalytics/ipex-llm-serving-xpu:2.1.0b

Serving start commands:
`cat start_Qwen1.5-14B-Chat_serving.sh`:

```bash
#!/bin/bash
model="/llm/models/Qwen1.5-14B-Chat"
served_model_name="Qwen1.5-14B-Chat"

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export TORCH_LLM_ALLREDUCE=0
export CCL_DG2_ALLREDUCE=1

# Tensor parallel related arguments:
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1

source /opt/intel/oneapi/setvars.sh
source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit fp8 \
  --max-model-len 2048 \
  --max-num-batched-tokens 4000 \
  --tensor-parallel-size 2
```
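The root failure is in oneCCL's OFI (libfabric) transport: with `FI_PROVIDER=shm` set, `fi_getinfo` returns `-61` and oneCCL reports "can't create providers for name shm", meaning libfabric cannot bring up its shared-memory provider inside the container; the near-empty `/dev/shm` noted above is a plausible cause. A quick way to see which providers libfabric actually exposes, assuming the standard `fi_info` utility from libfabric is present in the image:

```bash
# List every libfabric provider visible in this environment.
fi_info

# Request exactly the shm provider that FI_PROVIDER=shm asks oneCCL to use;
# if this fails, oneCCL's fi_getinfo will fail the same way (ret -61).
fi_info -p shm

# Verbose libfabric logs often say why a provider was rejected
# (e.g. insufficient shared memory).
FI_LOG_LEVEL=debug fi_info -p shm
```

If `shm` is missing or rejected, enlarging the container's `/dev/shm` (see the `--shm-size` note above) or temporarily unsetting `FI_PROVIDER` so libfabric can choose another provider are reasonable first experiments.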