Open githebs opened 2 months ago
From my findings, it happens when the line below is running on the same port as the vLLM API server: then it doesn't work, but if it's a different port, it works.
DEBUG 08-14 10:37:47 parallel_state.py:803] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.130.32.49:8001 backend=nccl
It might be related to https://github.com/vllm-project/vllm/issues/7196, but nothing is certain for now.
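To make that concrete, a sketch with made-up ports and the quickstart model (the advice further down is to not set VLLM_PORT at all):

```bash
# Collides: vLLM's internal distributed init and the API server both end up on 8000
VLLM_PORT=8000 vllm serve facebook/opt-125m --port 8000

# Does not collide: the internal port differs from the API server port
VLLM_PORT=8001 vllm serve facebook/opt-125m --port 8000
```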
Don't set the value of VLLM_PORT; it's used internally by vLLM for distributed communication. If you want to set the port used by the API server, pass --port $MY_PORT to the launch command.
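For instance (MY_PORT is just an illustrative name, as in the advice above):

```bash
# Leave VLLM_PORT alone; vLLM uses it internally for its distributed setup
unset VLLM_PORT

# Select the API server port explicitly instead
MY_PORT=8000
vllm serve facebook/opt-125m --port "$MY_PORT"
```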
@githebs see https://docs.vllm.ai/en/latest/serving/env_vars.html :
Please note that VLLM_PORT and VLLM_HOST_IP set the port and ip for vLLM’s internal usage. It is not the port and ip for the API server. If you use --host $VLLM_HOST_IP and --port $VLLM_PORT to start the API server, it will not work.
All environment variables used by vLLM are prefixed with VLLM_. Special care should be taken for Kubernetes users: please do not name the service as vllm, otherwise environment variables set by Kubernetes might conflict with vLLM’s environment variables, because Kubernetes sets environment variables for each service with the capitalized service name as the prefix.
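As a concrete illustration of that Kubernetes pitfall (service name behaviour is standard, the IP is made up): for a Service named vllm exposing port 8000, Kubernetes injects docker-link-style variables into pods in the same namespace, and the injected VLLM_PORT is not even a plain port number:

```bash
# Variables Kubernetes injects for a Service named "vllm" (ClusterIP hypothetical)
VLLM_SERVICE_HOST=10.0.0.11
VLLM_SERVICE_PORT=8000
VLLM_PORT=tcp://10.0.0.11:8000        # shadows vLLM's own VLLM_PORT setting
VLLM_PORT_8000_TCP=tcp://10.0.0.11:8000
VLLM_PORT_8000_TCP_ADDR=10.0.0.11
VLLM_PORT_8000_TCP_PORT=8000
VLLM_PORT_8000_TCP_PROTO=tcp
```

Renaming the service, or setting enableServiceLinks: false in the pod spec, should avoid the injection.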
On vLLM 0.6.2 I encountered the same issue, and it only occurs with pipeline-parallel deployment. With tensor-parallel deployment, this problem does not happen.
Here is my startup command:
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 7808 --model /mnt/home/Qwen1.5_32B_Chat --trust-remote-code --served-model-name Qwen --gpu-memory-utilization 0.9 --pipeline-parallel-size 2 --enforce-eager --max-model-len 8192
The error message is as follows: ERROR: [Errno 98] error while attempting to bind on address ('0.0.0.0', 7808): address already in use.
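When that happens, it may help to check what is still holding the port before relaunching; a minimal sketch (assumes ss from iproute2 is installed, port taken from the command above):

```bash
# Which process, if any, is still listening on 7808?
ss -ltnp '( sport = :7808 )'

# Any leftover vLLM engine/worker processes from the previous run?
ps aux | grep -E 'vllm|api_server' | grep -v grep
```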
I have the same issue with a crashing vLLM process (the one from the quickstart example), which I wrapped in a while true loop in a run.sh script:
#!/bin/bash
while true; do
echo "Starting vLLM..."
vllm serve facebook/opt-125m --port 19000 --device=cpu
done
It throws the error ERROR: [Errno 98] error while attempting to bind on address several times, until it's gone and vLLM starts properly (a more patient restart loop is sketched after the log below):
./run.sh
Starting vLLM...
WARNING 10-10 10:22:05 _custom_ops.py:18] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
INFO 10-10 10:22:07 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-10 10:22:07 api_server.py:527] args: Namespace(model_tag='facebook/opt-125m', config='', host=None, port=19000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='facebook/opt-125m', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='cpu', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=<function serve at 0x7fc7b7743100>)
INFO 10-10 10:22:07 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/41c04ac0-ca34-45d5-af92-a0f8017f943a for IPC Path.
INFO 10-10 10:22:07 api_server.py:177] Started engine process with PID 32638
WARNING 10-10 10:22:08 config.py:376] Async output processing is only supported for CUDA or TPU. Disabling it for other platforms.
WARNING 10-10 10:22:10 _custom_ops.py:18] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
WARNING 10-10 10:22:13 config.py:376] Async output processing is only supported for CUDA or TPU. Disabling it for other platforms.
INFO 10-10 10:22:13 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 10-10 10:22:13 cpu_executor.py:325] float16 is not supported on CPU, casting to bfloat16.
WARNING 10-10 10:22:13 cpu_executor.py:328] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 10-10 10:22:13 cpu_executor.py:354] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 10-10 10:22:13 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-10 10:22:13 selector.py:116] Using XFormers backend.
/home/users/zoobab/mnt/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/users/zoobab/mnt/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-10 10:22:14 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-10 10:22:14 selector.py:116] Using XFormers backend.
INFO 10-10 10:22:14 weight_utils.py:242] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
/home/users/zoobab/mnt/vllm/model_executor/model_loader/weight_utils.py:424: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state = torch.load(bin_file, map_location="cpu")
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.70it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.69it/s]
INFO 10-10 10:22:14 cpu_executor.py:212] # CPU blocks: 7281
INFO 10-10 10:22:15 api_server.py:230] vLLM to use /tmp/tmpr4gzqks_ as PROMETHEUS_MULTIPROC_DIR
WARNING 10-10 10:22:15 serving_embedding.py:189] embedding_mode is False. Embedding API will not work.
INFO 10-10 10:22:15 launcher.py:19] Available routes are:
INFO 10-10 10:22:15 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 10-10 10:22:15 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 10-10 10:22:15 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 10-10 10:22:15 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 10-10 10:22:15 launcher.py:27] Route: /health, Methods: GET
INFO 10-10 10:22:15 launcher.py:27] Route: /tokenize, Methods: POST
INFO 10-10 10:22:15 launcher.py:27] Route: /detokenize, Methods: POST
INFO 10-10 10:22:15 launcher.py:27] Route: /v1/models, Methods: GET
INFO 10-10 10:22:15 launcher.py:27] Route: /version, Methods: GET
INFO 10-10 10:22:15 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 10-10 10:22:15 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 10-10 10:22:15 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO: Started server process [32463]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:19000 (Press CTRL+C to quit)
INFO 10-10 10:22:21 logger.py:36] Received request cmpl-7d8a9231fc4745b88b8883f111ef1275-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [2, 16033, 2659, 16, 10], lora_request: None, prompt_adapter_request: None.
INFO 10-10 10:22:21 engine.py:288] Added request cmpl-7d8a9231fc4745b88b8883f111ef1275-0.
ERROR 10-10 10:22:21 engine.py:157] TypeError("XFormersMetadata.__init__() got an unexpected keyword argument 'is_prompt'")
ERROR 10-10 10:22:21 engine.py:157] Traceback (most recent call last):
ERROR 10-10 10:22:21 engine.py:157] File "/home/users/zoobab/mnt/vllm/engine/multiprocessing/engine.py", line 155, in start
ERROR 10-10 10:22:21 engine.py:157] self.run_engine_loop()
ERROR 10-10 10:22:21 engine.py:157] File "/home/users/zoobab/mnt/vllm/engine/multiprocessing/engine.py", line 218, in run_engine_loop
ERROR 10-10 10:22:21 engine.py:157] request_outputs = self.engine_step()
ERROR 10-10 10:22:21 engine.py:157] ^^^^^^^^^^^^^^^^^^
ERROR 10-10 10:22:21 engine.py:157] File "/home/users/zoobab/mnt/vllm/engine/multiprocessing/engine.py", line 236, in engine_step
ERROR 10-10 10:22:21 engine.py:157] raise e
ERROR 10-10 10:22:21 engine.py:157] File "/home/users/zoobab/mnt/vllm/engine/multiprocessing/engine.py", line 227, in engine_step
ERROR 10-10 10:22:21 engine.py:157] return self.engine.step()
ERROR 10-10 10:22:21 engine.py:157] ^^^^^^^^^^^^^^^^^^
ERROR 10-10 10:22:21 engine.py:157] File "/home/users/zoobab/mnt/vllm/engine/llm_engine.py", line 1264, in step
ERROR 10-10 10:22:21 engine.py:157] outputs = self.model_executor.execute_model(
ERROR 10-10 10:22:21 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-10 10:22:21 engine.py:157] File "/home/users/zoobab/mnt/vllm/executor/cpu_executor.py", line 227, in execute_model
ERROR 10-10 10:22:21 engine.py:157] output = self.driver_method_invoker(self.driver_worker,
ERROR 10-10 10:22:21 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-10 10:22:21 engine.py:157] File "/home/users/zoobab/mnt/vllm/executor/cpu_executor.py", line 377, in _driver_method_invoker
ERROR 10-10 10:22:21 engine.py:157] return getattr(driver, method)(*args, **kwargs)
ERROR 10-10 10:22:21 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-10 10:22:21 engine.py:157] File "/home/users/zoobab/mnt/vllm/worker/worker_base.py", line 303, in execute_model
ERROR 10-10 10:22:21 engine.py:157] inputs = self.prepare_input(execute_model_req)
ERROR 10-10 10:22:21 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-10 10:22:21 engine.py:157] File "/home/users/zoobab/mnt/vllm/worker/worker_base.py", line 291, in prepare_input
ERROR 10-10 10:22:21 engine.py:157] return self._get_driver_input_and_broadcast(execute_model_req)
ERROR 10-10 10:22:21 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-10 10:22:21 engine.py:157] File "/home/users/zoobab/mnt/vllm/worker/worker_base.py", line 253, in _get_driver_input_and_broadcast
ERROR 10-10 10:22:21 engine.py:157] self.model_runner.prepare_model_input(
ERROR 10-10 10:22:21 engine.py:157] File "/home/users/zoobab/mnt/vllm/worker/cpu_model_runner.py", line 494, in prepare_model_input
ERROR 10-10 10:22:21 engine.py:157] model_input = self._prepare_model_input_tensors(
ERROR 10-10 10:22:21 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-10 10:22:21 engine.py:157] File "/home/users/zoobab/mnt/vllm/worker/cpu_model_runner.py", line 482, in _prepare_model_input_tensors
ERROR 10-10 10:22:21 engine.py:157] return builder.build() # type: ignore
ERROR 10-10 10:22:21 engine.py:157] ^^^^^^^^^^^^^^^
ERROR 10-10 10:22:21 engine.py:157] File "/home/users/zoobab/mnt/vllm/worker/cpu_model_runner.py", line 130, in build
ERROR 10-10 10:22:21 engine.py:157] multi_modal_kwargs) = self._prepare_prompt(
ERROR 10-10 10:22:21 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^
ERROR 10-10 10:22:21 engine.py:157] File "/home/users/zoobab/mnt/vllm/worker/cpu_model_runner.py", line 265, in _prepare_prompt
ERROR 10-10 10:22:21 engine.py:157] attn_metadata = self.attn_backend.make_metadata(
ERROR 10-10 10:22:21 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-10 10:22:21 engine.py:157] File "/home/users/zoobab/mnt/vllm/attention/backends/abstract.py", line 47, in make_metadata
ERROR 10-10 10:22:21 engine.py:157] return cls.get_metadata_cls()(*args, **kwargs)
ERROR 10-10 10:22:21 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-10 10:22:21 engine.py:157] TypeError: XFormersMetadata.__init__() got an unexpected keyword argument 'is_prompt'
INFO: 127.0.0.1:40978 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/users/zoobab/mnt/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
result = await app( # type: ignore[func-returns-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/zoobab/mnt/uvicorn/middleware/proxy_headers.py", line 60, in __call__
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/zoobab/mnt/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/home/users/zoobab/mnt/starlette/applications.py", line 113, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/users/zoobab/mnt/starlette/middleware/errors.py", line 187, in __call__
raise exc
File "/home/users/zoobab/mnt/starlette/middleware/errors.py", line 165, in __call__
await self.app(scope, receive, _send)
File "/home/users/zoobab/mnt/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/home/users/zoobab/mnt/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/users/zoobab/mnt/starlette/_exception_handler.py", line 62, in wrapped_app
raise exc
File "/home/users/zoobab/mnt/starlette/_exception_handler.py", line 51, in wrapped_app
await app(scope, receive, sender)
File "/home/users/zoobab/mnt/starlette/routing.py", line 715, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/users/zoobab/mnt/starlette/routing.py", line 735, in app
await route.handle(scope, receive, send)
File "/home/users/zoobab/mnt/starlette/routing.py", line 288, in handle
await self.app(scope, receive, send)
File "/home/users/zoobab/mnt/starlette/routing.py", line 76, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/users/zoobab/mnt/starlette/_exception_handler.py", line 62, in wrapped_app
raise exc
File "/home/users/zoobab/mnt/starlette/_exception_handler.py", line 51, in wrapped_app
await app(scope, receive, sender)
File "/home/users/zoobab/mnt/starlette/routing.py", line 73, in app
response = await f(request)
^^^^^^^^^^^^^^^^
File "/home/users/zoobab/mnt/fastapi/routing.py", line 301, in app
raw_response = await run_endpoint_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/zoobab/mnt/fastapi/routing.py", line 212, in run_endpoint_function
return await dependant.call(**values)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/zoobab/mnt/vllm/entrypoints/openai/api_server.py", line 328, in create_completion
generator = await completion(raw_request).create_completion(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/users/zoobab/mnt/vllm/entrypoints/openai/serving_completion.py", line 187, in create_completion
async for i, res in result_generator:
File "/home/users/zoobab/mnt/vllm/utils.py", line 490, in merge_async_iterators
item = await d
^^^^^^^
File "/home/users/zoobab/mnt/vllm/engine/multiprocessing/client.py", line 486, in _process_request
raise request_output
TypeError: XFormersMetadata.__init__() got an unexpected keyword argument 'is_prompt'
CRITICAL 10-10 10:22:22 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 127.0.0.1:41012 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [32463]
Starting vLLM...
WARNING 10-10 10:22:26 _custom_ops.py:18] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
INFO 10-10 10:22:29 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-10 10:22:29 api_server.py:527] args: Namespace(model_tag='facebook/opt-125m', config='', host=None, port=19000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='facebook/opt-125m', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='cpu', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=<function serve at 0x7f7d7d1ef100>)
Traceback (most recent call last):
File "/home/users/zoobab/mnt/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/users/zoobab/mnt/vllm/scripts.py", line 165, in main
args.dispatch_function(args)
File "/home/users/zoobab/mnt/vllm/scripts.py", line 37, in serve
uvloop.run(run_server(args))
File "/home/users/zoobab/mnt/uvloop/__init__.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/applis/12401-icpyd-00/miniconda3/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/home/users/zoobab/mnt/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/home/users/zoobab/mnt/vllm/entrypoints/openai/api_server.py", line 530, in run_server
temp_socket.bind(("", args.port))
OSError: [Errno 98] Address already in use
[... the same startup attempt and OSError: [Errno 98] Address already in use traceback repeat seven more times between 10:22:33 and 10:23:13 ...]
Starting vLLM...
WARNING 10-10 10:23:16 _custom_ops.py:18] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
INFO 10-10 10:23:19 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
INFO 10-10 10:23:19 api_server.py:527] args: Namespace(model_tag='facebook/opt-125m', config='', host=None, port=19000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='facebook/opt-125m', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='cpu', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=<function serve at 0x7fc744e53100>)
INFO 10-10 10:23:19 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/f8559388-9727-4850-b6d0-b6531b9d708c for IPC Path.
INFO 10-10 10:23:19 api_server.py:177] Started engine process with PID 6987
WARNING 10-10 10:23:20 config.py:376] Async output processing is only supported for CUDA or TPU. Disabling it for other platforms.
WARNING 10-10 10:23:21 _custom_ops.py:18] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
WARNING 10-10 10:23:25 config.py:376] Async output processing is only supported for CUDA or TPU. Disabling it for other platforms.
INFO 10-10 10:23:25 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 10-10 10:23:25 cpu_executor.py:325] float16 is not supported on CPU, casting to bfloat16.
WARNING 10-10 10:23:25 cpu_executor.py:328] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 10-10 10:23:25 cpu_executor.py:354] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 10-10 10:23:25 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-10 10:23:25 selector.py:116] Using XFormers backend.
/home/users/zoobab/mnt/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/users/zoobab/mnt/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-10 10:23:25 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-10 10:23:25 selector.py:116] Using XFormers backend.
INFO 10-10 10:23:26 weight_utils.py:242] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
/home/users/zoobab/mnt/vllm/model_executor/model_loader/weight_utils.py:424: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state = torch.load(bin_file, map_location="cpu")
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.99it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.99it/s]
INFO 10-10 10:23:26 cpu_executor.py:212] # CPU blocks: 7281
INFO 10-10 10:23:27 api_server.py:230] vLLM to use /tmp/tmpj8q6uijh as PROMETHEUS_MULTIPROC_DIR
WARNING 10-10 10:23:27 serving_embedding.py:189] embedding_mode is False. Embedding API will not work.
INFO 10-10 10:23:27 launcher.py:19] Available routes are:
INFO 10-10 10:23:27 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 10-10 10:23:27 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 10-10 10:23:27 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 10-10 10:23:27 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 10-10 10:23:27 launcher.py:27] Route: /health, Methods: GET
INFO 10-10 10:23:27 launcher.py:27] Route: /tokenize, Methods: POST
INFO 10-10 10:23:27 launcher.py:27] Route: /detokenize, Methods: POST
INFO 10-10 10:23:27 launcher.py:27] Route: /v1/models, Methods: GET
INFO 10-10 10:23:27 launcher.py:27] Route: /version, Methods: GET
INFO 10-10 10:23:27 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 10-10 10:23:27 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 10-10 10:23:27 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO: Started server process [6593]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:19000 (Press CTRL+C to quit)
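The failed rebinds above span roughly a minute before the port frees up, which looks like the old server's socket lingering (for example in TIME_WAIT) after the engine crash. A hedged variant of the run.sh above that waits for port 19000 to clear before retrying (assumes ss is available):

```bash
#!/bin/bash
PORT=19000
while true; do
    # Wait until no TCP socket (LISTEN, TIME_WAIT, ...) is using the port anymore
    while ss -tan "( sport = :$PORT )" | grep -q ":$PORT"; do
        echo "Port $PORT still busy, waiting..."
        sleep 5
    done
    echo "Starting vLLM..."
    vllm serve facebook/opt-125m --port "$PORT" --device=cpu
done
```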
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
Hello,
In a container environment I can launch vLLM with no issues, but if I stop and relaunch the pod I get the address already in use error.
It doesn't make sense, since the pods are isolated in the cluster; I even made sure that port 8000 is used by nothing else across the cluster, even though that shouldn't matter. The best part is that if I wait some time, like half an hour, it works again. This only occurs with vLLM and not with other services or inference backends. Any hints?
Launch commands:
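One way to narrow this down from inside the pod, right after a failed relaunch, is to check whether port 8000 is held by a live process or only by lingering sockets, and whether Kubernetes injected any VLLM_* variables (see the service-naming note above); a minimal sketch, assuming ss is available in the image:

```bash
# Anything still bound to 8000, and in what state (LISTEN, TIME_WAIT, ...)?
ss -tanp '( sport = :8000 )'

# Did Kubernetes inject VLLM_* variables (e.g. from a Service named "vllm")?
env | grep '^VLLM_'
```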