Open · vincent-pli opened 5 months ago
Any updates on the fix?
Hi! @vincent-pli @shivam-dubey-1 @anyscalesam I checked this with vllm==0.6.3.post1 and found issues as well, though of a different kind:
INFO 2024-11-18 01:46:34,210 default_VLLMDeployment ujj4diu5 llm_vllm_llama_31.py:47 - Starting with engine args: AsyncEngineArgs(model='NousResearch/Meta-Llama-3-8B-Instruct', served_model_name=None, tokenizer='NousResearch/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, seed=0, max_model_len=None, worker_use_ray=True, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, quantization=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256, long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, guided_decoding_backend='outlines', speculative_model=None, speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None, disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, mm_processor_kwargs=None, scheduling_policy='fcfs', disable_log_requests=False)
INFO 2024-11-18 01:47:09,627 default_VLLMDeployment ujj4diu5 replica.py:915 - Finished initializing replica.
INFO 2024-11-18 01:47:23,213 default_VLLMDeployment ujj4diu5 replica.py:1108 - Started executing request to method '__call__'.
INFO 2024-11-18 01:47:23,220 default_VLLMDeployment ujj4diu5 06b27f50-9747-4bc4-8b29-b8b757380e53 /v1/chat/completions llm_vllm_llama_31.py:68 - Model config: <vllm.config.ModelConfig object at 0x7f15ac887150>
INFO 2024-11-18 01:47:23,220 default_VLLMDeployment ujj4diu5 06b27f50-9747-4bc4-8b29-b8b757380e53 /v1/chat/completions llm_vllm_llama_31.py:74 - Served model names: ['NousResearch/Meta-Llama-3-8B-Instruct']
INFO 2024-11-18 01:47:23,220 default_VLLMDeployment ujj4diu5 06b27f50-9747-4bc4-8b29-b8b757380e53 /v1/chat/completions llm_vllm_llama_31.py:85 - Request: messages=[{'content': 'Hello!', 'role': 'user'}] model='NousResearch/Meta-Llama-3-8B-Instruct' frequency_penalty=0.0 logit_bias=None logprobs=False top_logprobs=0 max_tokens=None n=1 presence_penalty=0.0 response_format=None seed=None stop=[] stream=False stream_options=None temperature=0.7 top_p=1.0 tools=None tool_choice='none' parallel_tool_calls=False user=None best_of=None use_beam_search=False top_k=-1 min_p=0.0 repetition_penalty=1.0 length_penalty=1.0 stop_token_ids=[] include_stop_str_in_output=False ignore_eos=False min_tokens=0 skip_special_tokens=True spaces_between_special_tokens=True truncate_prompt_tokens=None prompt_logprobs=None echo=False add_generation_prompt=True continue_final_message=False add_special_tokens=False documents=None chat_template=None chat_template_kwargs=None guided_json=None guided_regex=None guided_choice=None guided_grammar=None guided_decoding_backend=None guided_whitespace_pattern=None priority=0
ERROR 2024-11-18 01:47:23,225 default_VLLMDeployment ujj4diu5 06b27f50-9747-4bc4-8b29-b8b757380e53 /v1/chat/completions replica.py:362 - Request failed:
ray::ServeReplica:default:VLLMDeployment.handle_request_with_rejection() (pid=1383, ip=172.19.0.2)
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/utils.py", line 168, in wrap_to_ray_error
raise exception
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 1148, in call_user_method
await self._call_func_or_gen(
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 872, in _call_func_or_gen
result = await result
^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/serve/_private/http_util.py", line 459, in __call__
await self._asgi_app(
File "/home/ray/anaconda3/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/home/ray/anaconda3/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/ray/anaconda3/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/home/ray/anaconda3/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/home/ray/anaconda3/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/ray/anaconda3/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/ray/anaconda3/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/ray/anaconda3/lib/python3.11/site-packages/starlette/routing.py", line 758, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/ray/anaconda3/lib/python3.11/site-packages/starlette/routing.py", line 778, in app
await route.handle(scope, receive, send)
File "/home/ray/anaconda3/lib/python3.11/site-packages/starlette/routing.py", line 299, in handle
await self.app(scope, receive, send)
File "/home/ray/anaconda3/lib/python3.11/site-packages/starlette/routing.py", line 79, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/ray/anaconda3/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/home/ray/anaconda3/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/home/ray/anaconda3/lib/python3.11/site-packages/starlette/routing.py", line 74, in app
response = await func(request)
Description
Make the vLLM example work with the latest vLLM version (v0.4.3). Following the current example from https://docs.ray.io/en/master/serve/tutorials/vllm-example.html, I got an exception,
caused by missing parameters in: https://github.com/ray-project/ray/blob/c4a87ee474041ab7286a41378f3f6db904e0e3c5/doc/source/serve/doc_code/vllm_openai_example.py#L53
OpenAIServingChat now requires a ModelConfig as its second parameter (see the sketch below). I will make a PR to fix it later.
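For reference, a minimal sketch of the kind of change the doc example would need, assuming a vLLM ~0.4.3 signature of roughly OpenAIServingChat(engine, model_config, served_model_names, response_role, lora_modules, chat_template); the exact argument list varies between vLLM releases, so treat it as an assumption rather than the documented API:

```python
# Sketch only: build OpenAIServingChat lazily, fetching the ModelConfig from the
# engine first. The constructor argument order below is assumed from vLLM ~0.4.x
# and should be checked against the installed vLLM version.
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat


class VLLMDeployment:
    def __init__(self, engine_args: AsyncEngineArgs, response_role: str = "assistant"):
        self.engine_args = engine_args
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)
        self.response_role = response_role
        self.openai_serving_chat = None

    async def _get_serving_chat(self) -> OpenAIServingChat:
        if self.openai_serving_chat is None:
            # AsyncLLMEngine.get_model_config() is async and returns the ModelConfig
            # that the doc example currently fails to pass.
            model_config = await self.engine.get_model_config()
            served_model_names = (
                self.engine_args.served_model_name
                if self.engine_args.served_model_name is not None
                else [self.engine_args.model]
            )
            self.openai_serving_chat = OpenAIServingChat(
                self.engine,
                model_config,        # the missing second parameter
                served_model_names,
                self.response_role,
                None,                # lora_modules
                None,                # chat_template
            )
        return self.openai_serving_chat
```

The request handler in the example would then call `await self._get_serving_chat()` before invoking `create_chat_completion`, instead of constructing `OpenAIServingChat` without the model config.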
Link
https://docs.ray.io/en/master/serve/tutorials/vllm-example.html