stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

Error with VLLM #1136

Closed · didoll-john closed 3 months ago

didoll-john commented 3 months ago

I started vLLM via Docker with this command:

```bash
docker run --runtime nvidia --gpus all -d --restart always \
  -v ~/data/.cache/huggingface:/root/.cache/huggingface \
  -v /data/LLM_models/Qwen/Qwen2-72B-Instruct-GPTQ-Int4:/data/Qwen2-72B-Instruct-GPTQ-Int4 \
  -p 8000:8000 --ipc=host vllm/vllm-openai:latest \
  --served-model-name Qwen2-72B-Instruct-GPTQ-Int4 \
  --model /data/Qwen2-72B-Instruct-GPTQ-Int4 \
  --tensor-parallel-size 4
```

A curl test works fine:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2-72B-Instruct-GPTQ-Int4",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "give me the answer of 1+1"}
    ]
  }'
```

```json
{"id":"cmpl-e44f7f809f5b4ebc82eff1e96c55ad1b","object":"chat.completion","created":1718184574,"model":"Qwen2-72B-Instruct-GPTQ-Int4","choices":[{"index":0,"message":{"role":"assistant","content":"The answer to 1 + 1 is 2.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":28,"total_tokens":41,"completion_tokens":13}}
```

Then I tried to use it from DSPy:

```python
vllm_qwen = dspy.HFClientVLLM(model="Qwen2-72B-Instruct-GPTQ-Int4", port=8000, url="http://localhost")
```

and got this error:

```
Failed to parse JSON response: {"object":"error","message":"The model Qwen2-72B-Instruct-GPTQ-Int4 does not exist.","type":"NotFoundError","param":null,"code":404}
```

Then I tried the OpenAI client instead:

```python
vllm_qwen = dspy.OpenAI(model="Qwen2-72B-Instruct-GPTQ-Int4", api_base="http://localhost:8000/v1", api_key='EMPTY')
```

and got a different error:

```
openai.NotFoundError: Error code: 404 - {'detail': 'Not Found'}
```
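A sanity check worth running at this point (not part of the original report) is to ask the server which model IDs it actually registered; the `id` it returns is the string clients must pass as `model`. A minimal sketch, assuming the `openai>=1.0` Python package and the server started with the Docker command above:

```python
from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Each entry's .id is the name the server accepts in the "model" field,
# i.e. whatever was passed to --served-model-name (or the model path).
for model in client.models.list():
    print(model.id)
```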

tom-doerr commented 3 months ago

Could it make sense to look at the container logs? `docker logs <container_id>`

didoll-john commented 3 months ago

```
INFO 06-12 12:36:44 api_server.py:177] vLLM API server version 0.5.0
INFO 06-12 12:36:44 api_server.py:178] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/data/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=['Qwen2-72B-Instruct-GPTQ-Int4'], qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 06-12 12:36:44 gptq_marlin.py:133] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
2024-06-12 12:36:47,341 INFO worker.py:1753 -- Started a local Ray instance.
INFO 06-12 12:36:48 config.py:623] Defaulting to use mp for distributed inference
INFO 06-12 12:36:48 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='/data/Qwen2-72B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/data/Qwen2-72B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=Qwen2-72B-Instruct-GPTQ-Int4)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=4699) INFO 06-12 12:36:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=4701) INFO 06-12 12:36:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=4700) INFO 06-12 12:36:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=4700) INFO 06-12 12:36:59 utils.py:623] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=4701) INFO 06-12 12:36:59 utils.py:623] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=4700) INFO 06-12 12:36:59 pynccl.py:65] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=4701) INFO 06-12 12:36:59 pynccl.py:65] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=4699) INFO 06-12 12:36:59 utils.py:623] Found nccl from library libnccl.so.2
INFO 06-12 12:36:59 utils.py:623] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=4699) INFO 06-12 12:36:59 pynccl.py:65] vLLM is using nccl==2.20.5
INFO 06-12 12:36:59 pynccl.py:65] vLLM is using nccl==2.20.5
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/psm_e421e9cd'
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/psm_e421e9cd'
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
    cache[rtype].remove(name)
KeyError: '/psm_e421e9cd'
WARNING 06-12 12:37:00 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=4701) WARNING 06-12 12:37:00 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=4699) WARNING 06-12 12:37:00 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=4700) WARNING 06-12 12:37:00 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=4699) INFO 06-12 12:39:20 model_runner.py:159] Loading model weights took 9.7410 GB
INFO 06-12 12:39:20 model_runner.py:159] Loading model weights took 9.7410 GB
(VllmWorkerProcess pid=4701) INFO 06-12 12:39:20 model_runner.py:159] Loading model weights took 9.7410 GB
(VllmWorkerProcess pid=4700) INFO 06-12 12:39:20 model_runner.py:159] Loading model weights took 9.7410 GB
INFO 06-12 12:39:58 distributed_gpu_executor.py:56] # GPU blocks: 6049, # CPU blocks: 3276
(VllmWorkerProcess pid=4699) INFO 06-12 12:40:03 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=4699) INFO 06-12 12:40:03 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
(VllmWorkerProcess pid=4700) INFO 06-12 12:40:03 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=4700) INFO 06-12 12:40:03 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 06-12 12:40:03 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-12 12:40:03 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
(VllmWorkerProcess pid=4701) INFO 06-12 12:40:03 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=4701) INFO 06-12 12:40:03 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
(VllmWorkerProcess pid=4699) INFO 06-12 12:40:30 model_runner.py:954] Graph capturing finished in 27 secs.
(VllmWorkerProcess pid=4700) INFO 06-12 12:40:30 model_runner.py:954] Graph capturing finished in 27 secs.
(VllmWorkerProcess pid=4701) INFO 06-12 12:40:30 model_runner.py:954] Graph capturing finished in 26 secs.
INFO 06-12 12:40:30 model_runner.py:954] Graph capturing finished in 27 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-12 12:40:30 serving_chat.py:92] Using default chat template:
INFO 06-12 12:40:30 serving_chat.py:92] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 06-12 12:40:30 serving_chat.py:92] You are a helpful assistant.<|im_end|>
INFO 06-12 12:40:30 serving_chat.py:92] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 06-12 12:40:30 serving_chat.py:92] ' + message['content'] + '<|im_end|>' + '
INFO 06-12 12:40:30 serving_chat.py:92] '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
INFO 06-12 12:40:30 serving_chat.py:92] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-12 12:40:31 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 06-12 12:40:41 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 06-12 12:40:51 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
[... the same idle metrics line repeats every 10 seconds ...]
INFO 06-12 12:44:01 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
```

The log above is from when I use:

```python
vllm_qwen = dspy.HFClientVLLM(model="Qwen2-72B-Instruct-GPTQ-Int4", port=8000, url="http://localhost")
```

and get the error:

```
Failed to parse JSON response: {"object":"error","message":"The model Qwen2-72B-Instruct-GPTQ-Int4 does not exist.","type":"NotFoundError","param":null,"code":404}
```

You can see that there is no log entry for this request at all.
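One follow-up check (not from the thread itself): the 404 body has the shape of vLLM's own OpenAI-server error, so it can help to hit the plain completions endpoint directly with the same served name and see whether it resolves there. A minimal sketch, assuming the `openai>=1.0` Python package:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# If this call succeeds, the served name resolves on /v1/completions,
# which would point at how the DSPy client builds its request rather
# than at the vLLM server configuration.
completion = client.completions.create(
    model="Qwen2-72B-Instruct-GPTQ-Int4",
    prompt="1 + 1 =",
    max_tokens=8,
)
print(completion.choices[0].text)
```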

didoll-john commented 3 months ago

And I found that I can use it through the OpenAI API now. The key point:

```python
vllm_qwen = dspy.OpenAI(model="Qwen2-72B-Instruct-GPTQ-Int4", api_base="http://localhost:8000/v1/", api_key='EMPTY')
```

There must be a trailing "/" after "v1".
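For reference, a slightly fuller sketch of that working setup (the `Predict` signature is only an illustration; `dspy.OpenAI` and `dspy.settings.configure` are used as in the DSPy documentation of this period):

```python
import dspy

# Note the trailing slash after /v1: the request 404s without it.
vllm_qwen = dspy.OpenAI(
    model="Qwen2-72B-Instruct-GPTQ-Int4",
    api_base="http://localhost:8000/v1/",
    api_key="EMPTY",
)
dspy.settings.configure(lm=vllm_qwen)

# Quick smoke test with a throwaway signature.
qa = dspy.Predict("question -> answer")
print(qa(question="What is 1 + 1?").answer)
```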

fivejjs commented 3 months ago

I will try the OpenAI API with vLLM.

fivejjs commented 3 months ago

> And I found that I can use it through the OpenAI API now. The key point: `vllm_qwen = dspy.OpenAI(model="Qwen2-72B-Instruct-GPTQ-Int4", api_base="http://localhost:8000/v1/", api_key='EMPTY')`. There must be a trailing "/" after "v1".

The solution also worked for Hugging Face models, so it could be folded into the docs here: https://github.com/stanfordnlp/dspy/blob/804a9749a964db153bb418c3e4aff0419ad93235/docs/api/local_language_model_clients/vLLM.md#L4
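If the docs do get updated, a hedged sketch of what that snippet might look like for a generic Hugging Face model served by vLLM (the model name is a placeholder; the server-launch command is the standard vLLM OpenAI-compatible entrypoint, not something specific to DSPy):

```python
# Assumes a vLLM OpenAI-compatible server is already running, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model <hf-model-name> --port 8000
import dspy

lm = dspy.OpenAI(
    model="<hf-model-name>",               # the name the server was launched with,
                                           # or its --served-model-name alias
    api_base="http://localhost:8000/v1/",  # trailing slash, per this issue
    api_key="EMPTY",
)
dspy.settings.configure(lm=lm)
```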