Open · itsdaniele opened this issue 2 months ago
Do you see any stats from the engine? You should see something like:
Speculative metrics: Draft acceptance rate: 0.607, System efficiency: 0.510, Number of speculative tokens: 4, Number of accepted tokens: 32594, Number of draft tokens: 53716, Number of emitted tokens: 34244.
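For reference, those counters relate to each other in a simple way; here is a quick sanity check against that log line (this assumes the usual definitions, i.e. acceptance rate = accepted / draft tokens and system efficiency = emitted tokens relative to the maximum of k + 1 tokens per scoring step):

# Numbers copied from the log line above; k is the number of speculative tokens.
k = 4
accepted = 32594
draft = 53716
emitted = 34244

steps = draft / k                    # scoring steps the engine ran
max_emittable = steps * (k + 1)      # k draft slots plus 1 bonus token per step

print(f"draft acceptance rate ~= {accepted / draft:.3f}")        # ~0.607
print(f"system efficiency    ~= {emitted / max_emittable:.3f}")  # ~0.510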
Sorry, I should have been clearer: I am trying to run this with offline inference, as in the spec decoding tutorial. My code is the following:
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="facebook/opt-6.7b",
    tensor_parallel_size=1,
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
And I can't seem to find the acceptance rate anywhere in the output.
I have also tried running the OpenAI API example from the tutorial and then checking the /metrics endpoint, but I don't see the acceptance rate there either.
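For reference, this is roughly how I checked the endpoint (the substring filter is only a guess at the metric names, which may vary across vLLM versions):

import urllib.request

# Assumes an OpenAI-compatible vLLM server listening locally on port 8000.
with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    metrics_text = resp.read().decode()

# Print only the lines that look spec-decode related; match loosely on a
# substring since the exact metric names depend on the vLLM version.
for line in metrics_text.splitlines():
    if "spec_decode" in line:
        print(line)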
The acceptance rate stats are printed every 5 seconds; try this:
#!/usr/bin/env python3
import time

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="facebook/opt-6.7b",
    tensor_parallel_size=1,
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
    disable_log_stats=False,
)

outputs = llm.generate(prompts, sampling_params)

# Stats are logged roughly every 5 seconds, so wait before the next
# generate call to give the speculative metrics a chance to print.
time.sleep(5)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
This can obviously be improved (maybe a flag to customize the interval?). LMK if you are interested in this and I can give you code pointers.
Thank you! This worked great.
Thanks, would be great to have some pointers.
The metrics are currently cumulative over the lifetime of the server.
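So if you want the rate over a particular window rather than since startup, you need to difference two snapshots yourself. A minimal sketch of the idea (the snapshot dicts below are placeholders for values scraped from the printed stats or /metrics, not a vLLM API):

# Hypothetical cumulative counters captured at two points in time.
snapshot_t1 = {"accepted": 10_000, "draft": 18_000}
snapshot_t2 = {"accepted": 32_594, "draft": 53_716}

# The counters only ever grow, so the per-window acceptance rate is the
# ratio of the deltas, not of the raw totals.
accepted_delta = snapshot_t2["accepted"] - snapshot_t1["accepted"]
draft_delta = snapshot_t2["draft"] - snapshot_t1["draft"]
print(f"acceptance rate over the window: {accepted_delta / draft_delta:.3f}")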
@cadedaniel I still can't see the acceptance rate metrics, even after sleeping 5 s or more. Here is my code:
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="/root/autodl-tmp/Qwen-7B",
    tensor_parallel_size=1,
    speculative_model="/root/autodl-tmp/Qwen-1_8B",
    num_speculative_tokens=1,
    use_v2_block_manager=True,
    disable_log_stats=False,
    trust_remote_code=True,
    max_model_len=2048,
)

outputs = llm.generate(prompts, sampling_params)

import time
time.sleep(50)
print("after 50s later \n")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
and here is the CLI output:
INFO 08-19 15:23:12 config.py:1450] Downcasting torch.float32 to torch.float16.
INFO 08-19 15:23:12 config.py:1450] Downcasting torch.float32 to torch.float16.
INFO 08-19 15:23:12 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/root/autodl-tmp/Qwen-7B', speculative_config=SpeculativeConfig(draft_model='/root/autodl-tmp/Qwen-1_8B', num_spec_tokens=1), tokenizer='/root/autodl-tmp/Qwen-7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/root/autodl-tmp/Qwen-7B, use_v2_block_manager=True, enable_prefix_caching=False)
/root/miniconda3/envs/llama_index/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
WARNING 08-19 15:23:12 tokenizer.py:129] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 08-19 15:23:12 spec_decode_worker.py:156] Configuring SpecDecodeWorker with proposer=<class 'vllm.spec_decode.multi_step_worker.MultiStepWorker'>
INFO 08-19 15:23:12 spec_decode_worker.py:170] Configuring SpecDecodeWorker with sampler=<class 'vllm.model_executor.layers.rejection_sampler.RejectionSampler'>
INFO 08-19 15:23:13 model_runner.py:720] Starting to load model /root/autodl-tmp/Qwen-7B...
Loading safetensors checkpoint shards: 0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 12% Completed | 1/8 [00:00<00:03, 1.90it/s]
Loading safetensors checkpoint shards: 25% Completed | 2/8 [00:01<00:03, 1.72it/s]
Loading safetensors checkpoint shards: 38% Completed | 3/8 [00:01<00:02, 1.68it/s]
Loading safetensors checkpoint shards: 50% Completed | 4/8 [00:02<00:02, 1.68it/s]
Loading safetensors checkpoint shards: 62% Completed | 5/8 [00:02<00:01, 1.66it/s]
Loading safetensors checkpoint shards: 75% Completed | 6/8 [00:03<00:01, 1.67it/s]
Loading safetensors checkpoint shards: 88% Completed | 7/8 [00:03<00:00, 1.86it/s]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:04<00:00, 1.83it/s]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:04<00:00, 1.76it/s]
INFO 08-19 15:23:18 model_runner.py:732] Loading model weights took 14.3919 GB
INFO 08-19 15:23:18 model_runner.py:720] Starting to load model /root/autodl-tmp/Qwen-1_8B...
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 2.74it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.67it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.68it/s]
INFO 08-19 15:23:18 model_runner.py:732] Loading model weights took 3.4594 GB
INFO 08-19 15:23:19 gpu_executor.py:102] # GPU blocks: 190, # CPU blocks: 512
INFO 08-19 15:23:21 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-19 15:23:21 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-19 15:23:31 model_runner.py:1225] Graph capturing finished in 10 secs.
INFO 08-19 15:23:31 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-19 15:23:31 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-19 15:23:40 model_runner.py:1225] Graph capturing finished in 8 secs.
Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
WARNING 08-19 15:23:40 multi_step.py:57] Prompt logprob is not supported by multi step workers. (e.g., speculative decode uses multi step workers).
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 2.05it/s, est. speed input: 10.28 toks/s, output: 32.88 toks/s]
after 50s later
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 3.73it/s, est. speed input: 18.67 toks/s, output: 59.75 toks/s]
Prompt: 'The future of AI is', Generated text: ' exciting, and I am thrilled to be a part of this journey. With the'
Can you add a print here to verify that the acceptance rate metrics are being collected?
They should be printed here: https://github.com/vllm-project/vllm/blob/c6af027a35b657b20ec60adac77cb75264b65a98/vllm/engine/metrics.py#L386-L392
Thanks, I have solved this problem just by upgrading vLLM to the latest version.
I have been running the scripts from https://docs.vllm.ai/en/latest/models/spec_decode.html on how to do speculative decoding with vLLM.
However, it seems that the acceptance rate is not shown or output anywhere. Is there any way of computing or accessing it?