Open · itsdaniele opened this issue 2 months ago
Do you see any stats from the engine? You should see something like:
Speculative metrics: Draft acceptance rate: 0.607, System efficiency: 0.510, Number of speculative tokens: 4, Number of accepted tokens: 32594, Number of draft tokens: 53716, Number of emitted tokens: 34244.
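For reference, those counters relate to each other in a simple way; here is a quick sanity check against that log line (this assumes the usual definitions, i.e. acceptance rate = accepted / draft tokens and system efficiency = emitted tokens relative to the maximum of k + 1 tokens per scoring step):

# Numbers copied from the log line above; k is the number of speculative tokens.
k = 4
accepted = 32594
draft = 53716
emitted = 34244

steps = draft / k                    # scoring steps the engine ran
max_emittable = steps * (k + 1)      # k draft slots plus 1 bonus token per step

print(f"draft acceptance rate ~= {accepted / draft:.3f}")        # ~0.607
print(f"system efficiency    ~= {emitted / max_emittable:.3f}")  # ~0.510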
Sorry, I should have been clearer: I am trying to run this with offline inference, as in the spec decoding tutorial. My code is the following:
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="facebook/opt-6.7b",
    tensor_parallel_size=1,
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
And I can't seem to find the acceptance rate anywhere in the output.
I have also tried running the OpenAI API example from the tutorial and then checking the /metrics endpoint, but I don't see the acceptance rate there either.
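For reference, this is roughly how I checked the endpoint (the substring filter is only a guess at the metric names, which may vary across vLLM versions):

import urllib.request

# Assumes an OpenAI-compatible vLLM server listening locally on port 8000.
with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    metrics_text = resp.read().decode()

# Print only the lines that look spec-decode related; match loosely on a
# substring since the exact metric names depend on the vLLM version.
for line in metrics_text.splitlines():
    if "spec_decode" in line:
        print(line)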
The acceptance rate stats are printed every 5 seconds; try this:
#!/usr/bin/env python3
import time

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="facebook/opt-6.7b",
    tensor_parallel_size=1,
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
    disable_log_stats=False,
)

outputs = llm.generate(prompts, sampling_params)

# Stats are logged roughly every 5 seconds, so wait before the next
# generate call to give the speculative metrics a chance to print.
time.sleep(5)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
This can obviously be improved (maybe a flag to customize the interval?). LMK if you are interested in this and I can give you code pointers.
Thank you! This worked great.
Thanks, would be great to have some pointers.
The metrics are currently cumulative over the lifetime of the server.
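So if you want the rate over a particular window rather than since startup, you need to difference two snapshots yourself. A minimal sketch of the idea (the snapshot dicts below are placeholders for values scraped from the printed stats or /metrics, not a vLLM API):

# Hypothetical cumulative counters captured at two points in time.
snapshot_t1 = {"accepted": 10_000, "draft": 18_000}
snapshot_t2 = {"accepted": 32_594, "draft": 53_716}

# The counters only ever grow, so the per-window acceptance rate is the
# ratio of the deltas, not of the raw totals.
accepted_delta = snapshot_t2["accepted"] - snapshot_t1["accepted"]
draft_delta = snapshot_t2["draft"] - snapshot_t1["draft"]
print(f"acceptance rate over the window: {accepted_delta / draft_delta:.3f}")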
@cadedaniel I still can't see the acceptance rate metrics, even after sleeping 5 s or more. Here is my code:
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="/root/autodl-tmp/Qwen-7B",
    tensor_parallel_size=1,
    speculative_model="/root/autodl-tmp/Qwen-1_8B",
    num_speculative_tokens=1,
    use_v2_block_manager=True,
    disable_log_stats=False,
    trust_remote_code=True,
    max_model_len=2048,
)

outputs = llm.generate(prompts, sampling_params)

import time
time.sleep(50)
print("after 50s later \n")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
and here is the CLI output:
INFO 08-19 15:23:12 config.py:1450] Downcasting torch.float32 to torch.float16.
INFO 08-19 15:23:12 config.py:1450] Downcasting torch.float32 to torch.float16.
INFO 08-19 15:23:12 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/root/autodl-tmp/Qwen-7B', speculative_config=SpeculativeConfig(draft_model='/root/autodl-tmp/Qwen-1_8B', num_spec_tokens=1), tokenizer='/root/autodl-tmp/Qwen-7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/root/autodl-tmp/Qwen-7B, use_v2_block_manager=True, enable_prefix_caching=False)
/root/miniconda3/envs/llama_index/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
WARNING 08-19 15:23:12 tokenizer.py:129] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 08-19 15:23:12 spec_decode_worker.py:156] Configuring SpecDecodeWorker with proposer=<class 'vllm.spec_decode.multi_step_worker.MultiStepWorker'>
INFO 08-19 15:23:12 spec_decode_worker.py:170] Configuring SpecDecodeWorker with sampler=<class 'vllm.model_executor.layers.rejection_sampler.RejectionSampler'>
INFO 08-19 15:23:13 model_runner.py:720] Starting to load model /root/autodl-tmp/Qwen-7B...
Loading safetensors checkpoint shards: 0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 12% Completed | 1/8 [00:00<00:03, 1.90it/s]
Loading safetensors checkpoint shards: 25% Completed | 2/8 [00:01<00:03, 1.72it/s]
Loading safetensors checkpoint shards: 38% Completed | 3/8 [00:01<00:02, 1.68it/s]
Loading safetensors checkpoint shards: 50% Completed | 4/8 [00:02<00:02, 1.68it/s]
Loading safetensors checkpoint shards: 62% Completed | 5/8 [00:02<00:01, 1.66it/s]
Loading safetensors checkpoint shards: 75% Completed | 6/8 [00:03<00:01, 1.67it/s]
Loading safetensors checkpoint shards: 88% Completed | 7/8 [00:03<00:00, 1.86it/s]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:04<00:00, 1.83it/s]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:04<00:00, 1.76it/s]
INFO 08-19 15:23:18 model_runner.py:732] Loading model weights took 14.3919 GB
INFO 08-19 15:23:18 model_runner.py:720] Starting to load model /root/autodl-tmp/Qwen-1_8B...
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 2.74it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.67it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.68it/s]
INFO 08-19 15:23:18 model_runner.py:732] Loading model weights took 3.4594 GB
INFO 08-19 15:23:19 gpu_executor.py:102] # GPU blocks: 190, # CPU blocks: 512
INFO 08-19 15:23:21 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-19 15:23:21 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-19 15:23:31 model_runner.py:1225] Graph capturing finished in 10 secs.
INFO 08-19 15:23:31 model_runner.py:1024] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-19 15:23:31 model_runner.py:1028] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-19 15:23:40 model_runner.py:1225] Graph capturing finished in 8 secs.
Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
WARNING 08-19 15:23:40 multi_step.py:57] Prompt logprob is not supported by multi step workers. (e.g., speculative decode uses multi step workers).
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 2.05it/s, est. speed input: 10.28 toks/s, output: 32.88 toks/s]
after 50s later
Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 3.73it/s, est. speed input: 18.67 toks/s, output: 59.75 toks/s]
Prompt: 'The future of AI is', Generated text: ' exciting, and I am thrilled to be a part of this journey. With the'
Can you add a print here to verify that the acceptance rate metrics are being collected?
They should be printed here: https://github.com/vllm-project/vllm/blob/c6af027a35b657b20ec60adac77cb75264b65a98/vllm/engine/metrics.py#L386-L392
Thanks, I have solved this problem just by upgrading vLLM to the latest version.
I have been running the scripts from https://docs.vllm.ai/en/latest/models/spec_decode.html on how to do speculative decoding with vLLM.
However, it seems that the acceptance rate is not shown or output anywhere. Is there any way of computing or accessing it?