huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Some functional problems in the implementation of Speculative Decoding #30608

Closed: transcend-0 closed this issue 1 month ago

transcend-0 commented 2 months ago

System Info

Python 3.10.11
transformers 4.40.0
torch 2.0.1
Linux version 4.15.0-55-generic x86_64

Who can help?

@ArthurZucker @gante

Reproduction

Checkpoints: vicuna-7b-v1.3 (target), vicuna-68m (draft)

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch
import os
import time

class Timer:
    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        self.start = time.time()
        result = self.func(*args, **kwargs)
        print(f'- {self.func.__name__} - Running time: {time.time() - self.start :.2f} s\n')
        return result

set_seed(42)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = "cuda"
torch.cuda.empty_cache()

prompt = "Long long ago"

target_checkpoint = "./models/vicuna-7b-v1.3/"
draft_checkpoint = "./models/vicuna-68m/"

tokenizer = AutoTokenizer.from_pretrained(target_checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

target_model = AutoModelForCausalLM.from_pretrained(target_checkpoint).to(device)
draft_model = AutoModelForCausalLM.from_pretrained(draft_checkpoint).to(device)

target_model.generation_config.update(**{
    'max_new_tokens': 12,
    'do_sample': False,
    'temperature': None,
    'top_k': None,
})
draft_model.generation_config.update(**{
    'num_assistant_tokens': 3,  # It might make more sense to set this parameter on target_model,
                                # since the "assistant_model" argument is passed to target_model.generate().
                                # In practice, however, it has to be set on draft_model to take effect.
    'num_assistant_tokens_schedule': 'constant',
    'do_sample': True,  # The "do_sample" setting of target_model seems to override the one in draft_model,
                        # which does not seem reasonable.
    'temperature': 0.7,
})

@Timer
@torch.no_grad()
def targetDecoding():
    outputs = target_model.generate(**inputs)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

@Timer
@torch.no_grad()
def draftDecoding():
    outputs = draft_model.generate(**inputs)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

@Timer
@torch.no_grad()
def speculativeDecoding():
    outputs = target_model.generate(**inputs, assistant_model=draft_model)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

targetDecoding()
#draftDecoding()
speculativeDecoding()

Outputs:

['Long long ago, in a galaxy far, far away, there was']
- targetDecoding - Running time: 1.00 s

['Long long ago, in the days of old,\nAnd the days the']
- speculativeDecoding - Running time: 0.42 s

Expected behavior

Hello, Hugging Face team!

There is a strange problem with Speculative Decoding (Assisted Decoding); there may be some bugs in the implementation of this feature.

  1. In theory (per the speculative decoding paper), when do_sample is set to False in the target model (greedy search), the target model's output is deterministic, so the result of speculative decoding (speculativeDecoding()) should be identical to that of the target model alone (targetDecoding()). But in the code above, this is not the case. Why is that? (A minimal check is sketched below.)
  2. Except for num_assistant_tokens, num_assistant_tokens_schedule, and max_new_tokens, the generation_config of target_model seems to override that of draft_model in Speculative Decoding, which is a little unreasonable.
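For point 1, here is a minimal check of what I mean, a sketch that reuses target_model, draft_model, tokenizer, and inputs from the reproduction above. Under greedy decoding, speculative decoding is expected to be lossless, so both calls should return exactly the same token ids.

# Sketch: compare plain greedy decoding with greedy assisted decoding.
with torch.no_grad():
    greedy_ids = target_model.generate(**inputs, do_sample=False, max_new_tokens=12)
    assisted_ids = target_model.generate(
        **inputs, do_sample=False, max_new_tokens=12, assistant_model=draft_model
    )

# If speculative decoding is lossless, this should print True and the decoded
# strings should be identical.
print(torch.equal(greedy_ids, assisted_ids))
print(tokenizer.batch_decode(greedy_ids, skip_special_tokens=True))
print(tokenizer.batch_decode(assisted_ids, skip_special_tokens=True))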
zucchini-nlp commented 2 months ago

@transcend-0 hey!

The issue was solved in #30068. You can install transformers from main with the following command to get correct generation with assisted decoding:

!pip install --upgrade git+https://github.com/huggingface/transformers.git

transcend-0 commented 2 months ago

@zucchini-nlp Thank you very much! 💛 But issue 2 (the generation_config of draft_model being overwritten by that of target_model) is not yet settled, which may be worth considering.

zucchini-nlp commented 2 months ago

@transcend-0 I did not notice the second point about the generation config.

I think overriding the draft model's generation config with the target model's is done as a performance enhancement, so that the assistant model follows the same generation logic as the target model. Maybe @gante has empirical evidence for that.

gante commented 1 month ago

Hey @transcend-0 👋

The assistant model's generation config is matched to the target's to ensure the assistant sees the same flags, ideally causing equivalent distribution shifts in both models. For instance, if we set generate to bias certain tokens, then we also want the assistant model to apply the same bias (to maximize the number of matches) 🤗
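As a small illustration (a hypothetical sketch reusing the models from the reproduction above; the biased word and bias value are made up), a flag such as sequence_bias passed to the target's generate call is part of the generation config the assistant is matched against, so the draft model proposes tokens under the same bias:

# Hypothetical sketch: discourage " galaxy" in both the target and the assistant
# model by passing sequence_bias to the target's generate call.
galaxy_ids = tuple(tokenizer(" galaxy", add_special_tokens=False).input_ids)
biased = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    do_sample=False,
    max_new_tokens=12,
    sequence_bias={galaxy_ids: -10.0},
)
print(tokenizer.batch_decode(biased, skip_special_tokens=True))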

gante commented 1 month ago

@transcend-0 likewise, the assistant should generate greedily (see the PR above) :)

transcend-0 commented 1 month ago

Thank you very much!