trianxy opened 1 year ago
While debugging the above, I stumbled upon this and this code line, which seem to be related: there, if the context length is 1, `self.layer_past` is not reset and, possibly, some state from an earlier, unrelated model call will be reused!
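Roughly, that logic seems to amount to the following (a paraphrased sketch, not the actual DeepSpeed source; names and shapes are my assumptions):

```python
# Paraphrased sketch of the suspected caching logic in DeepSpeed's injected
# attention module; illustrative only, NOT the real source code:
def forward(self, input, *args, **kwargs):
    if input.shape[1] > 1:       # multi-token context: treated as a fresh
        self.layer_past = None   # prompt, so the cached state is dropped
    # context length == 1: self.layer_past is kept and silently reused,
    # even when the call belongs to a completely unrelated request
    ...
```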
I am happy to debug this further, but would need some pointers on why those lines were added. Maybe you, @awan-10, can help me debug this? (I see that you may have worked around the above code lines in the past.)
Also, I am probably missing some insights: upon watching `layer_past` travel across several methods, I see that it ends up in `DeepSpeedAttention.compute_attention(...)`, but it is not used inside that function. Perhaps the latter is overridden, but I don't know with what or how.
Based on this issue, the above non-deterministic behavior happens because DeepSpeed assumes, when you input a single token id, that you want a continuation of what was input before, and it uses its internal past-key-value cache for that.
Here is a more vivid example of this bug. Check how the model predicts the token ` 5`, apparently assuming that you want it to predict the next token after the prompt ` 5 10 5 10`, even though we never input that prompt or its past key values:
```python
from typing import Optional

import torch
import deepspeed
from transformers import AutoTokenizer, AutoModelForCausalLM

ARCHITECTURE = "gpt2"

model = AutoModelForCausalLM.from_pretrained(ARCHITECTURE).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(ARCHITECTURE, use_fast=True)
engine = deepspeed.init_inference(model, dtype=torch.float16, replace_with_kernel_inject=True)
model = engine.module


def print_next_token_after_prompt(prompt: str, pkv: Optional[tuple]) -> None:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model(input_ids=inputs["input_ids"], past_key_values=pkv)
    token_id = torch.argmax(output.logits[0][-1])
    token = tokenizer.decode(token_id)
    print(f"{prompt=} -> {token=}")


print_next_token_after_prompt(prompt=" 5 10 5", pkv=None)  # prompt=' 5 10 5' -> token=' 10'
print_next_token_after_prompt(prompt=" 10", pkv=None)      # prompt=' 10' -> token=','

# One (key, value) pair of empty tensors per layer, mimicking the
# past_key_values structure that transformers expects:
empty_past_key_values = model.config.n_layer * ((torch.Tensor([[[]]]), torch.Tensor([[[]]])),)

print_next_token_after_prompt(prompt=" 5 10 5", pkv=None)  # prompt=' 5 10 5' -> token=' 10'
# BUG: the single-token prompt ' 10' is continued from DeepSpeed's stale
# internal cache of the previous ' 5 10 5' call, so the model effectively
# sees ' 5 10 5 10' and predicts ' 5':
print_next_token_after_prompt(prompt=" 10", pkv=empty_past_key_values)  # prompt=' 10' -> token=' 5'
```
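For contrast, the vanilla transformers model keeps no hidden state between calls, so the single-token prompt behaves deterministically. A quick sanity check (a sketch; `vanilla` here is a fresh, non-injected copy of the model):

```python
# Sanity check against the plain HF model (no kernel injection). It holds no
# state between calls, so the expectation is the same token every time,
# matching the pkv=None run above.
vanilla = AutoModelForCausalLM.from_pretrained(ARCHITECTURE).to("cuda").eval()
inputs = tokenizer(" 10", return_tensors="pt").to("cuda")
with torch.no_grad():
    for _ in range(3):
        logits = vanilla(input_ids=inputs["input_ids"]).logits
        print(tokenizer.decode(torch.argmax(logits[0][-1])))  # expect ',' each time
```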
@trianxy - Thanks for tagging me. I think the only person who can explain this will be @RezaYazdaniAminabadi :)
**Describe the bug** For the models GPT-neo-1.3B, Bloom 1b7, Pythia 1.4b, and GPT2-xl, I get non-deterministic model outputs when using context length 1 and `engine = deepspeed.init_inference(model, dtype=torch.float16, replace_with_kernel_inject=True)`. Context length 1 may make this sound like a low-priority bug, BUT it may not be: when using transformers' `model.generate`, the context length of the ids may be cut down to 1 (because `past_key_values` is used to speed up model inference); see the sketch below. In particular, the above bug is a blocker for me in rewriting transformers' `model.generate` for my needs.
**To Reproduce**
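A minimal determinism check along these lines (a hypothetical stand-in for the `test_if_model_is_deterministic` helper referenced below; the exact test may differ):

```python
def test_if_model_is_deterministic(model, tokenizer, device: str) -> None:
    # Hypothetical sketch of the helper referenced under Expected behavior:
    # ask ten times for the next token after the same 1-token prompt.
    # A deterministic model prints the same token ten times.
    inputs = tokenizer(" 10", return_tensors="pt").to(device)
    with torch.no_grad():
        for _ in range(10):
            logits = model(input_ids=inputs["input_ids"]).logits
            print(tokenizer.decode(torch.argmax(logits[0][-1])))
```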
**Expected behavior** The 10 print outs after `test_if_model_is_deterministic(engine.module, tokenizer, device=DEVICE)` should not change, and they should be identical to the print outs after a second run of `test_if_model_is_deterministic(engine.module, tokenizer, device=DEVICE)`, but they are not.

**ds_report output**
**System info (please complete the following information):**