huggingface / optimum-nvidia


model.generate returns strange output shape #57

Closed Quang-elec44 closed 7 months ago

Quang-elec44 commented 8 months ago

Hi, I'm testing the latest version with TinyLlama/TinyLlama-1.1B-Chat-v0.3. Here is the full script:

from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")

model = AutoModelForCausalLM.from_pretrained(model_id)  # Tesla T4 does not support fp8

CHAT_EOS_TOKEN_ID = 32002

prompt = "How to get in a good university?"
formatted_prompt = (
    f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
)

model_inputs = tokenizer(formatted_prompt, return_tensors="pt").to("cuda")  # tokenize the formatted prompt

generated_ids = model.generate(
    **model_inputs, 
    temperature=0.9,
    top_k=40, 
    top_p=0.7, 
    repetition_penalty=10,
    eos_token_id=CHAT_EOS_TOKEN_ID,
    max_new_tokens=128,
    min_length=30
)
print(tokenizer.batch_decode(generated_ids[0][0], skip_special_tokens=True)[0])

generated_ids is a tuple of two tensors, like this:

(tensor([[[    1,  1128,   304,  ..., 32002, 32002, 32002]]], device='cuda:0',
        dtype=torch.int32),
 tensor([[137]], device='cuda:0', dtype=torch.int32))

The output is ok:

How to get in a good university?
How do I know if my school is really rigorous and challenging enough for me, rather than just pushing too hard so that they make more money from the admissions office. And how can you tell what makes someone else's experience miserable or joyful when your own has been mostly boring as hell except reading Tolkien while waiting on hours lines at Starbucks (for no reason). 😂 Can we stop comparing ourselves with others already ? Shouldn’t each one of us be unique even though our schools are similar anyway….. Or should it simply come down only being human , not having

Can you explain the output? By the way, could you also provide documentation for generation_config? It differs from the HF one (some parameters are not supported).

mfuntowicz commented 8 months ago

Thanks for opening this issue @Quang-elec44

The returned output is a tuple with two tensors:

- the first one holds the generated token ids, shaped (batch_size, num_sequences, sequence_length) and right-padded with the eos token (32002 in your run);
- the second one holds the actual number of generated tokens for each sequence (137 in your run), so you can trim the padding before decoding.
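
For example, a minimal sketch of how the two tensors fit together when decoding (variable names are just illustrative, and this assumes the second tensor is the per-sequence length):

token_ids, lengths = generated_ids        # token_ids: (batch, num_sequences, max_len); lengths: (batch, num_sequences)
valid_len = lengths[0, 0].item()          # 137 in your run
sequence = token_ids[0, 0, :valid_len]    # drop the trailing eos padding (32002, 32002, ...)
print(tokenizer.decode(sequence, skip_special_tokens=True))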

We will provide better documentation soon on this

Hope it helps :)

omihub777 commented 3 months ago

Why is the return value implemented as a tuple, though? This discrepancy with the shape returned by transformers.generate() (i.e. tuple vs. torch.Tensor) prevents me from reusing the exact same inference code for both models. I would like to know if there is a specific reason for it. Thank you!
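
In the meantime, I'm working around it with a small wrapper along these lines (purely illustrative; to_token_tensor is not part of either library, and it assumes the first element of the tuple holds the token ids):

import torch

def to_token_tensor(output):
    """Normalize a generate() result to a (batch, seq_len) tensor of token ids."""
    if isinstance(output, torch.Tensor):   # transformers-style output
        return output
    token_ids, lengths = output            # optimum-nvidia-style (ids, lengths) tuple
    return token_ids.reshape(-1, token_ids.shape[-1])

generated = to_token_tensor(model.generate(**model_inputs, max_new_tokens=128))
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])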