
Llamav2 inference - confusing prompts #1381

Open shubhamagarwal92 opened 1 year ago

shubhamagarwal92 commented 1 year ago

Hi,

It is not clear whether we need to follow the prompt template for inference with pipeline, as mentioned here, or whether we should follow the pipeline code without special tokens, as defined here.

Take, for example, a modified version of the example code here:

import torch
from transformers import AutoTokenizer
import transformers

model = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model, model_max_length=3500, truncation=True)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

system_prompt = 'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?'
text = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"

sequences = pipeline(
    text,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    return_full_text=False,
    max_new_tokens=300,
    temperature=0.9
)

Questions:

  1. If we need to control the length of input sequences, should we initialize the tokenizer with model_max_length=X, truncation=True?
  2. Shouldn't we then also pass the tokenizer when defining the pipeline, as above?
  3. If we also need to control the length of output sequences, should we pass max_new_tokens=X to the pipeline?
  4. So, is model_max_length independent of max_new_tokens? Or is it model_max_length = input_length + max_new_tokens?
  5. In the code above, do we need to pass system_prompt or text when calling the pipeline?
  6. Does this change when we are calling dialog models like 7B-chat/13B-chat/70B-chat compared to the 7B/13B/70B base models?
  7. What about fine-tuning on our own dataset? Do we need to provide the input text as prompts with special tokens for base/chat models?

Related issues here: https://github.com/huggingface/transformers/issues/4501 https://github.com/facebookresearch/llama-recipes/issues/114

Thanks in advance!

cc @pirj @osanseviero

pirj commented 1 year ago

You probably meant someone else, not @pirj

shubhamagarwal92 commented 1 year ago

Ah, sorry about that! Your name was being suggested by GitHub!

osanseviero commented 1 year ago

cc @pcuenca and @philschmid as well here

If we need to control the length of input sequences should we initialize tokenizer with model_max_length=X, truncation=True?

Yes.
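
A minimal sketch of that setup, reusing the checkpoint from the snippet above (the 3500-token limit is just the value from that snippet; as far as I know, truncation itself is applied when the tokenizer is called):

from transformers import AutoTokenizer

# Cap inputs at 3500 tokens (illustrative value taken from the snippet above).
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    model_max_length=3500,
)

# Truncation to model_max_length happens at call time:
encoded = tokenizer("a very long prompt ...", truncation=True)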

Shouldn't we then also pass the tokenizer when defining pipeline as above?

pipeline automatically picks up the tokenizer of the corresponding model, so specifying the tokenizer is not needed.
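
So a sketch like the following should be enough when you don't need a customized tokenizer (reusing the checkpoint from the question):

import torch
import transformers

# The pipeline resolves the tokenizer from the model repo automatically,
# so it does not need to be passed explicitly.
pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)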

If we need to also control the length of output sequences, should we pass max_new_tokens=X to pipeline?

You can pass generation params as you said (but at inference time, not when loading). I recommend checking the text generation docs at https://huggingface.co/docs/transformers/main/main_classes/text_generation to dive into the parameters.
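
For example, a sketch that reuses pipeline and text from the snippet in the question and passes the generation parameters at call time (values are illustrative):

# Generation parameters go into the pipeline call, not into transformers.pipeline(...).
sequences = pipeline(
    text,
    do_sample=True,
    top_k=10,
    temperature=0.9,
    max_new_tokens=300,  # caps only the number of newly generated tokens
    return_full_text=False,
)
print(sequences[0]["generated_text"])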

In the code above, do we need to pass system_prompt or text when calling pipeline?

Yes, although you can get OK results without it. If you want to pass a system prompt to the chat Llamas, you need to configure the prompt format as suggested in the blog post :)
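
For reference, a sketch of that chat prompt format, assuming the "Breaking Bad" question is really the user message and the system prompt is a separate behavioural instruction (the system prompt text below is purely illustrative):

# Llama 2 chat format: system prompt inside <<SYS>> tags, user message before [/INST].
# The tokenizer adds the <s> (BOS) token itself, so it is omitted from the string.
system_prompt = "You are a helpful assistant that recommends TV shows."  # illustrative
user_message = 'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?'

text = f"""[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_message} [/INST]"""

# Then pass `text` to the pipeline call as before.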

I also suggest posting questions in the forum so they are easier for others to find! https://discuss.huggingface.co/