meta-llama / codellama

Inference code for CodeLlama models

Batching generates broken answers #112

Open realhaik opened 1 year ago

realhaik commented 1 year ago

There are multiple posts on the internet about Llama 2 models generating bad output when running more than one instruction using the batch option. I can confirm this is true for all Llama 2 and CodeLlama models.
One instruction works as expected, but two instructions make the model go crazy and output junk. The combined instruction length is within max_seq_len, so there is no truncation... It seems that the model becomes "less smart" when batching.

The question is why this happens and how to fix it.

    dialogs = [
        [
            {
                "role": "system",
                "content": """for each address in the following text return a json object [{\""t\"":street,\""c\"":city,\""s\"":state,\""d\"":country}]"""
            },
            {"role": "user", "content": text},
        ],
        [
            {
                "role": "system",
                "content": """for each address in the following text return a json object[{\""t\"":street,\""c\"":city,\""s\"":state,\""d\"":country}]"""
            },
            {"role": "user", "content": text},
        ],
    ]

    results = generator.chat_completion(
        dialogs,
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )
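
For context, here is a minimal sketch (an illustration with dummy values, not the library's actual code) of how a batch of prompts with different lengths is typically packed into a single tensor before generation, with shorter prompts padded out to the longest one. If those pad positions were mishandled, only batched runs would degrade, which would match the symptom above.

import torch

# Hypothetical illustration with dummy token ids (not the actual llama code):
# two prompts of different lengths are packed into one (batch, total_len)
# tensor, and the shorter prompt is padded out to the longest one.
pad_id = -1
prompt_tokens = [
    [1, 529, 3924, 887],           # prompt A: 4 tokens
    [1, 529, 3924, 887, 12, 7],    # prompt B: 6 tokens
]

bsz = len(prompt_tokens)
max_prompt_len = max(len(t) for t in prompt_tokens)
max_gen_len = 8
total_len = max_prompt_len + max_gen_len

tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long)
for i, t in enumerate(prompt_tokens):
    tokens[i, : len(t)] = torch.tensor(t, dtype=torch.long)

print(tokens)
# Prompt A's row still contains pad_id at positions 4 and 5 when generation
# starts; a masking mistake at those positions would only show up in batched runs.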
GaganHonor commented 1 year ago

There is no specific fix available for this problem at the moment.

Someone from the team will likely comment here soon 🤖

99991 commented 1 year ago

I cannot reproduce your issue. It seems to work as expected; see below.

Code

from typing import Optional
import fire
from llama import Llama

def main(
    ckpt_dir: str,
    tokenizer_path: str,
    temperature: float = 0.2,
    top_p: float = 0.95,
    max_seq_len: int = 512,
    max_batch_size: int = 8,
    max_gen_len: Optional[int] = None,
):
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
    )

    text = "The street is 068 Angelina Walks in West Hayfield of Virginia, 63622."

    instructions = [
        [
            {
                "role": "system",
                "content": """for each address in the following text return a json object [{\""t\"":street,\""c\"":city,\""s\"":state,\""d\"":country}]"""
            },
            {
                "role": "user",
                "content": text
            }
        ],
    ] * 3 # same instruction 3 times

    results = generator.chat_completion(
        instructions,
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )

    for result in results:
        print(result['generation']['role'].capitalize(), "says:\n")
        print(result['generation']['content'])
        print("\n==================================\n")

if __name__ == "__main__":
    fire.Fire(main)

Output

$ torchrun --nproc_per_node 1 my_main.py --ckpt_dir CodeLlama-7b-Instruct/ --tokenizer_path CodeLlama-7b-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 4
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 11.62 seconds
Assistant says:

 {
"t": "068 Angelina Walks",
"c": "West Hayfield",
"s": "Virginia",
"d": "USA"
}

==================================

Assistant says:

 {
"t": "068 Angelina Walks",
"c": "West Hayfield",
"s": "Virginia",
"d": "USA"
}

==================================

Assistant says:

 {
"t": "068 Angelina Walks",
"c": "West Hayfield",
"s": "Virginia",
"d": "USA"
}

==================================
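
One caveat about this reproduction: the batch contains three identical prompts, so every sequence in the batch has the same length and prompt padding never comes into play. A stricter test would batch prompts of clearly different lengths; a sketch (reusing the chat_completion call from the script above, with made-up addresses) could look like this:

    # Sketch only: replace the `instructions = [...] * 3` block in the script above.
    # The two prompts deliberately differ in length so that the shorter one has to
    # be padded when the batch is packed together.
    short_text = "The street is 068 Angelina Walks in West Hayfield of Virginia, 63622."
    long_text = short_text + " The second address is 12 Example Road in Springfield of Illinois, 62701, United States."

    system_prompt = """for each address in the following text return a json object [{\""t\"":street,\""c\"":city,\""s\"":state,\""d\"":country}]"""

    instructions = [
        [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": short_text},
        ],
        [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": long_text},
        ],
    ]

    results = generator.chat_completion(
        instructions,
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )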