realhaik opened this issue 1 year ago (status: Open)
There is no specific fix available for this problem at the moment; someone will comment here soon 🤖
I cannot reproduce your issue. It seems to work as expected; see below.
```python
from typing import Optional

import fire

from llama import Llama


def main(
    ckpt_dir: str,
    tokenizer_path: str,
    temperature: float = 0.2,
    top_p: float = 0.95,
    max_seq_len: int = 512,
    max_batch_size: int = 8,
    max_gen_len: Optional[int] = None,
):
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
    )

    text = "The street is 068 Angelina Walks in West Hayfield of Virginia, 63622."

    instructions = [
        [
            {
                "role": "system",
                "content": """for each address in the following text return a json object [{"t":street,"c":city,"s":state,"d":country}]""",
            },
            {
                "role": "user",
                "content": text,
            },
        ],
    ] * 3  # same instruction 3 times

    results = generator.chat_completion(
        instructions,
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )

    for result in results:
        print(result["generation"]["role"].capitalize(), "says:\n")
        print(result["generation"]["content"])
        print("\n==================================\n")


if __name__ == "__main__":
    fire.Fire(main)
```
```
$ torchrun --nproc_per_node 1 my_main.py --ckpt_dir CodeLlama-7b-Instruct/ --tokenizer_path CodeLlama-7b-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 4
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 11.62 seconds
Assistant says:

{
  "t": "068 Angelina Walks",
  "c": "West Hayfield",
  "s": "Virginia",
  "d": "USA"
}

==================================

Assistant says:

{
  "t": "068 Angelina Walks",
  "c": "West Hayfield",
  "s": "Virginia",
  "d": "USA"
}

==================================

Assistant says:

{
  "t": "068 Angelina Walks",
  "c": "West Hayfield",
  "s": "Virginia",
  "d": "USA"
}

==================================
```
There are multiple posts on the internet about Llama 2 models generating bad output when running more than one instruction with the batch option. I can confirm that this is true for all Llama 2 and Code Llama models.
A single instruction works as expected, but two instructions make the model go crazy and output junk. The combined instruction length is within max_seq_len, so there is no truncation. It seems that the model becomes "less smart" when batching; a rough sketch of the failing two-instruction case follows.
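For concreteness, here is a minimal sketch of that failing case, assuming the same `Llama.build` setup and variables (`generator`, `max_seq_len`, `max_gen_len`, `temperature`, `top_p`) as the script above; the two prompts and the rough token count are illustrative, not taken from any specific report.

```python
# Illustrative only: two *different* dialogs submitted in one batch, reusing the
# generator, max_seq_len, max_gen_len, temperature and top_p defined above.
dialogs = [
    [{"role": "user", "content": "Extract the address from: 068 Angelina Walks, West Hayfield, Virginia, 63622."}],
    [{"role": "user", "content": "List three prime numbers greater than 100."}],
]

# Rough truncation check: chat_completion adds its own [INST] formatting tokens,
# but this shows the raw prompts are nowhere near max_seq_len.
for dialog in dialogs:
    n_tokens = len(generator.tokenizer.encode(dialog[0]["content"], bos=True, eos=False))
    print(f"{n_tokens} prompt tokens (max_seq_len={max_seq_len})")

# Sent one at a time, each dialog is answered sensibly; batched together like
# this, the reported behaviour is that at least one completion degrades to junk.
results = generator.chat_completion(
    dialogs,
    max_gen_len=max_gen_len,
    temperature=temperature,
    top_p=top_p,
)
for result in results:
    print(result["generation"]["content"])
```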
The question is why, and how can it be fixed?