vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Misc]: How to force generate a fixed response from llama3 #7770

Open · niuzheng168 opened this issue 2 months ago

niuzheng168 commented 2 months ago

Anything you want to discuss about vllm.

I found that Llama generates different responses for the same static input when:

  1. Running it multiple times
  2. Batching it with other inputs

This happens even when I set temperature=1, top_k=1, and a random seed.

The generated text is usually the same for the first few tokens, but after that it diverges.

Does anyone know how to force it to generate a fixed response?

import torch
from vllm import LLM, SamplingParams

torch.random.manual_seed(999)

llm = LLM(model='/home/Meta-Llama-3-8B-Instruct')
prompts = [
    "Hi my name is",
    "The capital of France is"
]

# Generate the same prompts multiple times with top_k=1 sampling
texts = []
for i in range(10):
    sampling_params = SamplingParams(temperature=1, top_k=1, max_tokens=100, top_p=1)
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        texts.append(generated_text)
for text in texts:
    print(text)

# Generate the same prompts again while growing the batch by two each iteration
texts = []
for i in range(5):
    prompts.append(prompts[0])
    prompts.append(prompts[1])

    sampling_params = SamplingParams(temperature=1, top_k=1, max_tokens=100)
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        texts.append(generated_text)

for text in texts:
    print(text)

arunpatala commented 2 months ago

By the way, temperature should be 0; vLLM treats that as greedy decoding.
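
For reference, a minimal greedy setup might look like the sketch below (the model path is the one from the snippet above; with temperature=0, vLLM picks the argmax token at each step, so top_k, top_p, and seeds no longer influence token selection):

from vllm import LLM, SamplingParams

# Sketch: temperature=0 means greedy decoding in vLLM, so no sampling
# randomness is involved in choosing the next token.
llm = LLM(model='/home/Meta-Llama-3-8B-Instruct')
greedy_params = SamplingParams(temperature=0, max_tokens=100)

outputs = llm.generate(["The capital of France is"], greedy_params)
print(outputs[0].outputs[0].text)

Even with greedy decoding, though, batching can still flip near-ties in the logits, which is what the paper below gets at.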

There was an interesting paper that discusses this issue; it basically says it's due to GPU non-determinism.

https://arxiv.org/pdf/2408.04667

Hope you find it useful.
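
As a rough, hypothetical illustration of that effect (a self-contained sketch, not code from the paper): the same hidden state can produce slightly different logits depending on whether it is computed alone or inside a larger batch, because different batch shapes can dispatch different GPU kernels with different floating-point reduction orders.

import torch

# Hypothetical sketch: compare the result for one row computed alone vs.
# inside a batch. On GPU, different batch shapes may select different
# matmul kernels / reduction orders, so the values can differ slightly.
torch.manual_seed(0)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
dtype = torch.float16 if device == 'cuda' else torch.float32

hidden = torch.randn(1, 4096, device=device, dtype=dtype)    # one "token" of hidden state
weight = torch.randn(4096, 8192, device=device, dtype=dtype) # stand-in for an LM head

logits_alone = hidden @ weight                          # processed as a batch of 1
logits_in_batch = (hidden.repeat(8, 1) @ weight)[0:1]   # same row inside a batch of 8

# Any nonzero difference can flip an argmax near a tie, which is exactly
# where greedy / top_k=1 generations start to diverge.
print((logits_alone - logits_in_batch).abs().max().item())

Once a single near-tie flips, every later token is conditioned on a different prefix, which matches the "same first few tokens, then divergence" pattern in the original report.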