vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: 1-card deployment and 2-card deployment yield inconsistent output logits. #4445

Open thisissum opened 5 months ago

thisissum commented 5 months ago

Your current environment

version: v0.4.1
device: A800 x 2
model: Qwen-14B-Chat

🐛 Describe the bug

I added a print statement in the following code.

# vllm/model_executor/layers/sampler.py, lines 53-58
assert logits is not None
_, vocab_size = logits.shape
print(torch.mean(logits).cpu())  # added by me: print the mean of the logits
# Apply min_tokens penalty which sets stop tokens to -inf if min_tokens
# have not been generated yet
logits = _apply_min_tokens_penalty(logits, sampling_metadata)

Even when using the same decoding parameters, the output logits still change when I increase tensor-parallel-size from 1 to 2.
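
A minimal repro sketch (not from the original report; the model ID, prompt, --tp flag, logprobs=5, and dump file names are illustrative assumptions): run the same greedy request once per tensor-parallel-size and save the per-token logprobs so the two runs can be compared numerically.

# Hypothetical repro sketch: run this script twice, once with --tp 1 and once
# with --tp 2, then compare the dumped logprobs.
import argparse

import torch
from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
parser.add_argument("--tp", type=int, default=1)  # tensor-parallel-size
args = parser.parse_args()

llm = LLM(
    model="Qwen/Qwen-14B-Chat",  # assumed model ID
    tensor_parallel_size=args.tp,
    trust_remote_code=True,
)

# Greedy decoding with a fixed seed; request per-token logprobs so the two runs
# can be compared beyond just the generated text.
params = SamplingParams(temperature=0.0, max_tokens=64, seed=1024, logprobs=5)
outputs = llm.generate(["Hello, how are you?"], params)  # assumed prompt

completion = outputs[0].outputs[0]
print(completion.text)
torch.save(completion.logprobs, f"logprobs_tp{args.tp}.pt")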

thisissum commented 5 months ago

I use "seed=1024" in generation

simon-mo commented 5 months ago

How large is the difference, and can you show a repro script?
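
A rough way to quantify the difference, sketched under the same assumptions as the dump script above (the logprobs_tp1.pt / logprobs_tp2.pt file names are hypothetical): load both dumps and compare the greedy token and its top logprob at every decoding step.

# Hypothetical comparison sketch: run the dump script with --tp 1 and --tp 2 first.
import torch

lp_tp1 = torch.load("logprobs_tp1.pt")
lp_tp2 = torch.load("logprobs_tp2.pt")

for step, (d1, d2) in enumerate(zip(lp_tp1, lp_tp2)):
    # Each entry maps token_id -> Logprob; the highest-logprob token is the greedy pick.
    t1 = max(d1, key=lambda t: d1[t].logprob)
    t2 = max(d2, key=lambda t: d2[t].logprob)
    diff = abs(d1[t1].logprob - d2[t2].logprob)
    flag = "MISMATCH" if t1 != t2 else ""
    print(f"step {step:3d}: tp1 token {t1}, tp2 token {t2}, |dlogprob| = {diff:.6f} {flag}")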