vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: 1-card deployment and 2-card deployment yield inconsistent output logits. #4445

Open thisissum opened 5 months ago

thisissum commented 5 months ago

Your current environment

version: v0.4.1
device: A800 x 2
model: Qwen-14B-Chat

🐛 Describe the bug

I added a print statement in the following code.

# vllm/model_executor/layers/sampler.py, lines 53-58
assert logits is not None
_, vocab_size = logits.shape
print(torch.mean(logits).cpu())  # added by me: print the mean of the logits
# Apply min_tokens penalty which sets stop tokens to -inf if min_tokens
# have not been generated yet
logits = _apply_min_tokens_penalty(logits, sampling_metadata)

Even when using the same decoding parameters, the output logits still change when I increase tensor-parallel-size from 1 to 2.
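
A minimal repro sketch (not from the original report; the model ID, prompt, --tp flag, logprobs=5, and dump file names are illustrative assumptions): run the same greedy request once per tensor-parallel-size and save the per-token logprobs so the two runs can be compared numerically.

# Hypothetical repro sketch: run this script twice, once with --tp 1 and once
# with --tp 2, then compare the dumped logprobs.
import argparse

import torch
from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
parser.add_argument("--tp", type=int, default=1)  # tensor-parallel-size
args = parser.parse_args()

llm = LLM(
    model="Qwen/Qwen-14B-Chat",  # assumed model ID
    tensor_parallel_size=args.tp,
    trust_remote_code=True,
)

# Greedy decoding with a fixed seed; request per-token logprobs so the two runs
# can be compared beyond just the generated text.
params = SamplingParams(temperature=0.0, max_tokens=64, seed=1024, logprobs=5)
outputs = llm.generate(["Hello, how are you?"], params)  # assumed prompt

completion = outputs[0].outputs[0]
print(completion.text)
torch.save(completion.logprobs, f"logprobs_tp{args.tp}.pt")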

thisissum commented 5 months ago

I use "seed=1024" in generation

simon-mo commented 5 months ago

How large is the difference, and can you show a repro script?
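
A rough way to quantify the difference, sketched under the same assumptions as the dump script above (the logprobs_tp1.pt / logprobs_tp2.pt file names are hypothetical): load both dumps and compare the greedy token and its top logprob at every decoding step.

# Hypothetical comparison sketch: run the dump script with --tp 1 and --tp 2 first.
import torch

lp_tp1 = torch.load("logprobs_tp1.pt")
lp_tp2 = torch.load("logprobs_tp2.pt")

for step, (d1, d2) in enumerate(zip(lp_tp1, lp_tp2)):
    # Each entry maps token_id -> Logprob; the highest-logprob token is the greedy pick.
    t1 = max(d1, key=lambda t: d1[t].logprob)
    t2 = max(d2, key=lambda t: d2[t].logprob)
    diff = abs(d1[t1].logprob - d2[t2].logprob)
    flag = "MISMATCH" if t1 != t2 else ""
    print(f"step {step:3d}: tp1 token {t1}, tp2 token {t2}, |dlogprob| = {diff:.6f} {flag}")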