vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

MPT-based model generates different outputs for varying tensor_parallel_size #2891

Closed. sneha5gsm closed this issue 2 months ago.

sneha5gsm commented 9 months ago

Hi! I wanted to send longer inputs (around 8k input tokens) to a 7B MPT-based model, so I switched from a single-GPU instance (AWS SageMaker g5.2xlarge, 24 GB GPU memory) to a multi-GPU instance (AWS SageMaker g5.12xlarge, 4 GPUs, 24 GB GPU memory each) and changed tensor_parallel_size from 1 to 4 in the configuration. Now I get different outputs for the same input in the two setups, even with greedy token generation. Have I missed a configuration option? Why do the outputs differ when tensor_parallel_size changes from 1 to 4?

The model is served with djl-serving, with rolling_batch set to vllm. djl-serving configuration on the g5.2xlarge instance:

```
engine=Python
option.task=text-generation
option.s3url={{s3url}}
option.tensor_parallel_degree=1
option.rolling_batch=vllm
option.dtype=bf16
Dai.djl.logging.level=debug
```

djl-serving configuration on the g5.12xlarge instance:

```
engine=Python
option.task=text-generation
option.s3url={{s3url}}
option.tensor_parallel_degree=4
option.rolling_batch=vllm
option.dtype=bf16
Dai.djl.logging.level=debug
```

Deep learning container used: DJLServing 0.26.0 with DeepSpeed 0.12.6, Hugging Face Transformers 4.36.2, and Hugging Face Accelerate 0.25.0.
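For reference, the same comparison can be run against vLLM's offline API directly, bypassing djl-serving. This is a minimal sketch, not the exact deployment: `mosaicml/mpt-7b` stands in for the actual fine-tuned model, and greedy decoding is forced with `temperature=0.0`. Run it once with `tensor_parallel_size=1` and once with `4`, then diff the generated text:

```python
from vllm import LLM, SamplingParams

# Sketch only: mosaicml/mpt-7b stands in for the actual model; MPT
# checkpoints need trust_remote_code. Change tensor_parallel_size to 1
# on the single-GPU instance and compare the outputs.
llm = LLM(
    model="mosaicml/mpt-7b",
    tensor_parallel_size=4,
    dtype="bfloat16",
    trust_remote_code=True,
)

# temperature=0.0 gives greedy decoding, matching the setup above.
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["An 8k-token prompt goes here..."], params)
print(outputs[0].outputs[0].text)
```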

edit: config typo

hmellor commented 2 months ago

This happens because the reductions across the 4 tensor-parallel shards produce slightly different results than the single-GPU case, which performs no reductions at all. Floating-point addition is not associative, so summing the shards' partial results in a different order changes the low-order bits, and these small differences compound through the layers until a different token is chosen.
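A toy PyTorch sketch of both effects (made-up numbers, not the actual kernels): summation order changes the low-order bits, and a near-tie in the logits is enough to flip greedy argmax.

```python
import torch

# Floating-point addition is not associative, so reducing the same values
# in a different order (as a tensor-parallel all-reduce does) can change
# the low-order bits of the result.
torch.manual_seed(0)
parts = torch.randn(4, 4096, dtype=torch.bfloat16)

one_gpu = parts.reshape(-1).sum()                  # one contiguous reduction
four_shards = sum(shard.sum() for shard in parts)  # per-shard sums, then combine
print(one_gpu.item(), four_shards.item())          # often differ slightly

# A tiny perturbation is enough to flip greedy decoding when two
# candidate tokens are nearly tied.
logits = torch.tensor([10.0000, 9.9999])
print(torch.argmax(logits).item())                              # 0
print(torch.argmax(logits + torch.tensor([0.0, 2e-4])).item())  # 1
```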

So, unfortunately, this is expected. I'm going to close this issue for now, but I can reopen it if you believe the difference is caused by something else.