This happens because the all-reduce operations across the 4 tensor-parallel shards produce slightly different floating-point results than a single-GPU run with no reductions. These small differences compound through the model and can lead to different tokens being chosen, even with greedy decoding.
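The underlying reason is that floating-point addition is not associative, so changing the reduction order changes the low-order bits. A quick NumPy sketch of the effect (not vLLM code; the array size and shard count are just for illustration):

```python
import numpy as np

# Summing the same values in a different order gives (slightly) different
# floating-point results, because fp addition is not associative.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

one_device = x.sum()                             # single reduction over all values
sharded = sum(s.sum() for s in np.split(x, 4))   # per-shard partial sums, then combined

print(one_device, sharded, one_device == sharded)
# The two results typically differ in the last few bits. In a deep model these
# tiny differences compound layer by layer and can eventually flip an argmax.
```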
So, unfortunately, this is expected. I'm going to close this issue for now, but I can reopen it if you believe that the difference is due to another reason.
Hi! I wanted to send longer inputs (around 8k input tokens) to a 7B MPT-based model, so I switched from a single-GPU instance (AWS SageMaker g5.2xlarge, 24 GB GPU memory) to a multi-GPU instance (AWS SageMaker g5.12xlarge, 4 GPUs, 24 GB GPU memory each) and changed tensor_parallel_size from 1 to 4 in the configuration. Now I get different outputs for the same input in the two scenarios, even though token generation is greedy. Have I missed some configuration? Why do the outputs change when tensor_parallel_size is changed from 1 to 4?
Serving the model using djl-serving with rolling_batch set to vllm. DJL Serving configuration on the g5.2xlarge instance:
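Roughly the following serving.properties (sketched with the standard LMI option names; the model_id and rolling-batch size below are placeholders, not the exact values used):

```
# placeholder model location; the real config points at the MPT-7B weights
engine=Python
option.model_id=<s3-or-hf-path-to-the-mpt-7b-model>
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.dtype=fp16
option.max_rolling_batch_size=32
option.trust_remote_code=true
```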
DJL Serving configuration on the g5.12xlarge instance:
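The same properties, sketched with only the tensor-parallel degree changed:

```
engine=Python
option.model_id=<s3-or-hf-path-to-the-mpt-7b-model>
option.rolling_batch=vllm
option.tensor_parallel_degree=4
option.dtype=fp16
option.max_rolling_batch_size=32
option.trust_remote_code=true
```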
Deep learning container used: DJLServing 0.26.0 with DeepSpeed 0.12.6, Hugging Face Transformers 4.36.2 and Hugging Face Accelerate 0.25.0
edit: config typo