This happens because the all-reduce operations across the 4 tensor-parallel shards produce slightly different floating-point results than a single-GPU run with no reductions. These small differences compound through the model and can lead to different tokens being chosen, even with greedy decoding.
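The underlying reason is that floating-point addition is not associative, so changing the reduction order changes the low-order bits. A quick NumPy sketch of the effect (not vLLM code; the array size and shard count are just for illustration):

```python
import numpy as np

# Summing the same values in a different order gives (slightly) different
# floating-point results, because fp addition is not associative.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

one_device = x.sum()                             # single reduction over all values
sharded = sum(s.sum() for s in np.split(x, 4))   # per-shard partial sums, then combined

print(one_device, sharded, one_device == sharded)
# The two results typically differ in the last few bits. In a deep model these
# tiny differences compound layer by layer and can eventually flip an argmax.
```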
So, unfortunately, this is expected. I'm going to close this issue for now, but I can reopen it if you believe that the difference is due to another reason.
Hi! I wanted to send longer inputs (around 8k input tokens) to a 7B MPT-based model, so I switched from a single-GPU instance (AWS SageMaker g5.2xlarge, 24 GB GPU memory) to a multi-GPU instance (AWS SageMaker g5.12xlarge, 4 GPUs, 24 GB GPU memory each) and changed tensor_parallel_size from 1 to 4 in the configuration. Now I get different outputs for the same input in the two scenarios, even though token generation is greedy. Have I missed some configuration? Why do the outputs change when tensor_parallel_size is changed from 1 to 4?
Serving the model using djl-serving with rolling_batch set to vllm. DJL Serving configuration on the g5.2xlarge instance:
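Roughly the following serving.properties (sketched with the standard LMI option names; the model_id and rolling-batch size below are placeholders, not the exact values used):

```
# placeholder model location; the real config points at the MPT-7B weights
engine=Python
option.model_id=<s3-or-hf-path-to-the-mpt-7b-model>
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.dtype=fp16
option.max_rolling_batch_size=32
option.trust_remote_code=true
```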
DJL Serving configuration on the g5.12xlarge instance:
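The same properties, sketched with only the tensor-parallel degree changed:

```
engine=Python
option.model_id=<s3-or-hf-path-to-the-mpt-7b-model>
option.rolling_batch=vllm
option.tensor_parallel_degree=4
option.dtype=fp16
option.max_rolling_batch_size=32
option.trust_remote_code=true
```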
Deep learning container used: DJLServing 0.26.0 with DeepSpeed 0.12.6, Hugging Face Transformers 4.36.2 and Hugging Face Accelerate 0.25.0
edit: config typo