microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Apache License 2.0
1.76k stars 163 forks

Only one replica runs even though multiple replicas are set #465

Open thesby opened 2 months ago

thesby commented 2 months ago
import mii
import time

replica_num = 8

client = mii.serve("./Qwen1.5-0.5B-Chat", deployment_name='qwen', tensor_parallel=1, replica_num=replica_num)
while True:
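    # Prompts (Chinese): "The distance between the Sun and the Earth is:" and "The distance between the Moon and the Earth:"
    response = client.generate(["太阳与地球的距离是:", "月亮与地球的距离:"]*20, max_new_tokens=128)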
    print(response)
GPU usage is monitored with: watch -n 1 nvidia-smi

I ran this code on a machine with 8 GPUs. At any given time only one replica is running, while the other replicas sit idle. Is there a solution?
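
One possible factor, stated here as an assumption rather than a confirmed diagnosis: the loop above issues a single blocking `generate` call at a time, so if the deployment routes each request to one replica, only one replica would have work at any moment. A minimal sketch of splitting the prompts into several requests and sending them concurrently follows. The helpers `chunk` and `generate_concurrently` are hypothetical names introduced for illustration, and `generate_fn` stands for any callable that takes a list of prompts and returns a list of responses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def chunk(items: Sequence[str], num_chunks: int) -> List[List[str]]:
    """Split items into at most num_chunks contiguous, roughly equal parts."""
    num_chunks = max(1, min(num_chunks, len(items)))
    size, extra = divmod(len(items), num_chunks)
    parts, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks receive one additional item.
        end = start + size + (1 if i < extra else 0)
        parts.append(list(items[start:end]))
        start = end
    return parts


def generate_concurrently(
    generate_fn: Callable[[List[str]], list],
    prompts: Sequence[str],
    num_requests: int,
) -> list:
    """Issue up to num_requests calls to generate_fn in parallel threads.

    Results are returned in the same order as the input prompts.
    """
    if not prompts:
        return []
    batches = chunk(prompts, num_requests)
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        # pool.map preserves the order of the submitted batches.
        results = list(pool.map(generate_fn, batches))
    return [response for batch in results for response in batch]
```

With the script above, `generate_fn` could be something like `lambda batch: client.generate(batch, max_new_tokens=128)` and `num_requests` could be set to `replica_num`. Whether a single MII client can safely be shared across threads is not established here; if it cannot, running separate processes that each obtain their own client may be needed instead.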