Open thesby opened 2 months ago
import mii
import time

replica_num = 8
client = mii.serve("./Qwen1.5-0.5B-Chat", deployment_name='qwen', tensor_parallel=1, replica_num=replica_num)

while True:
    response = client.generate(["太阳与地球的距离是:", "月亮与地球的距离:"]*20, max_new_tokens=128)
    print(response)
watch -n 1 nvidia-smi
I ran this code on a machine with 8 GPUs. At any given time only one replica is busy and the other replicas sit idle. Is there a way to make all replicas work at once?
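One thing that may be relevant: the loop above issues a single blocking `generate` call at a time, so only one request is ever in flight. If each request is routed to one replica (an assumption about how the deployment balances load, not something confirmed here), the other replicas would have nothing to do. A minimal sketch of sending several requests concurrently follows. `blocking_generate` is a hypothetical stand-in that only simulates latency, so the sketch runs without GPUs or MII installed; in a real test it would be swapped for the actual client call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

REPLICA_NUM = 8  # mirrors replica_num in the snippet above


def blocking_generate(prompts):
    """Hypothetical stand-in for a blocking client.generate(...) call."""
    time.sleep(0.2)  # simulate inference latency
    return [f"response to {p}" for p in prompts]


def run_concurrently(batches, max_workers=REPLICA_NUM):
    """Submit one request per batch so that several are in flight at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input batches
        return list(pool.map(blocking_generate, batches))


# One small batch per replica instead of one large batch in a single call
batches = [[f"prompt {i}-a", f"prompt {i}-b"] for i in range(REPLICA_NUM)]
results = run_concurrently(batches)
print(f"received {len(results)} batches")
```

Whether this actually spreads work across replicas depends on the load balancer, so it is worth checking against `nvidia-smi` while the requests are running.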