microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Apache License 2.0

terminate_server only releases memory on one GPU when using tensor_parallel #265

Closed baojunliu closed 1 year ago

baojunliu commented 1 year ago

I am trying to use two GPUs with tensor_parallel=2. terminate_server seems to release memory on only one GPU; some processes are still running on the other, so the client does not kill all server processes. I can kill the leftover process manually, but how do I do it properly from Python?

import mii

# Start a persistent server sharded across two GPUs.
client = mii.serve("mistralai/Mistral-7B-v0.1",
                   deployment_name="ray_scorer_deployment",
                   tensor_parallel=2)
response = client.generate("Deepspeed is", max_new_tokens=128)
print(response.response)

# Expected to free memory on both GPUs, but only one is released.
client.terminate_server()
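Until the fix lands, one stdlib-only workaround is to kill the leftover worker processes by PID from Python. This is a sketch, not part of the MII API; `kill_lingering_workers` is a hypothetical helper, and the PIDs would have to come from elsewhere (e.g. the process list shown by nvidia-smi):

```python
import os
import signal
import time

def kill_lingering_workers(pids, grace_period=1.0):
    """Hypothetical helper (not part of MII): terminate workers by PID.

    Sends SIGTERM first so the workers can shut down cleanly, then
    SIGKILL to anything that survives the grace period.
    """
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            continue  # process already exited
    time.sleep(grace_period)  # allow time for a clean shutdown
    for pid in pids:
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # already exited after SIGTERM
```

This is POSIX-only; upgrading to a fixed MII release makes it unnecessary.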
mrwyattii commented 1 year ago

This was a bug that has been fixed in #262. Please update to the latest main (we will also do a patch release with this and other bug fixes later this week).
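To confirm whether the installed build includes the fix, the version can be checked with the stdlib `importlib.metadata` (assuming the PyPI distribution name is `deepspeed-mii`):

```python
from importlib.metadata import version, PackageNotFoundError

# Print the installed deepspeed-mii version, if any.
try:
    print("deepspeed-mii version:", version("deepspeed-mii"))
except PackageNotFoundError:
    print("deepspeed-mii is not installed")
```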

mrwyattii commented 1 year ago

Closing, this was resolved in the latest release (v0.1.1).