Closed gitsand996 closed 1 year ago
Hi, I don't think this is related to the MPT architecture, but you could double-check by trying it with a different model. Assuming it is not architecture-specific, it may be a question better answered by the Hugging Face folks.
Hi,
I want to test the model's inference on my hardware. I am using a single A100 GPU instance with 60 GB of memory. I have created 4 processes (and 4 model instances, 12 GB each) to experiment with inference time under 4 parallel requests.
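For reference, the process-level setup I'm describing looks roughly like the sketch below. `fake_generate` is a stand-in for the real `model.generate()` call so the example runs anywhere; in the actual code each process loads its own 12 GB model copy onto the shared GPU first:

```python
import multiprocessing as mp

def fake_generate(rank: int, prompt: str) -> str:
    # Placeholder for model.generate(); in the real setup each process
    # would first load its own model instance, e.g. with
    # AutoModelForCausalLM.from_pretrained(...), then call generate().
    return f"worker-{rank}: {prompt}"

def worker(rank: int, prompt: str, q) -> None:
    # One independent process per concurrent request.
    q.put(fake_generate(rank, prompt))

ctx = mp.get_context("fork")  # fork keeps the sketch self-contained; Linux only
q = ctx.Queue()
procs = [ctx.Process(target=worker, args=(r, "hello", q)) for r in range(4)]
for p in procs:
    p.start()
# Drain the queue before joining to avoid a feeder-thread deadlock.
outputs = sorted(q.get() for _ in range(4))
for p in procs:
    p.join()
```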
Even though the requests are handled in parallel, model.generate() seems to block on GPU usage and doesn't actually parallelize. Is this related to the MPT architecture or to AutoModel memory management?
How can I achieve real parallelization? One solution might be to create more GPU partitions, but I am wondering if there is another way to solve this.
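In case it helps frame the partitioning idea: on an A100, NVIDIA's MIG (Multi-Instance GPU) feature can split the card into isolated instances, each with its own compute and memory slice. A rough sketch of the workflow; the profile names are illustrative and depend on your exact GPU variant and driver, so check `nvidia-smi mig -lgip` for what your card actually offers:

```shell
# Enable MIG mode on GPU 0 (may require a GPU reset).
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles supported by this card.
nvidia-smi mig -lgip

# Create four instances (profile name is an example; pick one from the
# list above) and their compute instances in one step with -C.
sudo nvidia-smi mig -cgi 2g.10gb,2g.10gb,2g.10gb,2g.10gb -C

# Each MIG device now shows up with its own UUID.
nvidia-smi -L

# Pin one serving process to one partition via CUDA_VISIBLE_DEVICES.
# ("serve.py" is a hypothetical serving script.)
CUDA_VISIBLE_DEVICES=MIG-<uuid> python serve.py
```

Each process then sees its partition as a standalone GPU, so the four `model.generate()` calls would no longer contend for the same compute slices.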
Thank you!