mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Multiple-model inference on a single GPU #259

Closed · gitsand996 closed 1 year ago

gitsand996 commented 1 year ago

Hi,

I want to test the model's inference on my hardware. I am using a single A100 GPU instance with 60 GB of memory. I have created 4 processes (and 4 model instances, 12 GB each) in order to measure inference time with 4 parallel requests, as in the sketch below.
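A minimal sketch of that setup, assuming an MPT checkpoint such as mosaicml/mpt-7b (the exact model and prompts are placeholders, not from the original report):

```python
import multiprocessing as mp

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mosaicml/mpt-7b"  # assumption: any MPT checkpoint that fits 4x in GPU memory


def worker(rank: int, prompt: str) -> None:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,  # MPT uses custom modeling code from the Hub
    ).to("cuda")
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # Each process issues its own generate() call on the same physical GPU.
    out = model.generate(**inputs, max_new_tokens=64)
    print(f"[worker {rank}] {tokenizer.decode(out[0], skip_special_tokens=True)}")


if __name__ == "__main__":
    mp.set_start_method("spawn")  # required for CUDA in child processes
    prompts = ["Request one", "Request two", "Request three", "Request four"]
    procs = [mp.Process(target=worker, args=(i, p)) for i, p in enumerate(prompts)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```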

Even though the requests are handled in parallel, model.generate() seems to block on GPU usage and doesn't actually run in parallel. Is this related to the MPT architecture or to AutoModel memory management?

How can I achieve real parallelization? I think one solution could be to create more GPU partitions, but I am wondering if there is another way to solve this.
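For reference, a sketch of the GPU-partition idea, assuming the A100 is split into MIG slices and each worker process is pinned to one slice via CUDA_VISIBLE_DEVICES (the UUIDs are placeholders; the real ones come from `nvidia-smi -L`):

```python
import multiprocessing as mp
import os

# Placeholder MIG device UUIDs, one per slice (hypothetical values).
MIG_SLICES = [
    "MIG-aaaaaaaa-0000-0000-0000-000000000000",
    "MIG-bbbbbbbb-0000-0000-0000-000000000000",
    "MIG-cccccccc-0000-0000-0000-000000000000",
    "MIG-dddddddd-0000-0000-0000-000000000000",
]


def worker(rank: int, mig_uuid: str) -> None:
    # Pin this process to a single MIG slice; must happen before CUDA is initialized.
    os.environ["CUDA_VISIBLE_DEVICES"] = mig_uuid
    import torch

    # Each process now sees exactly one slice as cuda:0; load the model and call
    # generate() here, as in the sketch above.
    print(f"[worker {rank}] sees {torch.cuda.device_count()} device(s): "
          f"{torch.cuda.get_device_name(0)}")


if __name__ == "__main__":
    mp.set_start_method("spawn")
    procs = [mp.Process(target=worker, args=(i, u)) for i, u in enumerate(MIG_SLICES)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```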

Thank you!

dakinggg commented 1 year ago

Hi, I don't think this would be related to the MPT architecture, but you could double-check by trying it with a different model. Assuming it is not related to the MPT architecture, it may be a question better answered by the Hugging Face folks.
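One way to run that check is to repeat the concurrent-generate experiment with a non-MPT model and compare wall-clock times; gpt2 is used below only as a small stand-in, and the timing harness is a sketch rather than anything from this repo:

```python
import multiprocessing as mp
import time


def timed_worker(rank: int, model_name: str) -> None:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, trust_remote_code=True
    ).to("cuda")
    inputs = tok("Hello", return_tensors="pt").to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=64)
    torch.cuda.synchronize()
    print(f"[{model_name} worker {rank}] generate took {time.perf_counter() - start:.2f}s")


if __name__ == "__main__":
    mp.set_start_method("spawn")
    for name in ("mosaicml/mpt-7b", "gpt2"):  # compare MPT vs. a non-MPT model
        procs = [mp.Process(target=timed_worker, args=(i, name)) for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
```

If the four concurrent generate() calls serialize for the non-MPT model as well, the behavior is not specific to MPT.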