Closed gitsand996 closed 1 year ago
Hi, I don't think this is related to the MPT architecture, but you could double-check by trying it with a different model. Assuming it is not architecture-specific, it may be a question better answered by the Hugging Face folks.
Hi,
I want to test the model's inference on my hardware. I am using a single A100 GPU instance with 60 GB of memory. I have created 4 processes (and 4 model instances, 12 GB each) to experiment with inference time under 4 parallel requests.
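For reference, the process-level setup I'm describing looks roughly like the sketch below. `fake_generate` is a stand-in for the real `model.generate()` call so the example runs anywhere; in the actual code each process loads its own 12 GB model copy onto the shared GPU first:

```python
import multiprocessing as mp

def fake_generate(rank: int, prompt: str) -> str:
    # Placeholder for model.generate(); in the real setup each process
    # would first load its own model instance, e.g. with
    # AutoModelForCausalLM.from_pretrained(...), then call generate().
    return f"worker-{rank}: {prompt}"

def worker(rank: int, prompt: str, q) -> None:
    # One independent process per concurrent request.
    q.put(fake_generate(rank, prompt))

ctx = mp.get_context("fork")  # fork keeps the sketch self-contained; Linux only
q = ctx.Queue()
procs = [ctx.Process(target=worker, args=(r, "hello", q)) for r in range(4)]
for p in procs:
    p.start()
# Drain the queue before joining to avoid a feeder-thread deadlock.
outputs = sorted(q.get() for _ in range(4))
for p in procs:
    p.join()
```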
Even though the requests are handled in parallel, model.generate() seems to block on GPU usage and doesn't actually parallelize. Is this related to the MPT architecture or to AutoModel memory management?
How can I achieve real parallelization? One solution might be to create more GPU partitions, but I am wondering if there is another way to solve this.
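In case it helps frame the partitioning idea: on an A100, NVIDIA's MIG (Multi-Instance GPU) feature can split the card into isolated instances, each with its own compute and memory slice. A rough sketch of the workflow; the profile names are illustrative and depend on your exact GPU variant and driver, so check `nvidia-smi mig -lgip` for what your card actually offers:

```shell
# Enable MIG mode on GPU 0 (may require a GPU reset).
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles supported by this card.
nvidia-smi mig -lgip

# Create four instances (profile name is an example; pick one from the
# list above) and their compute instances in one step with -C.
sudo nvidia-smi mig -cgi 2g.10gb,2g.10gb,2g.10gb,2g.10gb -C

# Each MIG device now shows up with its own UUID.
nvidia-smi -L

# Pin one serving process to one partition via CUDA_VISIBLE_DEVICES.
# ("serve.py" is a hypothetical serving script.)
CUDA_VISIBLE_DEVICES=MIG-<uuid> python serve.py
```

Each process then sees its partition as a standalone GPU, so the four `model.generate()` calls would no longer contend for the same compute slices.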
Thank you!