Open MMY1994 opened 3 years ago
I'm currently struggling with the same problem. I want to run inference with two classification nets at the same time, where each net runs on a different GPU. I have searched the existing issues for multi-GPU but could not find a sufficient answer. Is it correct that the only existing workaround for this problem is to use threads, with ONNX Runtime initialized in each thread?
Inferencing on multiple GPUs can be done in one of three ways:
- pipeline parallelism, where the model is split offline into multiple models and each model is inferenced on a separate GPU in a pipelined fashion to maximize GPU utilization;
- tensor/model parallelism, where the computation of a model is split among multiple GPUs;
- a combination of both, when multiple machines are involved.
Pipeline parallelism is easier to achieve. We have experimental (not released yet, but functional) code to do this for GPT-style models and are in the process of refining it. For simple pipeline parallelism use cases (non-GPT), you can use, for example, Intel TBB's parallel_pipeline or flow graph constructs to create a pipeline of multiple partitions of the model, where each partition is associated with its own ORT session created on its own separate GPU device id. This can be accomplished entirely outside ORT's codebase using ORT's public APIs. If you have any more questions, feel free to reach out to me and I can point you to the experimental code we have. Thanks!
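The pipelined pattern described above can be sketched outside ORT with plain threads and queues. Below is a minimal Python illustration; `run_partition_a` and `run_partition_b` are hypothetical stand-ins for `Run()` calls on two ORT sessions bound to different GPU device ids (the real ORT calls are not shown):

```python
# Minimal pipeline-parallelism sketch: two "partitions" run in separate
# threads connected by queues, so partition A can start on the next input
# while partition B is still processing the previous one. In a real
# deployment each stage would call Run() on an ORT session created with a
# different CUDA device id; here the stages are stand-in functions.
import queue
import threading

SENTINEL = object()  # marks the end of the input stream

def run_partition_a(x):
    # placeholder for sess_a.run(...) on GPU 0
    return x * 2

def run_partition_b(x):
    # placeholder for sess_b.run(...) on GPU 1
    return x + 1

def stage(fn, inbox, outbox):
    # consume items from inbox, apply fn, forward results to outbox
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            break
        outbox.put(fn(item))

def pipeline(inputs):
    q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=stage, args=(run_partition_a, q_in, q_mid)),
        threading.Thread(target=stage, args=(run_partition_b, q_mid, q_out)),
    ]
    for t in threads:
        t.start()
    for x in inputs:
        q_in.put(x)
    q_in.put(SENTINEL)

    results = []
    while True:
        item = q_out.get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results

print(pipeline([1, 2, 3]))  # [3, 5, 7]
```

TBB's parallel_pipeline adds token-based throttling and in-order delivery on top of this basic idea, but the structure (one stage per model partition, each stage owning one session/GPU) is the same.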
Hi, thanks for the fast reply. Using threads and creating its own ORT session for each thread worked really well. I didn't have enough time on our dual-GPU machine yesterday, but I did some small tests with one ORT session, pushing the models to different GPUs. Somehow this worked as well; at least the results were as expected. Is there any downside in using one session but executing the models on different GPUs?
> Is there any downside in using one session but executing the models on different GPUs?
Currently ORT doesn't allow you to do this. In order to do so the model needs to be modified in-memory in a way that allows scatter/gather of computation across multiple GPUs. This is typically done by introducing nodes that do this scatter/gather. This is something we're currently experimenting with.
I'm not sure if we are talking about the same thing here. I have two separate models, A and B, which do not share anything. If I create a single ONNX environment, I can execute model A on GPU 0 and model B on GPU 1 in separate threads. Both threads share the same ONNX environment, and I create the sessions on the GPUs with OrtSessionOptionsAppendExecutionProvider_CUDA(sessionOptions, _pGPUID), where _pGPUID is 0 for model A and 1 for model B. I have checked our results, they are fine, and GPU-Z shows that both GPUs run the models at the same time. I have also verified this with logging in both threads.
My question: Why do I have to create separate ONNX environments for models A and B in each thread, when executing them in a single shared environment works as well?
> I'm not sure if we are talking about the same thing here. I have two separate models, A and B, which do not share anything. If I create a single ONNX environment, I can execute model A on GPU 0 and model B on GPU 1 in separate threads. Both threads share the same ONNX environment, and I create the sessions on the GPUs with OrtSessionOptionsAppendExecutionProvider_CUDA(sessionOptions, _pGPUID), where _pGPUID is 0 for model A and 1 for model B. I have checked our results, they are fine, and GPU-Z shows that both GPUs run the models at the same time. I have also verified this with logging in both threads.
> My question: Why do I have to create separate ONNX environments for models A and B in each thread, when executing them in a single shared environment works as well?
You don't have to use 2 different threads. You can do inferencing in the same thread, just not concurrently. So, in the same thread you can create 2 separate sessions (tied to separate GPU device ids) and inference them one after another. Does this not work for you? To ensure the correct GPU is used, you may call SetCurrentGpuDeviceId before calling Run() to set the device for the calling host thread.
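The "two sessions, one thread" pattern suggested above looks roughly like the following sketch; `FakeSession` is a hypothetical stand-in for a real ORT session created with a per-session CUDA device id (via OrtSessionOptionsAppendExecutionProvider_CUDA in the C API, or the CUDAExecutionProvider `device_id` option in the Python API), since the actual ORT calls are not shown:

```python
# Sketch of sequential inference with two sessions in one host thread.
# Each session is bound to its own GPU device id at creation time, then
# run() is called on them one after another; no second environment and
# no extra threads are needed. FakeSession is a hypothetical stand-in
# for a real ORT session.
class FakeSession:
    def __init__(self, device_id):
        self.device_id = device_id

    def run(self, x):
        # a real session would execute the model on self.device_id
        return (self.device_id, x)

sess_a = FakeSession(device_id=0)  # model A on GPU 0
sess_b = FakeSession(device_id=1)  # model B on GPU 1

# sequential inference in the same thread
out_a = sess_a.run("input_a")
out_b = sess_b.run("input_b")
```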
That works for me. I'm also able to run both models concurrently in different threads. It was not clear to me whether I have to use multiple environments if I want to run two models concurrently on two GPUs.
You don't need multiple environments. You need multiple sessions though.
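The concurrent variant can be sketched the same way: one shared environment, two sessions, one thread per session. `FakeEnv` and `FakeSession` are hypothetical stand-ins for the real environment and session objects, since the actual ORT calls are not shown:

```python
# Sketch: one shared environment, two sessions on different GPUs, run
# concurrently from two threads. Each thread only ever touches its own
# session. FakeEnv/FakeSession are hypothetical stand-ins for a real
# ORT environment and sessions created with per-session CUDA device ids.
import threading

class FakeEnv:
    pass

class FakeSession:
    def __init__(self, env, device_id):
        self.env = env            # sessions share the one environment
        self.device_id = device_id

    def run(self, x):
        # a real session would execute the model on self.device_id
        return (self.device_id, x)

env = FakeEnv()                   # a single process-wide environment
sess_a = FakeSession(env, device_id=0)
sess_b = FakeSession(env, device_id=1)

results = {}

def worker(name, sess, x):
    # each thread writes only its own key, so no lock is needed here
    results[name] = sess.run(x)

threads = [
    threading.Thread(target=worker, args=("A", sess_a, "input_a")),
    threading.Thread(target=worker, args=("B", sess_b, "input_b")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The key point is that the environment is process-wide and shared, while each session carries its own execution-provider configuration (including the GPU device id).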
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
> Inferencing on multiple GPUs can be done in one of three ways: pipeline parallelism (where the model is split offline into multiple models and each model is inferenced on a separate GPU in a pipelined fashion to maximize GPU utilization), tensor/model parallelism (where the computation of a model is split among multiple GPUs), or a combination of both when multiple machines are involved.
> Pipeline parallelism is easier to achieve. We have experimental (not released yet, but functional) code to do this for GPT-style models and are in the process of refining it. For simple pipeline parallelism use cases (non-GPT), you can use, for example, Intel TBB's parallel_pipeline or flow graph constructs to create a pipeline of multiple partitions of the model, where each partition is associated with its own ORT session created on its own separate GPU device id. This can be accomplished entirely outside ORT's codebase using ORT's public APIs.
Hi, thanks a lot for your comprehensive answer. You have proposed a solution for deploying models on multiple GPUs. If I understand correctly, the way you propose means dividing the PyTorch model into multiple parts, converting each part to ONNX, and then creating a specific ORT session for each ONNX part of the model? Is that what you mean? If my understanding is correct, do you have any quick solution for how we can connect the outputs of each part of the model to the inputs of the next, so that the total model works correctly?
Any updates on pipeline parallelism for GPT-like models @pranavsharma?
I have the same problem. Any updates on pipeline parallelism for GPT-like models?
I know how to run inference on a single GPU, using OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, gpu_id). But when I run inference on multiple GPUs, it reports errors such as [E:onnxruntime:OnnxruntimeInferenceEnv, cuda_call.cc:103 CudaCall] CUDA failure 700: an illegal memory access was encountered, or [E:onnxruntime:OnnxruntimeInferenceEnv, cuda_call.cc:103 CudaCall] CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR. Please tell me how to do that.