microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

How to inference on multi-gpus #8244

Open MMY1994 opened 3 years ago

MMY1994 commented 3 years ago

I know how to run inference on a single GPU, using OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, gpu_id). But when I run inference on multiple GPUs, it reports errors like [E:onnxruntime:OnnxruntimeInferenceEnv, cuda_call.cc:103 CudaCall] CUDA failure 700: an illegal memory access was encountered or [E:onnxruntime:OnnxruntimeInferenceEnv, cuda_call.cc:103 CudaCall] CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR. Please tell me how to do that.

bdnkth commented 3 years ago

I'm currently struggling with the same problem. I want to run inference with two classification nets at the same time, where each net runs on a different GPU. I have searched the existing issues for multi-GPU but could not find a sufficient answer. Is it correct that the only existing workaround for this problem is to use threads, with ONNX Runtime initialized in each thread?

pranavsharma commented 3 years ago

Inferencing on multiple GPUs can be done in one of three ways: pipeline parallelism (the model is split offline into multiple models, and each model is inferenced on a separate GPU in a pipelined fashion to maximize GPU utilization), tensor/model parallelism (the computation of a model is split among multiple GPUs), or a combination of both when multiple machines are involved.

Pipeline parallelism is easier to achieve. We have experimental (not yet released, but functional) code to do this for GPT-style models and are in the process of refining it. For simple pipeline parallelism use cases (non-GPT), you can use, for example, Intel TBB's parallel_pipeline or flow graph constructs to create a pipeline of multiple partitions of the model, where each partition is associated with its own ORT session created on its own separate GPU device ID. This can be accomplished entirely outside ORT's codebase using ORT's public APIs. If you have any more questions, feel free to reach out to me and I can point you to the experimental code we have. Thanks!
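For reference, here is a minimal sketch of what such a two-partition pipeline could look like with oneTBB's parallel_pipeline and the ORT C++ API. The partition file names (part_0.onnx, part_1.onnx), the tensor names (input, hidden, output), and the [1, 128] input shape are made-up placeholders, not anything from the experimental code mentioned above.

```cpp
// Minimal sketch only: assumes the model has been split offline into
// part_0.onnx / part_1.onnx, each with a single float tensor in and out.
#include <onnxruntime_cxx_api.h>
#include <tbb/parallel_pipeline.h>

#include <cstdint>
#include <memory>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "pipeline");

  // One session per partition, each pinned to its own GPU device id.
  auto make_session = [&](const char* path, int device_id) {
    Ort::SessionOptions so;
    OrtCUDAProviderOptions cuda{};
    cuda.device_id = device_id;
    so.AppendExecutionProvider_CUDA(cuda);
    return Ort::Session(env, path, so);
  };
  Ort::Session stage0 = make_session("part_0.onnx", 0);  // GPU 0
  Ort::Session stage1 = make_session("part_1.onnx", 1);  // GPU 1

  // Placeholder tensor names; use the names of your own partitions.
  const char* in0[] = {"input"};
  const char* out0[] = {"hidden"};
  const char* in1[] = {"hidden"};
  const char* out1[] = {"output"};

  // A few dummy input batches (shape [1, 128]) to push through the pipeline.
  Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  std::vector<std::vector<float>> raw(8, std::vector<float>(128, 0.f));
  const std::int64_t shape[] = {1, 128};
  std::size_t next = 0;

  using Tensors = std::vector<Ort::Value>;

  // Two serial stages; parallel_pipeline overlaps them so GPU 0 can start
  // batch i+1 while GPU 1 is still working on batch i.
  tbb::parallel_pipeline(
      /*max_number_of_live_tokens=*/2,
      tbb::make_filter<void, std::shared_ptr<Tensors>>(
          tbb::filter_mode::serial_in_order,
          [&](tbb::flow_control& fc) -> std::shared_ptr<Tensors> {
            if (next == raw.size()) { fc.stop(); return nullptr; }
            auto& r = raw[next++];
            Tensors in;
            in.push_back(Ort::Value::CreateTensor<float>(mem, r.data(), r.size(), shape, 2));
            // Run the first partition on GPU 0.
            return std::make_shared<Tensors>(
                stage0.Run(Ort::RunOptions{nullptr}, in0, in.data(), 1, out0, 1));
          }) &
      tbb::make_filter<std::shared_ptr<Tensors>, void>(
          tbb::filter_mode::serial_in_order,
          [&](std::shared_ptr<Tensors> hidden) {
            // Run the second partition on GPU 1.
            Tensors out = stage1.Run(Ort::RunOptions{nullptr}, in1, hidden->data(), 1, out1, 1);
            // ... consume the final output here ...
          }));
  return 0;
}
```

The token limit of 2 is what lets the two stages overlap and keep both GPUs busy; the intermediate tensors are passed through a shared_ptr because Ort::Value is move-only.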

bdnkth commented 3 years ago

Hi, thanks for the fast reply. Using threads and creating a separate ORT session for each thread worked really well. I did not have enough time on our dual-GPU machine yesterday, but I did some small tests with one ORT session and pushing the models to different GPUs. Somehow this worked as well; at least the results were as expected. Is there any downside to using one session but executing the models on different GPUs?

pranavsharma commented 3 years ago

Is there any downside to using one session but executing the models on different GPUs?

Currently ORT doesn't allow you to do this. In order to do so, the model would need to be modified in-memory in a way that allows scatter/gather of computation across multiple GPUs. This is typically done by introducing nodes that perform the scatter/gather. This is something we're currently experimenting with.

bdnkth commented 3 years ago

I'm not sure if we are talking about the same thing here. I have two separate models, A and B, which do not share anything. If I create a single ONNX Runtime environment, I can execute model A on GPU 0 and model B on GPU 1 in separate threads. Both threads share the same ONNX Runtime environment, and I create the models on the GPUs with OrtSessionOptionsAppendExecutionProvider_CUDA(sessionOptions, _pGPUID), where _pGPUID is 0 for model A and 1 for model B. I have checked our results, they are fine, and GPU-Z shows that both GPUs run the models at the same time. I have also verified this with logging in both threads.

My question: why do I have to create separate ONNX Runtime environments for models A and B in each thread, when executing them in a single shared environment works as well?

pranavsharma commented 3 years ago

I'm not sure if we are talking about the same thing here. I have two separate models, A and B, which do not share anything. If I create a single ONNX Runtime environment, I can execute model A on GPU 0 and model B on GPU 1 in separate threads. Both threads share the same ONNX Runtime environment, and I create the models on the GPUs with OrtSessionOptionsAppendExecutionProvider_CUDA(sessionOptions, _pGPUID), where _pGPUID is 0 for model A and 1 for model B. I have checked our results, they are fine, and GPU-Z shows that both GPUs run the models at the same time. I have also verified this with logging in both threads.

My question: why do I have to create separate ONNX Runtime environments for models A and B in each thread, when executing them in a single shared environment works as well?

You don't have to use two different threads. You can do inferencing in the same thread, just not concurrently. So, in the same thread you can create two separate sessions (tied to separate GPU device IDs) and run inference on them one after another. Does this not work for you? To ensure the correct GPU is used, you may call SetCurrentGpuDeviceId before calling Run() to set the device for the calling host thread.
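As an illustration of that pattern (not the only way to do it), a sketch with one environment, two sessions pinned to different device IDs, and sequential Run() calls from a single thread could look like the following; the model paths, input/output names, and shapes are made-up placeholders, and SetCurrentGpuDeviceId is called through the C API handle returned by Ort::GetApi().

```cpp
// Sketch: one Ort::Env, two sessions on different GPUs, run one after
// another in the same host thread. Paths, tensor names and shapes are
// placeholders for illustration only.
#include <onnxruntime_cxx_api.h>

#include <cstdint>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "multi_gpu");

  auto make_session = [&](const char* path, int device_id) {
    Ort::SessionOptions so;
    OrtCUDAProviderOptions cuda{};
    cuda.device_id = device_id;
    so.AppendExecutionProvider_CUDA(cuda);
    return Ort::Session(env, path, so);
  };
  Ort::Session session_a = make_session("model_a.onnx", 0);  // GPU 0
  Ort::Session session_b = make_session("model_b.onnx", 1);  // GPU 1

  // Dummy CPU input tensor reused for both calls, for brevity.
  Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  std::vector<float> data(128, 0.f);
  const std::int64_t shape[] = {1, 128};
  Ort::Value input = Ort::Value::CreateTensor<float>(mem, data.data(), data.size(), shape, 2);
  const char* in_names[] = {"input"};
  const char* out_names[] = {"output"};

  // Pin the calling thread to the right device before each Run(), as
  // suggested above (C API call, reached via Ort::GetApi()).
  Ort::ThrowOnError(Ort::GetApi().SetCurrentGpuDeviceId(0));
  auto out_a = session_a.Run(Ort::RunOptions{nullptr}, in_names, &input, 1, out_names, 1);

  Ort::ThrowOnError(Ort::GetApi().SetCurrentGpuDeviceId(1));
  auto out_b = session_b.Run(Ort::RunOptions{nullptr}, in_names, &input, 1, out_names, 1);
  return 0;
}
```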

bdnkth commented 3 years ago

That works for me. I'm also able to run both models concurrently in different threads. It was just not clear to me whether I have to use multiple environments if I want to run two models concurrently on two GPUs.

pranavsharma commented 3 years ago

You don't need multiple environments. You need multiple sessions though.
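Putting the thread's conclusion together, a hedged sketch of the pattern that worked above (one shared Ort::Env, one session per model and GPU, one std::thread per session) might look like this; the model paths and tensor names are again placeholders:

```cpp
// Sketch of the pattern discussed above: a single shared Ort::Env,
// one session per model (each pinned to its own GPU), run concurrently
// from two host threads. Paths and tensor names are placeholders.
#include <onnxruntime_cxx_api.h>

#include <cstdint>
#include <thread>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "shared_env");  // shared by both sessions

  auto make_session = [&](const char* path, int device_id) {
    Ort::SessionOptions so;
    OrtCUDAProviderOptions cuda{};
    cuda.device_id = device_id;
    so.AppendExecutionProvider_CUDA(cuda);
    return Ort::Session(env, path, so);
  };
  Ort::Session model_a = make_session("model_a.onnx", 0);  // GPU 0
  Ort::Session model_b = make_session("model_b.onnx", 1);  // GPU 1

  auto run_once = [](Ort::Session& session) {
    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    std::vector<float> data(128, 0.f);
    const std::int64_t shape[] = {1, 128};
    Ort::Value input = Ort::Value::CreateTensor<float>(mem, data.data(), data.size(), shape, 2);
    const char* in_names[] = {"input"};
    const char* out_names[] = {"output"};
    auto output = session.Run(Ort::RunOptions{nullptr}, in_names, &input, 1, out_names, 1);
    // ... use output ...
  };

  // Both GPUs work at the same time, one session per thread.
  std::thread t_a(run_once, std::ref(model_a));
  std::thread t_b(run_once, std::ref(model_b));
  t_a.join();
  t_b.join();
  return 0;
}
```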

stale[bot] commented 2 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

hannan72 commented 2 years ago

Inferencing on multiple GPUs can be done in one of three ways: pipeline parallelism (the model is split offline into multiple models, and each model is inferenced on a separate GPU in a pipelined fashion to maximize GPU utilization), tensor/model parallelism (the computation of a model is split among multiple GPUs), or a combination of both when multiple machines are involved.

Pipeline parallelism is easier to achieve. We have experimental (not yet released, but functional) code to do this for GPT-style models and are in the process of refining it. For simple pipeline parallelism use cases (non-GPT), you can use, for example, Intel TBB's parallel_pipeline or flow graph constructs to create a pipeline of multiple partitions of the model, where each partition is associated with its own ORT session created on its own separate GPU device ID. This can be accomplished entirely outside ORT's codebase using ORT's public APIs. If you have any more questions, feel free to reach out to me and I can point you to the experimental code we have. Thanks!

Hi, thanks a lot for your comprehensive answer. You proposed a solution for deploying models on multiple GPUs. Do you mean dividing the PyTorch model into multiple parts, converting each part to ONNX, and then creating a dedicated ORT session for each ONNX part of the model? If I understood correctly, do you also have a quick suggestion for how to connect the outputs of each part of the model to the next part so that the model as a whole works correctly?

NouamaneTazi commented 2 years ago

Any updates on pipeline parallelism for GPT-like models @pranavsharma?

lidesheng0477 commented 1 year ago

I got the same problem. Any updates on pipeline parallelism for GPT-like models?