microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Using multiple threads to call onnxruntime inference #11628

Open cqray1990 opened 2 years ago

cqray1990 commented 2 years ago


class mode():

    def __init__(self):
        # one session per model, hence multiple sessions
        self.sess1 = session()   # e.g. an onnxruntime inference session
        self.sess2 = session()

    def predict(self, img):
        pass

Here there are multiple sessions, because there are multiple models.

I create an instance like onnxmode = mode() and call predict from multiple threads; the prediction time becomes very slow. But with just a single thread calling predict, the time is normal. Why?

tianleiwu commented 2 years ago

Using multiple threads means the sessions compete for limited system resources (CPU cores, GPU, etc.), and that competition can slow each thread down. Below is a simple guide to reducing the competition with proper configuration:

If you need two models to run in sequence, e.g. run the encoder model first and then the decoder model, you do not need multiple threads. Use one thread to run the sessions one by one.

Note that two sessions may have extra overhead (for example, each session might have its own arena memory allocation and thread pool). If you are able to use the ORT C API, you can pass a shared allocator and a global thread pool to avoid that overhead. If you are able to run the models as subgraphs (for example, using the workflow operators If, Loop or BeamSearch, or creating your own custom op), you can use a single session for inference (although a session is created internally for each subgraph, ORT can arrange resources and share arena memory between them properly).
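
A rough Python sketch of the shared-allocator part, assuming the Python bindings expose create_and_register_allocator (the model file names are placeholders):

import onnxruntime as ort

# Register one CPU arena allocator with the environment; every session that
# opts in via "session.use_env_allocators" shares it instead of creating its own arena.
ort.create_and_register_allocator(
    ort.OrtMemoryInfo("Cpu", ort.OrtAllocatorType.ORT_ARENA_ALLOCATOR,
                      0, ort.OrtMemType.DEFAULT),
    None)  # None -> default arena configuration

so = ort.SessionOptions()
so.add_session_config_entry("session.use_env_allocators", "1")

sess1 = ort.InferenceSession("model1.onnx", sess_options=so)
sess2 = ort.InferenceSession("model2.onnx", sess_options=so)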

If your models are independent, it is better to have one process per model. Then you can use numactl to set the CPU affinity of each process. If your computer has multiple GPUs, you can also make a different GPU visible to each process.
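
A minimal multiprocessing sketch for truly independent models (the model file names, GPU ids, and frames are placeholders; numactl or other CPU-affinity tooling would be applied when launching each process):

import multiprocessing as mp
import onnxruntime as ort

def run_model(model_path, device_id, frames):
    # one process per model, each pinned to its own GPU
    sess = ort.InferenceSession(
        model_path,
        providers=[("CUDAExecutionProvider", {"device_id": device_id})])
    input_name = sess.get_inputs()[0].name
    for frame in frames:
        sess.run(None, {input_name: frame})

if __name__ == "__main__":
    frames = []  # the input images
    procs = [mp.Process(target=run_model, args=("model1.onnx", 0, frames)),
             mp.Process(target=run_model, args=("model2.onnx", 1, frames))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()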

cqray1990 commented 2 years ago

@tianleiwu

Because my task needs more than one model for prediction:

import time
from threading import Thread

class modelpp():
    def __init__(self):
        self.sess1 = session()
        self.sess2 = session()

    def __call__(self, img):
        out1 = self.sess1(img)
        out2 = self.sess2(out1)
        return out2

def run_session_thread(session, frames):
    # not shown in the original post; assumed to run the pipeline over all frames
    for frame in frames:
        session(frame)

def multit_thread(frames):
    sessions = [modelpp(), modelpp()]
    threads = []
    since_infer = time.time()
    for session in sessions:
        thread = Thread(target=run_session_thread, args=(session, frames))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()

    print("infer:", time.time() - since_infer)

    return

This makes predict slower. The same pattern in the C# API is also slower. How do I deal with this problem? With sessions = [modelpp()], i.e. only one thread, the time is normal.

cqray1990 commented 2 years ago

@tianleiwu What does this mean: "If you are able to use ORT C API, you can pass shared allocator and a global thread pool to avoid the overhead. If you are able to run models in subgraph (like using workflow operators If, Loop or BeamSearch, or create your own custom op)"? I do indeed use the C API for inference, since the C API is lower level, so I used Python to verify this problem, but the Python code is also slower.

How do I actually do what that quote describes (pass a shared allocator and a global thread pool, or run the models as subgraphs)?

cqray1990 commented 2 years ago

@tianleiwu I need multithreaded scheduling to do inference for the same task: many input images are passed in for inference, so we need multiple threads.

cqray1990 commented 2 years ago

@mattetti @mtodd

pranavsharma commented 2 years ago

Please take a look at https://onnxruntime.ai/docs/get-started/with-c.html (sections Global/shared threadpools, Share allocator(s) between sessions).

tianleiwu commented 2 years ago

@cqray1990, Regarding the code you shared:

out1 = self.sess1(img)
out2 = self.sess2(out1)

Session 2 looks like post-processing. You can actually export a single model that contains both the original model and the post-processing. Or you can use a tool to merge the two ONNX graphs into one. That way you avoid many of the problems caused by multiple sessions.
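
One such tool is onnx.compose in the onnx Python package; a minimal sketch, assuming placeholder file names and tensor names (model 1's output "output" feeds model 2's input "input"):

import onnx
from onnx import compose

m1 = onnx.load("model1.onnx")
m2 = onnx.load("model2.onnx")

# Prefix both graphs to avoid name clashes, then wire model 1's output
# into model 2's input.
merged = compose.merge_models(
    compose.add_prefix(m1, prefix="m1_"),
    compose.add_prefix(m2, prefix="m2_"),
    io_map=[("m1_output", "m2_input")])

onnx.save(merged, "merged.onnx")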

Regarding passing many input images: could you make your model accept a batch of images? ONNX supports dynamic axes, so it is possible to process a batch of N images in one inference run. If you are using Triton with onnxruntime, it can do dynamic batching for you.
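
If the model's batch dimension is dynamic, a minimal batching sketch (the model file name, shape, and dtype are placeholders):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model1.onnx")
input_name = sess.get_inputs()[0].name

# Stack N images into one batch and run them in a single inference call.
frames = [np.random.rand(3, 224, 224).astype(np.float32) for _ in range(8)]
batch = np.stack(frames)                       # shape (8, 3, 224, 224)
outputs = sess.run(None, {input_name: batch})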

cqray1990 commented 2 years ago

My task needs two models because I need the results of the first model, extract the needed data from them, and then feed that into the second model; it is serial execution.

cqray1990 commented 2 years ago

@tianleiwu

tianleiwu commented 2 years ago

@cqray1990,

If you are able to export the logic that extracts the needed data into the ONNX graph, then you can use one model to replace the two original models.

If the logic is not easy to represent with tensor operations, onnxruntime-extensions or creating your own custom op (for extracting the data needed by the second part) might help. In that way, you can use a custom op to link the two parts (corresponding to the previous two models) and merge them into one model.
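
A rough sketch of that idea using onnxruntime-extensions' Python custom op (PyOp) mechanism; the op name GetNeededData, its body, and the merged model file are made up for illustration:

import numpy as np
import onnxruntime as ort
from onnxruntime_extensions import onnx_op, PyCustomOpDef, get_library_path

# Python-implemented custom op (domain ai.onnx.contrib) standing in for the
# "get needed data" step between the two models.
@onnx_op(op_type="GetNeededData",
         inputs=[PyCustomOpDef.dt_float],
         outputs=[PyCustomOpDef.dt_float])
def get_needed_data(x):
    return x.astype(np.float32)   # placeholder post-processing

so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())

# "merged.onnx" is a placeholder for a graph that chains model 1,
# a GetNeededData node, and model 2.
sess = ort.InferenceSession("merged.onnx", sess_options=so)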

mmxuan18 commented 2 years ago

Does onnxruntime support running just a subgraph? For example, a model has inputs {input_1, input_2, input_3} and outputs {output_1, output_2, output_3}, and in the ONNX model input_1 is connected to output_1 by an Identity, with this input-output pair not related to any other node. Can I call session.run(None, {'input_1': data1})? @cqray1990

tianleiwu commented 2 years ago

@mmxuan18, I have not tried that before.

BTW, here is an example of merging two models (encoder and decoder) into one: https://github.com/microsoft/onnxruntime/blob/040c2f4517775ce8362903837346213babbdf6b1/onnxruntime/python/tools/transformers/models/t5/t5_encoder_decoder_init.py#L31-L57 The idea is to merge sequential execution into one model, not to merge two independent models.