microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Question] graph early exiting #4484

Closed. ykim362 closed this issue 3 years ago.

ykim362 commented 4 years ago

Is your feature request related to a problem? Please describe.
During a single inference session, is there a way to stop execution early, before all the nodes in the graph have run, based on a certain condition? (For example, a model has 5 layers, but stop and return intermediate results just after finishing 2 layers.)


Describe the solution you'd like
Early exit from the graph. In my use case, the exit point needs to be decided dynamically; in other words, the exit layer can be different for every inference.


askhade commented 4 years ago

@ykim362: Is early exit possible? No. However, if you want to fetch intermediate outputs of the graph, you can do the following:

Build ORT with option : onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS https://github.com/microsoft/onnxruntime/blob/768ced703c49cf16545bc0324029b072f7261d0f/cmake/CMakeLists.txt#L81

When ORT is built with this option, it dumps all intermediate node outputs. In case you are interested, the code for dumping outputs is here: https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/core/framework/utils.h#L75
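To illustrate what a build with that flag dumps (a minimal sketch in plain Python, not ORT's actual implementation; the graph, node names, and `run_graph` helper below are all hypothetical), a graph executor can record every node's output as it executes:

```python
# Toy graph executor (illustrative only, not ORT code): records each
# node's output as it runs, mimicking what a build with
# onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS dumps.

def run_graph(nodes, inputs):
    """nodes: list of (name, fn, input_names), already in topological order.
    inputs: dict of initial tensor values.
    Returns (values, intermediates), where intermediates maps each node
    name to the output it produced."""
    values = dict(inputs)
    intermediates = {}
    for name, fn, input_names in nodes:
        out = fn(*(values[i] for i in input_names))
        values[name] = out
        intermediates[name] = out  # the "dumped" intermediate output
    return values, intermediates

# Hypothetical 3-node graph computing y = relu(x * 2) + 1.
nodes = [
    ("mul", lambda x: x * 2, ["x"]),
    ("relu", lambda v: max(v, 0), ["mul"]),
    ("add", lambda v: v + 1, ["relu"]),
]
values, inter = run_graph(nodes, {"x": -3})
print(inter)  # {'mul': -6, 'relu': 0, 'add': 1}
```

Note that, as in the debug build, every node still executes; this only exposes intermediate values and saves no computation.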

ykim362 commented 4 years ago

@askhade Thanks for the information. My goal here is to save computation (if I stop at layer 2, I don't need to compute layers 3 through 5; I can just use the output of layer 2) rather than to inspect the intermediate values. Is there any plan to add dynamic exiting? And could there be any workaround to achieve (or mimic) this for now?

askhade commented 4 years ago

I don't think we have any such plans right now. Can you elaborate on the cases in which this is useful?

@pranavsharma and @faxu: tagging for more comments.

ykim362 commented 4 years ago

@askhade There are multiple research papers that use this to improve inference speed by stopping early. One example is DeeBERT (paper: https://arxiv.org/pdf/2004.12993v1.pdf; code: https://github.com/huggingface/transformers/tree/master/examples/deebert and https://github.com/castorini/deebert).
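For context, the idea behind DeeBERT is to attach a small classifier ("off-ramp") after each transformer layer and exit as soon as that classifier's prediction entropy drops below a threshold. A schematic sketch in plain Python (illustrative only, not the DeeBERT code; the layers, classifiers, and threshold below are made up):

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def early_exit_forward(layers, classifiers, x, threshold):
    """DeeBERT-style inference sketch: after each layer, an attached
    classifier produces a distribution; if its entropy falls below the
    threshold, return immediately and skip the remaining layers."""
    for i, (layer, clf) in enumerate(zip(layers, classifiers)):
        x = layer(x)
        probs = clf(x)
        if entropy(probs) < threshold:
            return probs, i  # confident enough: exit here
    return probs, len(layers) - 1

# Toy 3-layer "model" whose off-ramps get progressively more confident.
layers = [lambda v: v + 1] * 3
classifiers = [
    lambda v: [0.5, 0.5],    # layer 1: maximally uncertain
    lambda v: [0.95, 0.05],  # layer 2: confident -> exit here
    lambda v: [0.99, 0.01],  # layer 3: never reached in this example
]
probs, exit_layer = early_exit_forward(layers, classifiers, 0, threshold=0.3)
print(exit_layer)  # 1 (exited after the second layer; layer 3 never ran)
```

This is exactly the dynamic behavior requested above: the exit layer differs per input, which is why a static graph-level mechanism is hard to express in ORT today.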

dashesy commented 4 years ago

Maybe a similar effect can be achieved by short-circuiting the graph computation when the result of a subgraph is multiplied by 0: `output_{i+1} = (output_i > threshold).float() * subgraph(x, output_i)`. Then you can have multiple outputs, one at each ramp, and the ones multiplied by 0 early on would hopefully be optimized out by the runtime.

P.S. I have an autoregressive model and am currently looking at a for-loop, but early exit may be difficult even then.
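The masking trick above can be sketched numerically (plain Python; the `subgraph` stages and threshold are hypothetical). Once a gate evaluates to 0, every later ramp output is 0, and the answer is the last nonzero ramp output; whether this actually saves compute depends on the runtime folding the multiply-by-zero chains away:

```python
def gated_forward(x, subgraphs, threshold):
    """Masking trick sketch: out_{i+1} = (out_i > threshold) * sub_i(x, out_i).
    Returns the output at every ramp. Without a runtime optimization that
    short-circuits multiply-by-zero, all stages still execute."""
    out = float(x)
    ramps = []
    for sub in subgraphs:
        gate = 1.0 if out > threshold else 0.0
        out = gate * sub(x, out)
        ramps.append(out)
    return ramps

# Hypothetical stages: each halves the running value.
subgraphs = [lambda x, o: o / 2] * 4
ramps = gated_forward(8.0, subgraphs, threshold=1.5)
print(ramps)  # [4.0, 2.0, 1.0, 0.0] -- the gate closes once out <= 1.5
```

Here the consumer would take the last nonzero ramp output (1.0) as the early-exit result.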

stale[bot] commented 3 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

stale[bot] commented 3 years ago

This issue has been automatically closed due to inactivity. Please reactivate if further support is needed.

spacemanidol commented 2 years ago

Has there been any work on this item? I would be keen to have support here and could work on having this supported.

ykim362 commented 2 years ago

> Has there been any work on this item? I would be keen to have support here and could work on having this supported.

@askhade might be able to answer?