
Document beamsearch #12584

Open dashesy opened 2 years ago

dashesy commented 2 years ago

Is your feature request related to a problem? Please describe. I currently run the encoder ONNX model, get the features, then prepare things like input_ids and pass them to a separate decoder ONNX model multiple times. This process is not very efficient, especially if the models run on GPU and we have to copy some tensors to CPU and back to GPU again.

I tried combining them, but that runs into cyclic-graph issues and produces very large models. I just realized there is some "BeamSearch" support in the kernels and a "convert_beam_search.py" script in some commits.

It would be nice if convert_beam_search.py could work with any two encoder-decoder models. The models I have are neither GPT nor T5.

System information

Describe the solution you'd like A general method to combine an encoder and a decoder and run beam search. As long as the encoder returns a specific tensor and the decoder accepts specific tensors, this should be a generic operation.

It would be nice if the conversion were smart enough to reuse the decoding history.

Describe alternatives you've considered I use two ONNX files and run the autoregressive loop outside ONNX.

Additional context I am interested in Vision-Language (VL) models. These models are SOTA, and some have specific architectures that do convert to ONNX but are not standard or well known.

yufenglee commented 2 years ago

Is this IOBinding feature what you are looking for? https://github.com/microsoft/onnxruntime/blob/0c6037b5abe571fc43a55ef7a9d2f846820fbe5d/docs/python/inference/api_summary.rst#data-on-device
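For reference, a minimal IOBinding sketch in Python. It assumes the CUDA execution provider and borrows the "hidden_states"/"input_ids"/"logits" names that come up later in this thread; file names and shapes are placeholders:

```python
# Minimal IOBinding sketch: keep the encoder output and the decoder I/O on the
# GPU between steps instead of copying through numpy on the CPU.
# Assumes the CUDA EP; tensor names, file names, and shapes are illustrative.
import numpy as np
import onnxruntime as ort

decoder = ort.InferenceSession("decoder.onnx", providers=["CUDAExecutionProvider"])

hidden_states = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 50, 768), dtype=np.float32), "cuda", 0)   # placeholder encoder output
input_ids = ort.OrtValue.ortvalue_from_numpy(
    np.array([[0]], dtype=np.int64), "cuda", 0)

binding = decoder.io_binding()
binding.bind_ortvalue_input("hidden_states", hidden_states)
binding.bind_ortvalue_input("input_ids", input_ids)
binding.bind_output("logits", "cuda")    # let ORT allocate the output on the GPU

decoder.run_with_iobinding(binding)
logits = binding.get_outputs()[0]        # OrtValue that stays on the device
```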

dashesy commented 2 years ago

Not exactly. If there is a tensor that needs to be manipulated before passing it to the decoder at each step, that is a trip to CPU and back. There is also the extra overhead of execution planning at each decoder step, which could be avoided with a single ONNX file. Plus, having a single ONNX file is helpful on its own.

yufenglee commented 2 years ago

Looks like your scenario can be supported with customized beam search: https://github.com/ViswanathB/ORTCustomBeamsearchOpLibrary.

dashesy commented 2 years ago

Good one to take a look at. Still, I do not know why I could not use com.microsoft.BeamSearch itself; it looks like it has all the pieces, and I do want to use it for text generation. It just needs more documentation and examples, I think.

tianleiwu commented 2 years ago

@dashesy, as long as your encoder and decoder have the same inputs and outputs as T5 (it is fine if the modeling logic is different), you can use com.microsoft.BeamSearch. You can follow the T5 ONNX conversion script to export your model to two ONNX models. The input and output names and shapes must exactly match those of the T5 ONNX models.

Then use convert_generation.py and specify those two ONNX models as input: https://github.com/microsoft/onnxruntime/blob/eb6aa861cfa7295ee9f7145db44aaec708e8ce1c/onnxruntime/python/tools/transformers/convert_generation.py#L117-L131. You may need to comment out or change some test code in the script.
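For anyone finding this later, a hypothetical invocation might look like the sketch below. The flag names are from memory and may differ between ONNX Runtime releases, so check the script's `--help` (the linked lines define the options for pointing it at pre-exported encoder/decoder ONNX files):

```python
# Hypothetical invocation of convert_generation.py; flag names may differ
# between ONNX Runtime releases, so verify them with `--help`.
import subprocess

subprocess.run(
    [
        "python", "-m", "onnxruntime.transformers.convert_generation",
        "-m", "t5-small",                  # HF model used for config/tokenizer
        "--model_type", "t5",
        "--output", "t5_beam_search.onnx",
        "--num_beams", "4",
    ],
    check=True,
)
```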

dashesy commented 2 years ago

VL models are not exactly the same, but I think they can be adapted to use this. I will look into it.

Encoder input: "image"
Encoder output: "hidden_states"

Decoder inputs: "hidden_states", "input_ids" ("input_mask" is optional because I create the mask from "input_ids" dynamically inside the ONNX graph)
Decoder output: "logits"

We call the encoder just once to get "hidden_states" (usually there is no need to encode text as well, depending on the task). At each step we append the next token (e.g. top-k from "logits") to "input_ids" and pass it to the decoder again (beam search).

So in a way, this is simpler than the T5 model.
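A rough sketch of that loop with two separate sessions, using greedy decoding for brevity (beam search would keep num_beams hypotheses instead of the argmax); file names, the image shape, and the BOS/EOS ids are illustrative assumptions:

```python
# Two-model autoregressive loop as described above (greedy decoding for brevity).
# File names, the image shape, and the BOS/EOS ids are illustrative assumptions.
import numpy as np
import onnxruntime as ort

encoder = ort.InferenceSession("encoder.onnx")
decoder = ort.InferenceSession("decoder.onnx")

image = np.random.rand(1, 3, 224, 224).astype(np.float32)   # placeholder image
(hidden_states,) = encoder.run(["hidden_states"], {"image": image})

BOS, EOS, MAX_LEN = 0, 2, 32
input_ids = np.array([[BOS]], dtype=np.int64)

for _ in range(MAX_LEN):
    (logits,) = decoder.run(
        ["logits"],
        {"hidden_states": hidden_states, "input_ids": input_ids},
    )
    next_token = int(logits[0, -1].argmax())    # logits assumed (batch, seq, vocab)
    input_ids = np.concatenate(
        [input_ids, np.array([[next_token]], dtype=np.int64)], axis=1)
    if next_token == EOS:
        break
```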