Borntowarn opened this issue 1 year ago
The convert_generation.py script supports encoder-decoder models (we tested T5 and BART). See the comments in the script for example usage: https://github.com/microsoft/onnxruntime/blob/b7ae293be05c89a0cb623feec4d2d2cbf006e4c3/onnxruntime/python/tools/transformers/convert_generation.py#L27-L32
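For quick reference, the examples in that header amount to invocations along these lines (a sketch; the exact flags may have changed since, so check `python convert_generation.py --help` for the authoritative list):

```bash
# Convert GPT-2 with beam search fused into the ONNX graph (sketch; verify flags with --help)
python convert_generation.py -m gpt2 --output gpt2_beam_search.onnx

# Convert T5 with beam search (sketch; assumes the T5 ONNX export was created first)
python convert_generation.py -m t5-small --model_type t5 --output t5_small_beam_search.onnx
```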
ORT also supports Whisper with beam search. See https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/whisper/README.md for details.
Perhaps I didn't put it clearly enough, but I need the Bert/T5/GPT2 model to take encoder_hidden_states from a VisionEncoder (image embeddings, for an image-captioning implementation) as an input in this line: https://github.com/microsoft/onnxruntime/blob/eb47008049a7aa0b617340bf2372723d0e873752/onnxruntime/python/tools/transformers/convert_generation.py#L702-L704
I guess you need to add it in https://github.com/microsoft/onnxruntime/blob/eb47008049a7aa0b617340bf2372723d0e873752/onnxruntime/core/graph/contrib_ops/contrib_defs.cc#L1155-L1170
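To make the request concrete: the graph inputs at the linked lines are built with onnx.helper.make_tensor_value_info, so the extra input I have in mind would look something like the sketch below (the input name and symbolic dims are my assumptions, and the BeamSearch schema in contrib_defs.cc would need a matching optional input):

```python
from onnx import TensorProto, helper

# Sketch only: declare encoder_hidden_states as an additional graph input
# for the fused BeamSearch model (name and symbolic dims are assumptions).
encoder_hidden_states = helper.make_tensor_value_info(
    "encoder_hidden_states",
    TensorProto.FLOAT,
    ["batch_size", "encode_sequence_length", "hidden_size"],
)

# ...which would then be appended to the input list built at the linked lines, e.g.:
# graph_inputs.append(encoder_hidden_states)
```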
Describe the feature request
I am trying to use the convert_generation.py script to create a GPT2 generation model with beam search that takes encoder_hidden_states (the TimeSformer output) as an input; my base model is Neleac/timesformer-gpt2-video-captioning. But there is no such flag in the script and no such node input in the graph, so GPT2 is converted as a separate model with no link to the TimeSformer output.
So I was wondering whether there are any plans to implement this option. I have tried manually manipulating the graph and the script, to no avail.
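For context, this is roughly the export I can do today as a workaround: wrapping the decoder so encoder_hidden_states becomes an explicit ONNX input (a sketch; the GPT2 decoder of this checkpoint is built with cross-attention, and the dummy shapes and opset below are assumptions):

```python
import torch
from transformers import VisionEncoderDecoderModel

class DecoderWithCrossAttention(torch.nn.Module):
    """Wraps the GPT2 decoder so encoder_hidden_states is an explicit input."""

    def __init__(self, decoder):
        super().__init__()
        self.decoder = decoder

    def forward(self, input_ids, encoder_hidden_states):
        # A GPT2 decoder created inside a VisionEncoderDecoderModel has
        # cross-attention enabled, so it accepts encoder_hidden_states.
        return self.decoder(
            input_ids=input_ids,
            encoder_hidden_states=encoder_hidden_states,
        ).logits

model = VisionEncoderDecoderModel.from_pretrained("Neleac/timesformer-gpt2-video-captioning")
wrapper = DecoderWithCrossAttention(model.decoder.eval())

dummy_ids = torch.ones((1, 4), dtype=torch.long)
# Placeholder encoder sequence length; the real value comes from the TimeSformer output.
dummy_states = torch.randn(1, 196, model.decoder.config.n_embd)

torch.onnx.export(
    wrapper,
    (dummy_ids, dummy_states),
    "gpt2_decoder_with_cross_attn.onnx",
    input_names=["input_ids", "encoder_hidden_states"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "encoder_hidden_states": {0: "batch", 1: "enc_seq"},
        "logits": {0: "batch", 1: "seq"},
    },
    opset_version=17,
)
```

This exports the decoder alone; what is missing is a way for convert_generation.py to thread that input through the fused BeamSearch graph.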
Describe scenario use case
Usage of encoder-decoder models (such as SpeechEncoderDecoderModel or VisionEncoderDecoderModel from Hugging Face).