Borntowarn opened this issue 1 year ago
The convert_generation.py script supports encoder-decoder models (we tested T5 and BART). See the comments in the script for example usage: https://github.com/microsoft/onnxruntime/blob/b7ae293be05c89a0cb623feec4d2d2cbf006e4c3/onnxruntime/python/tools/transformers/convert_generation.py#L27-L32
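For quick reference, the examples in that header amount to invocations along these lines (a sketch; the exact flags may have changed since, so check `python convert_generation.py --help` for the authoritative list):

```bash
# Convert GPT-2 with beam search fused into the ONNX graph (sketch; verify flags with --help)
python convert_generation.py -m gpt2 --output gpt2_beam_search.onnx

# Convert T5 with beam search (sketch; assumes the T5 ONNX export was created first)
python convert_generation.py -m t5-small --model_type t5 --output t5_small_beam_search.onnx
```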
ORT also supports Whisper with beam search. See https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/whisper/README.md for details.
Perhaps I didn't put it clearly enough, but I need the Bert/T5/GPT2 model to take encoder_hidden_states from a VisionEncoder (image embeddings, for an image-captioning implementation) as an input in this line: https://github.com/microsoft/onnxruntime/blob/eb47008049a7aa0b617340bf2372723d0e873752/onnxruntime/python/tools/transformers/convert_generation.py#L702-L704
I guess you need to add it in https://github.com/microsoft/onnxruntime/blob/eb47008049a7aa0b617340bf2372723d0e873752/onnxruntime/core/graph/contrib_ops/contrib_defs.cc#L1155-L1170
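To make the request concrete: the graph inputs at the linked lines are built with onnx.helper.make_tensor_value_info, so the extra input I have in mind would look something like the sketch below (the input name and symbolic dims are my assumptions, and the BeamSearch schema in contrib_defs.cc would need a matching optional input):

```python
from onnx import TensorProto, helper

# Sketch only: declare encoder_hidden_states as an additional graph input
# for the fused BeamSearch model (name and symbolic dims are assumptions).
encoder_hidden_states = helper.make_tensor_value_info(
    "encoder_hidden_states",
    TensorProto.FLOAT,
    ["batch_size", "encode_sequence_length", "hidden_size"],
)

# ...which would then be appended to the input list built at the linked lines, e.g.:
# graph_inputs.append(encoder_hidden_states)
```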
Describe the feature request
I am trying to use the convert_generation.py script to create a GPT2 generation model with beam search that takes encoder_hidden_states (the TimeSformer output) as an input; my base model is Neleac/timesformer-gpt2-video-captioning. But there is no such flag in the script and no such node input in the graph, so GPT2 is converted as a separate model with no link to the TimeSformer output.
So I was wondering whether there are any plans to implement this option. I have tried manually manipulating the graph and the script, to no avail.
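For context, this is roughly the export I can do today as a workaround: wrapping the decoder so encoder_hidden_states becomes an explicit ONNX input (a sketch; the GPT2 decoder of this checkpoint is built with cross-attention, and the dummy shapes and opset below are assumptions):

```python
import torch
from transformers import VisionEncoderDecoderModel

class DecoderWithCrossAttention(torch.nn.Module):
    """Wraps the GPT2 decoder so encoder_hidden_states is an explicit input."""

    def __init__(self, decoder):
        super().__init__()
        self.decoder = decoder

    def forward(self, input_ids, encoder_hidden_states):
        # A GPT2 decoder created inside a VisionEncoderDecoderModel has
        # cross-attention enabled, so it accepts encoder_hidden_states.
        return self.decoder(
            input_ids=input_ids,
            encoder_hidden_states=encoder_hidden_states,
        ).logits

model = VisionEncoderDecoderModel.from_pretrained("Neleac/timesformer-gpt2-video-captioning")
wrapper = DecoderWithCrossAttention(model.decoder.eval())

dummy_ids = torch.ones((1, 4), dtype=torch.long)
# Placeholder encoder sequence length; the real value comes from the TimeSformer output.
dummy_states = torch.randn(1, 196, model.decoder.config.n_embd)

torch.onnx.export(
    wrapper,
    (dummy_ids, dummy_states),
    "gpt2_decoder_with_cross_attn.onnx",
    input_names=["input_ids", "encoder_hidden_states"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "encoder_hidden_states": {0: "batch", 1: "enc_seq"},
        "logits": {0: "batch", 1: "seq"},
    },
    opset_version=17,
)
```

This exports the decoder alone; what is missing is a way for convert_generation.py to thread that input through the fused BeamSearch graph.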
Describe scenario use case
Usage of encoder-decoder models (such as SpeechEncoderDecoderModel or VisionEncoderDecoderModel from Hugging Face).