microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Feature Request] GPU Graph optimization for Flan-T5-Large #14886

Open Matthieu-Tinycoaching opened 1 year ago

Matthieu-Tinycoaching commented 1 year ago

Describe the feature request

Would it be possible to add GPU graph optimizations for Flan-T5-Large model?

Describe scenario use case

Currently, after exporting the model to ONNX, I tried to optimize it with ORTOptimizer as below:

from pathlib import Path

from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

onnx_path = Path("./flan-t5-large")

# Create ORTOptimizer from the previously exported model
optimizer = ORTOptimizer.from_pretrained(ort_model)

# Define the optimization strategy by creating the appropriate configuration
optimization_config = OptimizationConfig(optimization_level=1,
                                        optimize_for_gpu=True,
                                        fp16=True
                                        )

# Optimize the model
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)
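For context, ort_model above is assumed to be the Flan-T5 model previously exported to ONNX with optimum; a minimal sketch of that step (the export keyword differs between optimum versions, so treat this as an assumption):

from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Export google/flan-t5-large to ONNX; older optimum releases use
# from_transformers=True instead of export=True.
ort_model = ORTModelForSeq2SeqLM.from_pretrained("google/flan-t5-large", export=True)
ort_model.save_pretrained("./flan-t5-large")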

I got the following error message:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In [10], line 16
     10 optimization_config = OptimizationConfig(optimization_level=1,
     11                                         optimize_for_gpu=True,
     12                                         fp16=True
     13                                         )
     15 # Optimize the model
---> 16 optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)

File ~/anaconda3/envs/optimum_gpu_py3.8/lib/python3.8/site-packages/optimum/onnxruntime/optimization.py:128, in ORTOptimizer.optimize(self, optimization_config, save_dir, file_suffix, use_external_data_format, one_external_file)
    126 save_dir = Path(save_dir)
    127 save_dir.mkdir(parents=True, exist_ok=True)
--> 128 ORTConfigManager.check_optimization_supported_model(self.model_type)
    130 self.config.save_pretrained(save_dir)
    131 maybe_save_preprocessors(self.onnx_model_path[0].parent, save_dir)

File ~/anaconda3/envs/optimum_gpu_py3.8/lib/python3.8/site-packages/optimum/onnxruntime/utils.py:120, in ORTConfigManager.check_optimization_supported_model(cls, model_type)
    118 supported_model_types_for_optimization = ["bert", "gpt2", "bart"]
    119 if (model_type not in cls._conf) or (cls._conf[model_type] not in supported_model_types_for_optimization):
--> 120     raise KeyError(
    121         f"ONNX Runtime doesn't support the graph optimization of {model_type} yet. Only {supported_model_types_for_optimization} are supported. "
    122         f"If you want to support {model_type} please propose a PR or open up an issue in ONNX Runtime:https://github.com/microsoft/onnxruntime."
    123     )

KeyError: "ONNX Runtime doesn't support the graph optimization of t5 yet. Only ['bert', 'gpt2', 'bart'] are supported. If you want to support t5 please propose a PR or open up an issue in ONNX Runtime:https://github.com/microsoft/onnxruntime."

tianleiwu commented 1 year ago

@wangyems, could you take a look?

wangyems commented 1 year ago

The T5 optimization is in progress. We will take a look at this model then.

argideritzalpea commented 1 year ago

@wangyems Thanks for looking into this. Do you know if there is an estimation for when this feature might be integrated into onnxruntime?

wangyems commented 1 year ago

Hi @Matthieu-Tinycoaching and @argideritzalpea, we have checked in the major part of the optimizations for T5 and mT5. I just found that there are minor graph discrepancies that block the optimization script from targeting Flan-T5. A fix should be available in a week or so.

wangyems commented 1 year ago

You can pull this branch and use convert_to_onnx.py to export Flan-T5 graphs.

We also have a one-line command to export Flan-T5 with beam search using convert_generation.py. A beam search model with decoder and encoder graphs will be generated under /google:

python convert_generation.py -m google/flan-t5-large --model_type t5 --output flan-t5-beamsearch.onnx -e --use_gpu

For T5 inference, you'll need to use an ORT nightly build.
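
As a rough sketch (not from this thread), the exported beam-search model can be run directly with onnxruntime. The input names below are what convert_generation.py typically produces; check session.get_inputs() on your own export to confirm them:

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
session = ort.InferenceSession(
    "flan-t5-beamsearch.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Tokenize a prompt and cast to the int32 dtype the exported graph expects
inputs = tokenizer(["translate English to French: The product is released"],
                   return_tensors="np")

ort_inputs = {
    "input_ids": inputs["input_ids"].astype(np.int32),
    "max_length": np.array([64], dtype=np.int32),
    "min_length": np.array([1], dtype=np.int32),
    "num_beams": np.array([4], dtype=np.int32),
    "num_return_sequences": np.array([1], dtype=np.int32),
    "length_penalty": np.array([1.0], dtype=np.float32),
    "repetition_penalty": np.array([1.0], dtype=np.float32),
}

# sequences has shape (batch_size, num_return_sequences, max_length)
sequences = session.run(None, ort_inputs)[0]
print(tokenizer.batch_decode(sequences[0], skip_special_tokens=True))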

Taytay commented 1 year ago

@wangyems: Thank you! Is it still the case that I need another branch for this? Also, do I still need the nightly build for inference?

Your branch added some flan models to the list of defaults here: https://github.com/microsoft/onnxruntime/compare/wangye/flan#diff-593772fa94b2b631576ee280ae1c0ee320f532593b9e723c5f3496be6b26647aR28-R37

I don't think they are necessary though, right?

I downloaded the nightly build, and it looks like I can run this on flan-t5-base okay:

python -m onnxruntime.transformers.convert_generation -m google/flan-t5-base --model_type t5 -e --use_gpu --output ./output/models/t5/onnx_models/flan_t5_base_beam_search.onnx --output_sequences_scores --num_beams=5 --num_return_sequences=5 --length_penalty=0

However, it doesn't work for flan-t5-small due to this error:

python -m onnxruntime.transformers.convert_generation -m google/flan-t5-small --model_type t5 -e --use_gpu --output ./output/models/t5/onnx_models/flan_t5_small_beam_search.onnx --output_sequences_scores --num_beams=5 --num_return_sequences=5 --length_penalty=0
**** past_present_share_buffer=False
Convert model google/flan-t5-small to onnx ...
Exporting ONNX model to output/models/t5/onnx_models/google/flan-t5-small_encoder_decoder_init.onnx
/opt/conda/envs/jupyter/lib/python3.10/site-packages/transformers/modeling_utils.py:832: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if causal_mask.shape[1] < attention_mask.shape[1]:
============= Diagnostic Run torch.onnx.export version 2.0.1+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

Delete the existed onnx file: output/models/t5/onnx_models/google/flan-t5-small_encoder_decoder_init.onnx
Delete the existed external data file: output/models/t5/onnx_models/google/flan-t5-small_encoder_decoder_init.onnx.data
Optimizing model to output/models/t5/onnx_models/google/flan-t5-small_encoder_decoder_init_fp32.onnx
Traceback (most recent call last):
  File "/opt/conda/envs/jupyter/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/jupyter/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/jupyter/lib/python3.10/site-packages/onnxruntime/transformers/convert_generation.py", line 2699, in <module>
    main()
  File "/opt/conda/envs/jupyter/lib/python3.10/site-packages/onnxruntime/transformers/convert_generation.py", line 2681, in main
    convert_generation_model(args)
  File "/opt/conda/envs/jupyter/lib/python3.10/site-packages/onnxruntime/transformers/convert_generation.py", line 1731, in convert_generation_model
    t5_to_onnx(args)
  File "/opt/conda/envs/jupyter/lib/python3.10/site-packages/onnxruntime/transformers/convert_generation.py", line 535, in t5_to_onnx
    paths = export_t5_onnx_models(
  File "/opt/conda/envs/jupyter/lib/python3.10/site-packages/onnxruntime/transformers/models/gpt2/../../models/t5/convert_to_onnx.py", line 198, in export_onnx_models
    T5Helper.optimize_onnx(
  File "/opt/conda/envs/jupyter/lib/python3.10/site-packages/onnxruntime/transformers/models/t5/t5_helper.py", line 248, in optimize_onnx
    m = optimize_model(
  File "/opt/conda/envs/jupyter/lib/python3.10/site-packages/onnxruntime/transformers/models/gpt2/../../optimizer.py", line 294, in optimize_model
    optimizer = optimize_by_fusion(model, model_type, num_heads, hidden_size, optimization_options)
  File "/opt/conda/envs/jupyter/lib/python3.10/site-packages/onnxruntime/transformers/models/gpt2/../../optimizer.py", line 178, in optimize_by_fusion
    optimizer = optimizer_class(model, num_heads, hidden_size)
  File "/opt/conda/envs/jupyter/lib/python3.10/site-packages/onnxruntime/transformers/models/gpt2/../../onnx_model_t5.py", line 751, in __init__
    super().__init__(model, num_heads, hidden_size)
  File "/opt/conda/envs/jupyter/lib/python3.10/site-packages/onnxruntime/transformers/models/gpt2/../../onnx_model_bert.py", line 51, in __init__
    assert (num_heads == 0 and hidden_size == 0) or (num_heads > 0 and hidden_size % num_heads == 0)
AssertionError
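
For what it's worth, the assertion appears to come from flan-t5-small's head geometry; a quick check against the published config (a sketch, assuming google/flan-t5-small's config values) illustrates it:

from transformers import AutoConfig

# The optimizer asserts hidden_size % num_heads == 0, which does not hold here.
cfg = AutoConfig.from_pretrained("google/flan-t5-small")
print(cfg.d_model, cfg.num_heads, cfg.d_model % cfg.num_heads)
# e.g. 512 6 2 -> 512 is not divisible by 6, so the assert in onnx_model_bert.py fails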

Then I tried again, but this time without the nightly build, using whatever version I get when I install optimum[onnxruntime-gpu], which appears to be onnxruntime-gpu 1.14.1:

python -m onnxruntime.transformers.convert_generation -m google/flan-t5-small --model_type t5 -e --use_gpu --output ./output/models/t5/onnx_models/flan_t5_small_beam_search.onnx --output_sequences_scores --num_beams=5 --num_return_sequences=5 --length_penalty=0
**** past_present_share_buffer=False, is_greedysearch=False
Convert model google/flan-t5-small to onnx ...
Exporting ONNX model to output/models/t5/onnx_models/google/flan-t5-small_encoder_decoder_init.onnx
/opt/conda/envs/jupyter/lib/python3.10/site-packages/transformers/modeling_utils.py:832: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if causal_mask.shape[1] < attention_mask.shape[1]:
============= Diagnostic Run torch.onnx.export version 2.0.1+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

Delete the existed onnx file: output/models/t5/onnx_models/google/flan-t5-small_encoder_decoder_init.onnx
Delete the existed external data file: output/models/t5/onnx_models/google/flan-t5-small_encoder_decoder_init.onnx.data
batch_size=4 encode_sequence_length=11, max_diff=0.013242721557617188
batch_size=1 encode_sequence_length=2, max_diff=7.152557373046875e-06
batch_size=3 encode_sequence_length=1, max_diff=0.0006928443908691406
batch_size=8 encode_sequence_length=5, max_diff=0.019624948501586914
PyTorch and OnnxRuntime results max difference = 0.019624948501586914
PyTorch and OnnxRuntime results are NOT close
Exporting ONNX model to output/models/t5/onnx_models/google/flan-t5-small_decoder.onnx
/opt/conda/envs/jupyter/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:507: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  elif past_key_value.shape[2] != key_value_states.shape[1]:
In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
============= Diagnostic Run torch.onnx.export version 2.0.1+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

Delete the existed onnx file: output/models/t5/onnx_models/google/flan-t5-small_decoder.onnx
Delete the existed external data file: output/models/t5/onnx_models/google/flan-t5-small_decoder.onnx.data
batch_size=4, encode_sequence_length=11, past_decode_sequence_length=3, max_diff=0.013082504272460938
batch_size=1, encode_sequence_length=2, past_decode_sequence_length=5, max_diff=4.76837158203125e-06
batch_size=3, encode_sequence_length=1, past_decode_sequence_length=1, max_diff=0.0006706714630126953
batch_size=8, encode_sequence_length=5, past_decode_sequence_length=2, max_diff=0.012537002563476562
PyTorch and OnnxRuntime results max difference = 0.012537002563476562
PyTorch and OnnxRuntime results are NOT close
T5 encoder graph verified: name and data type of inputs and outputs are good.
26 shared initializers (['s_d_decoder.embed_tokens.weight', 's_d_onnx::MatMul_1366', 's_d_onnx::MatMul_1367', 's_d_onnx::MatMul_1368', 's_d_onnx::MatMul_1391', 's_d_onnx::MatMul_1392', 's_d_onnx::MatMul_1393', 's_d_onnx::MatMul_1416', 's_d_onnx::MatMul_1417', 's_d_onnx::MatMul_1418', 's_d_onnx::MatMul_1441', 's_d_onnx::MatMul_1442', 's_d_onnx::MatMul_1443', 's_d_onnx::MatMul_1466', 's_d_onnx::MatMul_1467', 's_d_onnx::MatMul_1468', 's_d_onnx::MatMul_1491', 's_d_onnx::MatMul_1492', 's_d_onnx::MatMul_1493', 's_d_onnx::MatMul_1516', 's_d_onnx::MatMul_1517', 's_d_onnx::MatMul_1518', 's_d_onnx::MatMul_1541', 's_d_onnx::MatMul_1542', 's_d_onnx::MatMul_1543', 's_d_onnx::MatMul_1544']) in encoder and decoder subgraphs are moved to the main graph
Delete the existed onnx file: ./output/models/t5/onnx_models/flan_t5_small_beam_search.onnx
Delete the existed external data file: ./output/models/t5/onnx_models/flan_t5_small_beam_search.onnx.data
model save to ./output/models/t5/onnx_models/flan_t5_small_beam_search.onnx
start testing model...
--------------------------------------------------
Test PyTorch model and beam search with huggingface transformers...
input_ids tensor([[    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0, 13959,  1566,    12,  2379,    10,
            37,   556,    19,  1883,     1],
        [21603,    10,   585,  3256,    12,   504,    24,  8636,   830,   490,
           533,  1393,    12,    70,  2713,     5,  3985,     3,     9,  1782,
           300,    54,   991,    12,  1364,  1425,    13,  2189,    21,   321,
          3513,    11,  1082,     5,     1]])
huggingface transformers outputs:
sequences tensor([[    0,   312,  5789,   259, 13833,     1,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [    0,   312,  5789,   259, 13833,     5,     1,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [    0,   325,   556,   259, 27549,   721,     5,     1,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [    0,   325,   556,   259, 13833,    15,     1,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [    0,   325,   556,   259, 13833,    15,     5,     1,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [    0,     3, 18536,     3,     9,  1782,   300,    19,     3,     9,
           248,   194,    12,  1428,  2189,     5,     1,     0,     0,     0],
        [    0,     3, 18536,     3,     9,  1782,   300,    54,    36,     3,
             9,   248,   194,    12,  1428,  2189,     5,     1,     0,     0],
        [    0,     3, 18536,     3,     9,  1782,   300,    54,    36,     3,
             9,   248,   194,    12,  4888,    39,  1879,   533,     5,     1],
        [    0,     3, 18536,     3,     9,  1782,   300,    19,     3,     9,
           248,   194,    12,  1428,  2189,    11,  2189,     5,     1,     0],
        [    0,     3, 18536,     3,     9,  1782,   300,    54,    36,     3,
             9,   248,   194,    12,  4888,    39,  1879, 19016,     5,     1]])
sequences_scores tensor([ -4.6146,  -4.8925,  -5.4252,  -5.4902,  -5.7130, -17.0220, -18.1770,
        -19.5306, -19.7267, -20.4909])
0: Le produit est publié
1: Le produit est publié.
2: La product est libérée.
3: La product est publiée
4: La product est publiée.
5: Keeping a dog around is a great way to reduce stress.
6: Keeping a dog around can be a great way to reduce stress.
7: Keeping a dog around can be a great way to boost your overall health.
8: Keeping a dog around is a great way to reduce stress and stress.
9: Keeping a dog around can be a great way to boost your overall wellbeing.
--------------------------------------------------
Testing beam search with onnxruntime...
use CUDAExecutionProvider
ORT outputs:
sequences [[[    0   312  5789   259 13833     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0   312  5789   259 13833     5     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0   312  5789   259 13833    15     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0   312  5789   259 13833    15     5     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0   312  5789   259 13833     3    85     3    40    31   154
    9456   257     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]]

 [[    0     3 18536     3     9  1782    19     3     9   248   194
      12  1428  2189     5     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0     3 18536     3     9  1782    19     3     9   248   194
      12  1428  2189    21   321  3513    11  1082     5     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0     3 18536     3     9  1782    19     3     9   248   194
      12  1428  2189    21   321  3513    11   502     5     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0     3 18536     3     9  1782    19     3     9   248   194
      12  1428  2189    21   321  3513    11  1082  9391     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0     3 18536     3     9  1782    19     3     9   248   194
      12  1428  2189    21   321  3513    11  1082  9391     5     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]]]
sequences_scores [[ -47.827347  -57.368763  -57.407936  -66.94655  -124.13513 ]
 [-143.51146  -190.83138  -190.88016  -191.01337  -200.34917 ]]
batch 0 sequence 0: Le produit est publié
batch 0 sequence 1: Le produit est publié.
batch 0 sequence 2: Le produit est publiée
batch 0 sequence 3: Le produit est publiée.
batch 0 sequence 4: Le produit est publié à l'élaboration
batch 1 sequence 0: Keeping a dog is a great way to reduce stress.
batch 1 sequence 1: Keeping a dog is a great way to reduce stress for both adults and kids.
batch 1 sequence 2: Keeping a dog is a great way to reduce stress for both adults and children.
batch 1 sequence 3: Keeping a dog is a great way to reduce stress for both adults and kids alike
batch 1 sequence 4: Keeping a dog is a great way to reduce stress for both adults and kids alike.
--------------------------------------------------
Torch Sequences:
tensor([[[    0,   312,  5789,   259, 13833,     1,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   312,  5789,   259, 13833,     5,     1,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   325,   556,   259, 27549,   721,     5,     1,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   325,   556,   259, 13833,    15,     1,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   325,   556,   259, 13833,    15,     5,     1,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0]],

        [[    0,     3, 18536,     3,     9,  1782,   300,    19,     3,     9,
            248,   194,    12,  1428,  2189,     5,     1,     0,     0,     0],
         [    0,     3, 18536,     3,     9,  1782,   300,    54,    36,     3,
              9,   248,   194,    12,  1428,  2189,     5,     1,     0,     0],
         [    0,     3, 18536,     3,     9,  1782,   300,    54,    36,     3,
              9,   248,   194,    12,  4888,    39,  1879,   533,     5,     1],
         [    0,     3, 18536,     3,     9,  1782,   300,    19,     3,     9,
            248,   194,    12,  1428,  2189,    11,  2189,     5,     1,     0],
         [    0,     3, 18536,     3,     9,  1782,   300,    54,    36,     3,
              9,   248,   194,    12,  4888,    39,  1879, 19016,     5,     1]]])
['Le produit est publié', 'Le produit est publié.', 'La product est libérée.', 'La product est publiée', 'La product est publiée.', 'Keeping a dog around is a great way to reduce stress.', 'Keeping a dog around can be a great way to reduce stress.', 'Keeping a dog around can be a great way to boost your overall health.', 'Keeping a dog around is a great way to reduce stress and stress.', 'Keeping a dog around can be a great way to boost your overall wellbeing.']
--------------------------------------------------
ORT Sequences:
tensor([[[    0,   312,  5789,   259, 13833,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   312,  5789,   259, 13833,     5,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   312,  5789,   259, 13833,    15,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   312,  5789,   259, 13833,    15,     5,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   312,  5789,   259, 13833,     3,    85,     3,    40,    31,
            154,  9456,   257,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0]],

        [[    0,     3, 18536,     3,     9,  1782,    19,     3,     9,   248,
            194,    12,  1428,  2189,     5,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,     3, 18536,     3,     9,  1782,    19,     3,     9,   248,
            194,    12,  1428,  2189,    21,   321,  3513,    11,  1082,     5,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,     3, 18536,     3,     9,  1782,    19,     3,     9,   248,
            194,    12,  1428,  2189,    21,   321,  3513,    11,   502,     5,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,     3, 18536,     3,     9,  1782,    19,     3,     9,   248,
            194,    12,  1428,  2189,    21,   321,  3513,    11,  1082,  9391,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,     3, 18536,     3,     9,  1782,    19,     3,     9,   248,
            194,    12,  1428,  2189,    21,   321,  3513,    11,  1082,  9391,
              5,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0]]])
['Le produit est publié', 'Le produit est publié.', 'Le produit est publiée', 'Le produit est publiée.', "Le produit est publié à l'élaboration", 'Keeping a dog is a great way to reduce stress.', 'Keeping a dog is a great way to reduce stress for both adults and kids.', 'Keeping a dog is a great way to reduce stress for both adults and children.', 'Keeping a dog is a great way to reduce stress for both adults and kids alike', 'Keeping a dog is a great way to reduce stress for both adults and kids alike.']
--------------------------------------------------
Torch and ORT result is  different
ORT {'test_times': 1, 'latency_variance': '0.00', 'latency_90_percentile': '282.23', 'latency_95_percentile': '282.23', 'latency_99_percentile': '282.23', 'average_latency_ms': '282.23', 'QPS': '7.09', 'parity': False}
Output files: ./output/models/t5/onnx_models/flan_t5_small_beam_search.onnx, ./output/models/t5/onnx_models/flan_t5_small_beam_search.onnx.data
(jupyter) ubuntu@ip-172-31-3-120:~/sky_workdir$ ls ./output/models/t5/onnx_models/
flan_t5_base_beam_search.onnx  flan_t5_base_beam_search.onnx.data  flan_t5_small_beam_search.onnx  flan_t5_small_beam_search.onnx.data  google
(jupyter) ubuntu@ip-172-31-3-120:~/sky_workdir$ rm -rf ./output/models/t5/onnx_models/
(jupyter) ubuntu@ip-172-31-3-120:~/sky_workdir$ python -m onnxruntime.transformers.convert_generation -m google/flan-t5-small --model_type t5 -e --use_gpu --output ./output/models/t5/onnx_models/flan_t5_small_beam_search.onnx --output_sequences_scores --num_beams=5 --num_return_sequences=5 --length_penalty=0
**** past_present_share_buffer=False, is_greedysearch=False
Convert model google/flan-t5-small to onnx ...
Exporting ONNX model to output/models/t5/onnx_models/google/flan-t5-small_encoder_decoder_init.onnx
/opt/conda/envs/jupyter/lib/python3.10/site-packages/transformers/modeling_utils.py:832: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if causal_mask.shape[1] < attention_mask.shape[1]:
============= Diagnostic Run torch.onnx.export version 2.0.1+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

batch_size=4 encode_sequence_length=11, max_diff=0.015545368194580078
batch_size=1 encode_sequence_length=2, max_diff=4.291534423828125e-06
batch_size=3 encode_sequence_length=1, max_diff=0.0006568431854248047
batch_size=8 encode_sequence_length=5, max_diff=0.01646888256072998
PyTorch and OnnxRuntime results max difference = 0.01646888256072998
PyTorch and OnnxRuntime results are NOT close
Exporting ONNX model to output/models/t5/onnx_models/google/flan-t5-small_decoder.onnx
/opt/conda/envs/jupyter/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:507: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  elif past_key_value.shape[2] != key_value_states.shape[1]:
In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
============= Diagnostic Run torch.onnx.export version 2.0.1+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

batch_size=4, encode_sequence_length=11, past_decode_sequence_length=3, max_diff=0.008085250854492188
batch_size=1, encode_sequence_length=2, past_decode_sequence_length=5, max_diff=7.62939453125e-06
batch_size=3, encode_sequence_length=1, past_decode_sequence_length=1, max_diff=0.00025200843811035156
batch_size=8, encode_sequence_length=5, past_decode_sequence_length=2, max_diff=0.01223444938659668
PyTorch and OnnxRuntime results max difference = 0.01223444938659668
PyTorch and OnnxRuntime results are NOT close
T5 encoder graph verified: name and data type of inputs and outputs are good.
26 shared initializers (['s_d_decoder.embed_tokens.weight', 's_d_onnx::MatMul_1366', 's_d_onnx::MatMul_1367', 's_d_onnx::MatMul_1368', 's_d_onnx::MatMul_1391', 's_d_onnx::MatMul_1392', 's_d_onnx::MatMul_1393', 's_d_onnx::MatMul_1416', 's_d_onnx::MatMul_1417', 's_d_onnx::MatMul_1418', 's_d_onnx::MatMul_1441', 's_d_onnx::MatMul_1442', 's_d_onnx::MatMul_1443', 's_d_onnx::MatMul_1466', 's_d_onnx::MatMul_1467', 's_d_onnx::MatMul_1468', 's_d_onnx::MatMul_1491', 's_d_onnx::MatMul_1492', 's_d_onnx::MatMul_1493', 's_d_onnx::MatMul_1516', 's_d_onnx::MatMul_1517', 's_d_onnx::MatMul_1518', 's_d_onnx::MatMul_1541', 's_d_onnx::MatMul_1542', 's_d_onnx::MatMul_1543', 's_d_onnx::MatMul_1544']) in encoder and decoder subgraphs are moved to the main graph
model save to ./output/models/t5/onnx_models/flan_t5_small_beam_search.onnx
start testing model...
--------------------------------------------------
Test PyTorch model and beam search with huggingface transformers...
input_ids tensor([[    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0, 13959,  1566,    12,  2379,    10,
            37,   556,    19,  1883,     1],
        [21603,    10,   585,  3256,    12,   504,    24,  8636,   830,   490,
           533,  1393,    12,    70,  2713,     5,  3985,     3,     9,  1782,
           300,    54,   991,    12,  1364,  1425,    13,  2189,    21,   321,
          3513,    11,  1082,     5,     1]])
huggingface transformers outputs:
sequences tensor([[    0,   312,  5789,   259, 13833,     1,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [    0,   312,  5789,   259, 13833,     5,     1,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [    0,   325,   556,   259, 27549,   721,     5,     1,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [    0,   325,   556,   259, 13833,    15,     1,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [    0,   325,   556,   259, 13833,    15,     5,     1,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [    0,     3, 18536,     3,     9,  1782,   300,    19,     3,     9,
           248,   194,    12,  1428,  2189,     5,     1,     0,     0,     0],
        [    0,     3, 18536,     3,     9,  1782,   300,    54,    36,     3,
             9,   248,   194,    12,  1428,  2189,     5,     1,     0,     0],
        [    0,     3, 18536,     3,     9,  1782,   300,    54,    36,     3,
             9,   248,   194,    12,  4888,    39,  1879,   533,     5,     1],
        [    0,     3, 18536,     3,     9,  1782,   300,    19,     3,     9,
           248,   194,    12,  1428,  2189,    11,  2189,     5,     1,     0],
        [    0,     3, 18536,     3,     9,  1782,   300,    54,    36,     3,
             9,   248,   194,    12,  4888,    39,  1879, 19016,     5,     1]])
sequences_scores tensor([ -4.6146,  -4.8925,  -5.4252,  -5.4902,  -5.7130, -17.0220, -18.1770,
        -19.5306, -19.7267, -20.4909])
0: Le produit est publié
1: Le produit est publié.
2: La product est libérée.
3: La product est publiée
4: La product est publiée.
5: Keeping a dog around is a great way to reduce stress.
6: Keeping a dog around can be a great way to reduce stress.
7: Keeping a dog around can be a great way to boost your overall health.
8: Keeping a dog around is a great way to reduce stress and stress.
9: Keeping a dog around can be a great way to boost your overall wellbeing.
--------------------------------------------------
Testing beam search with onnxruntime...
use CUDAExecutionProvider
ORT outputs:
sequences [[[    0   312  5789   259 13833     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0   312  5789   259 13833     5     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0   312  5789   259 13833    15     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0   312  5789   259 13833    15     5     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0   312  5789   259 13833     3    85     3    40    31   154
    9456   257     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]]

 [[    0     3 18536     3     9  1782    19     3     9   248   194
      12  1428  2189     5     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0     3 18536     3     9  1782    19     3     9   248   194
      12  1428  2189    21   321  3513    11  1082     5     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0     3 18536     3     9  1782    19     3     9   248   194
      12  1428  2189    21   321  3513    11   502     5     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0     3 18536     3     9  1782    19     3     9   248   194
      12  1428  2189    21   321  3513    11  1082  9391     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]
  [    0     3 18536     3     9  1782    19     3     9   248   194
      12  1428  2189    21   321  3513    11  1082  9391     5     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0     0     0     0     0     0
       0     0     0     0     0     0]]]
sequences_scores [[ -47.827347  -57.368763  -57.407936  -66.94655  -124.13513 ]
 [-143.51146  -190.83138  -190.88016  -191.01337  -200.34917 ]]
batch 0 sequence 0: Le produit est publié
batch 0 sequence 1: Le produit est publié.
batch 0 sequence 2: Le produit est publiée
batch 0 sequence 3: Le produit est publiée.
batch 0 sequence 4: Le produit est publié à l'élaboration
batch 1 sequence 0: Keeping a dog is a great way to reduce stress.
batch 1 sequence 1: Keeping a dog is a great way to reduce stress for both adults and kids.
batch 1 sequence 2: Keeping a dog is a great way to reduce stress for both adults and children.
batch 1 sequence 3: Keeping a dog is a great way to reduce stress for both adults and kids alike
batch 1 sequence 4: Keeping a dog is a great way to reduce stress for both adults and kids alike.
--------------------------------------------------
Torch Sequences:
tensor([[[    0,   312,  5789,   259, 13833,     1,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   312,  5789,   259, 13833,     5,     1,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   325,   556,   259, 27549,   721,     5,     1,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   325,   556,   259, 13833,    15,     1,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   325,   556,   259, 13833,    15,     5,     1,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0]],

        [[    0,     3, 18536,     3,     9,  1782,   300,    19,     3,     9,
            248,   194,    12,  1428,  2189,     5,     1,     0,     0,     0],
         [    0,     3, 18536,     3,     9,  1782,   300,    54,    36,     3,
              9,   248,   194,    12,  1428,  2189,     5,     1,     0,     0],
         [    0,     3, 18536,     3,     9,  1782,   300,    54,    36,     3,
              9,   248,   194,    12,  4888,    39,  1879,   533,     5,     1],
         [    0,     3, 18536,     3,     9,  1782,   300,    19,     3,     9,
            248,   194,    12,  1428,  2189,    11,  2189,     5,     1,     0],
         [    0,     3, 18536,     3,     9,  1782,   300,    54,    36,     3,
              9,   248,   194,    12,  4888,    39,  1879, 19016,     5,     1]]])
['Le produit est publié', 'Le produit est publié.', 'La product est libérée.', 'La product est publiée', 'La product est publiée.', 'Keeping a dog around is a great way to reduce stress.', 'Keeping a dog around can be a great way to reduce stress.', 'Keeping a dog around can be a great way to boost your overall health.', 'Keeping a dog around is a great way to reduce stress and stress.', 'Keeping a dog around can be a great way to boost your overall wellbeing.']
--------------------------------------------------
ORT Sequences:
tensor([[[    0,   312,  5789,   259, 13833,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   312,  5789,   259, 13833,     5,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   312,  5789,   259, 13833,    15,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   312,  5789,   259, 13833,    15,     5,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,   312,  5789,   259, 13833,     3,    85,     3,    40,    31,
            154,  9456,   257,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0]],

        [[    0,     3, 18536,     3,     9,  1782,    19,     3,     9,   248,
            194,    12,  1428,  2189,     5,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,     3, 18536,     3,     9,  1782,    19,     3,     9,   248,
            194,    12,  1428,  2189,    21,   321,  3513,    11,  1082,     5,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,     3, 18536,     3,     9,  1782,    19,     3,     9,   248,
            194,    12,  1428,  2189,    21,   321,  3513,    11,   502,     5,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,     3, 18536,     3,     9,  1782,    19,     3,     9,   248,
            194,    12,  1428,  2189,    21,   321,  3513,    11,  1082,  9391,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
         [    0,     3, 18536,     3,     9,  1782,    19,     3,     9,   248,
            194,    12,  1428,  2189,    21,   321,  3513,    11,  1082,  9391,
              5,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0]]])
['Le produit est publié', 'Le produit est publié.', 'Le produit est publiée', 'Le produit est publiée.', "Le produit est publié à l'élaboration", 'Keeping a dog is a great way to reduce stress.', 'Keeping a dog is a great way to reduce stress for both adults and kids.', 'Keeping a dog is a great way to reduce stress for both adults and children.', 'Keeping a dog is a great way to reduce stress for both adults and kids alike', 'Keeping a dog is a great way to reduce stress for both adults and kids alike.']
--------------------------------------------------
Torch and ORT result is  different
ORT {'test_times': 1, 'latency_variance': '0.00', 'latency_90_percentile': '248.05', 'latency_95_percentile': '248.05', 'latency_99_percentile': '248.05', 'average_latency_ms': '248.05', 'QPS': '8.06', 'parity': False}
Output files: ./output/models/t5/onnx_models/flan_t5_small_beam_search.onnx, ./output/models/t5/onnx_models/flan_t5_small_beam_search.onnx.data

I presume that the difference between the Torch and ORT results is a legitimate error?