microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

T5-Small different output for decoder inference with CPU and DirectML EPs #22896

Open r4ghu opened 2 days ago

r4ghu commented 2 days ago

Describe the issue

Hi team, I am currently running T5-Small model inference using ONNX Runtime. The model I am using is https://huggingface.co/Xenova/t5-small/tree/main/onnx

I tested the same model on CPU and DirectML execution providers and observed different outputs for the same input during the decoding stage.

I am attaching some CPU vs. DirectML comparison results for reference:

=== Comparing Encoder Outputs ===

Comparing Encoder outputs:
Shapes: (1, 12, 512) vs (1, 12, 512)

Statistics for first array:
  mean: -0.002746098442003131
  std: 0.12785771489143372
  min: -0.5774061679840088
  max: 0.5452761054039001
  abs_max: 0.5774061679840088
  has_nan: False
  has_inf: False

Statistics for second array:
  mean: -0.00274610030464828
  std: 0.1278577446937561
  min: -0.5774062871932983
  max: 0.5452762246131897
  abs_max: 0.5774062871932983
  has_nan: False
  has_inf: False

Difference analysis:
  Maximum absolute difference: 5.736947059631348e-07
  Mean absolute difference: 5.666575475515856e-08
  Maximum relative difference: 0.07109003514051437
  Position of max difference: (np.int64(0), np.int64(1), np.int64(401))
✅ Differences within acceptable threshold (1e-05)

=== Comparing Decoder Outputs ===

Comparing Decoder logits:
Shapes: (1, 1, 32128) vs (1, 1, 32128)

Statistics for first array:
  mean: -19.10366439819336
  std: 4.460851669311523
  min: -43.21986389160156
  max: -1.202622890472412
  abs_max: 43.21986389160156
  has_nan: False
  has_inf: False

Statistics for second array:
  mean: -19.10366439819336
  std: 4.460851669311523
  min: -43.21989059448242
  max: -1.2026221752166748
  abs_max: 43.21989059448242
  has_nan: False
  has_inf: False

Difference analysis:
  Maximum absolute difference: 5.7220458984375e-05
  Mean absolute difference: 7.175476639531553e-06
  Maximum relative difference: 2.00232352653984e-06
  Position of max difference: (np.int64(0), np.int64(0), np.int64(32113))
❌ Large difference detected! (> 1e-05)

Values at maximum difference point:
  Array1: -43.13878631591797
  Array2: -43.13884353637695

Surrounding values (if available):
  Array1 at [np.int64(0), np.int64(0), np.int64(32112)]: -43.058406829833984
  Array2 at [np.int64(0), np.int64(0), np.int64(32112)]: -43.058406829833984
  Array1 at [np.int64(0), np.int64(0), np.int64(32114)]: -43.1171760559082
  Array2 at [np.int64(0), np.int64(0), np.int64(32114)]: -43.11715316772461
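
For completeness, the statistics above can be reproduced with a small NumPy helper along these lines (a minimal sketch; the function name and layout are illustrative, not the exact script I ran):

```python
import numpy as np

def compare_outputs(a, b, name, atol=1e-5):
    # Illustrative reconstruction of the comparison printed above.
    print(f"Comparing {name}:")
    print(f"Shapes: {a.shape} vs {b.shape}")
    for label, arr in (("first", a), ("second", b)):
        print(f"\nStatistics for {label} array:")
        print(f"  mean: {arr.mean()}")
        print(f"  std: {arr.std()}")
        print(f"  min: {arr.min()}")
        print(f"  max: {arr.max()}")
        print(f"  abs_max: {np.abs(arr).max()}")
        print(f"  has_nan: {np.isnan(arr).any()}")
        print(f"  has_inf: {np.isinf(arr).any()}")
    diff = np.abs(a.astype(np.float64) - b.astype(np.float64))
    rel = diff / (np.abs(b.astype(np.float64)) + 1e-12)  # epsilon avoids division by zero
    pos = np.unravel_index(diff.argmax(), diff.shape)
    print("\nDifference analysis:")
    print(f"  Maximum absolute difference: {diff.max()}")
    print(f"  Mean absolute difference: {diff.mean()}")
    print(f"  Maximum relative difference: {rel.max()}")
    print(f"  Position of max difference: {pos}")
    if diff.max() > atol:
        print(f"❌ Large difference detected! (> {atol})")
    else:
        print(f"✅ Differences within acceptable threshold ({atol})")
```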

To reproduce

Please run the above-mentioned model's encoder and decoder with both the CPU and DirectML execution providers and compare the outputs, as in the sketch below.
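
A minimal sketch of the setup (assuming the encoder_model.onnx / decoder_model.onnx files from the Hugging Face folder linked above and the usual Optimum export input names, which may differ for other exports):

```python
import numpy as np
import onnxruntime as ort

MODEL_DIR = "t5-small/onnx"  # local copy of the Hugging Face folder linked above

def make_sessions(provider):
    enc = ort.InferenceSession(f"{MODEL_DIR}/encoder_model.onnx", providers=[provider])
    dec = ort.InferenceSession(f"{MODEL_DIR}/decoder_model.onnx", providers=[provider])
    return enc, dec

def encode_decode_once(enc, dec, input_ids, attention_mask):
    # Encoder pass (input/output names assume the usual Optimum T5 export)
    (hidden,) = enc.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})
    # Single decoder step starting from the T5 decoder start token (pad id 0)
    logits = dec.run(None, {
        "input_ids": np.array([[0]], dtype=np.int64),
        "encoder_attention_mask": attention_mask,
        "encoder_hidden_states": hidden,
    })[0]
    return hidden, logits

# Example token ids; replace with your T5 tokenizer's output for a real prompt (shape (1, seq_len))
input_ids = np.array([[13959, 1566, 12, 2968, 10, 571, 33, 25, 58, 1]], dtype=np.int64)
attention_mask = np.ones_like(input_ids)

cpu_hidden, cpu_logits = encode_decode_once(*make_sessions("CPUExecutionProvider"), input_ids, attention_mask)
dml_hidden, dml_logits = encode_decode_once(*make_sessions("DmlExecutionProvider"), input_ids, attention_mask)

print("encoder max abs diff:", np.abs(cpu_hidden - dml_hidden).max())
print("decoder max abs diff:", np.abs(cpu_logits - dml_logits).max())
```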

Urgency

I would like to get this resolved by the end of December 2024.

Platform

Windows

OS Version

Windows 11 Enterprise 22631.4169

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.20.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

DirectML 1.15.4

tianleiwu commented 1 day ago

@r4ghu,

5.7220458984375e-05 does not seem like a large difference for a model. Could you use end-to-end metrics (such as precision/recall) to measure whether it makes any observable difference between CPU and DirectML?
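
For example, one possible end-to-end check (a minimal sketch, assuming the separate encoder/decoder models and the usual Optimum export input names) is to greedy-decode the same prompts on both EPs and compare the generated token sequences or a task metric over a dataset:

```python
import numpy as np
import onnxruntime as ort

def greedy_decode(enc, dec, input_ids, attention_mask, max_len=32):
    # Simple greedy loop over the separate encoder/decoder models (no KV cache)
    (hidden,) = enc.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})
    tokens = [0]  # T5 decoder start token (pad id)
    for _ in range(max_len):
        logits = dec.run(None, {
            "input_ids": np.array([tokens], dtype=np.int64),
            "encoder_attention_mask": attention_mask,
            "encoder_hidden_states": hidden,
        })[0]
        next_id = int(logits[0, -1].argmax())
        tokens.append(next_id)
        if next_id == 1:  # T5 EOS token id
            break
    return tokens

# Compare decoded sequences per prompt (and, over a dataset, task metrics such as BLEU/accuracy):
# cpu_tokens = greedy_decode(cpu_enc, cpu_dec, input_ids, attention_mask)
# dml_tokens = greedy_decode(dml_enc, dml_dec, input_ids, attention_mask)
# print("identical outputs:", cpu_tokens == dml_tokens)
```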