huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Apache License 2.0

Performance for the summarization task on BART is low after the Transformers 4.40 upgrade #1144

Open · astachowiczhabana opened 3 months ago

astachowiczhabana commented 3 months ago

System Info

Bad:
Optimum Habana latest main: c495f479d9abf04fb7adb6f0a5607d7963186649
Synapse docker image: v1.16

Good:
Optimum Habana, one commit before the Transformers 4.40 upgrade: 569580ff9bf44083514533ad28e336043891947b
Synapse docker image: v1.16

Information

Tasks

Reproduction

```shell
cd /root/optimum-habana/examples/summarization
pip install -r requirements.txt
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES=1 python run_summarization.py \
    --model_name_or_path facebook/bart-large-cnn \
    --do_predict \
    --predict_with_generate \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --output_dir ./tst-summarization \
    --overwrite_output_dir \
    --per_device_eval_batch_size 2 \
    --use_habana \
    --use_lazy_mode \
    --use_hpu_graphs_for_inference \
    --gaudi_config_name Habana/t5 \
    --ignore_pad_token_for_loss False \
    --pad_to_max_length \
    --num_beams 1 \
    --generation_num_beams 1 \
    --bf16 \
    --ignore_eos False
```

Expected behavior

The quickest way to check whether something is wrong is to observe performance.

Before the Transformers 4.40 upgrade the speed is ~3.9 it/s; after the upgrade it is ~1.7 it/s.
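For scale, a quick back-of-the-envelope calculation of the regression magnitude, using only the two throughput numbers reported above (the commit labels in the comments are taken from the System Info section):

```python
# Sketch: quantify the reported regression from the two measured throughputs.
before = 3.9  # it/s on 569580f (one commit before the Transformers 4.40 upgrade)
after = 1.7   # it/s on latest main c495f47 (after the upgrade)

slowdown = before / after              # how many times slower generation got
drop_pct = (1 - after / before) * 100  # throughput lost, in percent
print(f"{slowdown:.2f}x slower ({drop_pct:.0f}% throughput drop)")
# → 2.29x slower (56% throughput drop)
```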

regisss commented 3 days ago

@astachowiczhabana Are we still seeing this regression?