Closed vibhas-singh closed 2 months ago
- BetterTransformer will do more optimizations than just replace the model's attention implementation
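One concrete example of such an extra optimization (my understanding, not stated in this thread): BetterTransformer's encoder path can pack a padded batch into a PyTorch nested tensor, so no compute is spent on padding tokens. A minimal sketch of the idea:

```python
import torch

# Sketch (for illustration only): nested tensors store only the real tokens
# of each sequence, so attention/FFN work scales with the true lengths rather
# than with the padded max_length of the batch.
lengths = [8, 3, 5]
hidden = 4
sequences = [torch.randn(n, hidden) for n in lengths]

# A padded batch holds 3 * 8 * 4 values; the nested tensor holds only
# (8 + 3 + 5) * 4 values.
padded = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=True)
nested = torch.nested.nested_tensor(sequences)

print(padded.shape)      # torch.Size([3, 8, 4])
print(nested.is_nested)  # True
```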
Why, in that case, does the Optimum documentation recommend deprecating BetterTransformer where SDPA is available?
I'm not exactly sure what BetterTransformer is doing, but I have observed that it can significantly speed up my models (typically encoder models) on Windows despite flash attention not being available. Trying to use SDPA on Windows has, as far as I remember, not worked.
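For what it's worth, the SDPA op itself is not platform-specific: `torch.nn.functional.scaled_dot_product_attention` runs on CPU and on Windows, and simply falls back from the flash-attention backend to the math/efficient backends when those are unavailable. A quick self-contained check (a sketch, not taken from this thread):

```python
import torch
import torch.nn.functional as F

# Shapes are (batch, heads, seq_len, head_dim). On platforms without flash
# attention, SDPA silently dispatches to another available backend.
q = torch.randn(2, 4, 16, 8)
k = torch.randn(2, 4, 16, 8)
v = torch.randn(2, 4, 16, 8)
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 16, 8])
```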
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Encoder models might not all have SDPA available in transformers directly!
We compared the performance of SDPA and BetterTransformer on my company's project, and we observed no difference in performance.
Yep, that is what we expect!
System Info
transformers version: 4.41.1

Who can help?
@ArthurZucker @younesbelkada
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I am trying to optimise a fine-tuned BERT model for sequence classification using lower precision and SDPA. I am observing different behaviour when opting for SDPA via native transformers as compared to using BetterTransformer. I have a local dataset that I am using to record the inference time for the different settings; any dummy dataset or dummy model can be used to reproduce the behaviour. Every experiment uses the same dataset. `batch_size` is 128 and `max_length` is 128 for all the runs. Model performance is unchanged across all the runs. GPU: A10G

Experiment 1
Experiment 2
Experiment 3
Experiment 4
Experiment 5
Experiment 6
Expected behavior
Is there any difference between the SDPA implementations in Transformers vs BetterTransformer? I am able to achieve much better performance in terms of inference time using BetterTransformer as compared to Transformers (compare Exp 6 with Exp 3/4), which isn't intuitive. Ideally, both should be the same.
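As a self-contained sketch of the "both should be the same" expectation (dummy tensors, not the actual BERT runs above): a hand-written softmax(QKᵀ/√d)V attention and `F.scaled_dot_product_attention` produce matching outputs, so any remaining speed gap comes from kernel choice and extras like padding handling, not from the math itself:

```python
import time
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Reference implementation: softmax(q k^T / sqrt(d)) v
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

def bench(fn, *args, iters=10):
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

# Dummy tensors loosely mirroring the runs above (smaller batch for speed).
q = torch.randn(8, 12, 128, 64)
k = torch.randn(8, 12, 128, 64)
v = torch.randn(8, 12, 128, 64)

# Same numbers, different kernels.
assert torch.allclose(
    naive_attention(q, k, v),
    F.scaled_dot_product_attention(q, k, v),
    atol=1e-4,
)
print(f"naive: {bench(naive_attention, q, k, v) * 1e3:.2f} ms")
print(f"sdpa:  {bench(F.scaled_dot_product_attention, q, k, v) * 1e3:.2f} ms")
```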