microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

CUDA Cross-attention kernel #10047

Open romain-keramitas-prl opened 2 years ago

romain-keramitas-prl commented 2 years ago

Is your feature request related to a problem? Please describe.

I'm able to use the onnxruntime.transformers codebase to optimize Transformer-based models that use self-attention; however, it's not possible to use the self-attention kernel for cross-attention.

System information

Describe the solution you'd like

I would like to know if the implementation of a CUDA kernel for cross-attention is something you've considered adding to ONNX Runtime, or simply a modification of the current self-attention kernel to take one input for queries and one input for keys and values.
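To make the request concrete, here's a minimal NumPy sketch (single head, no masking, illustrative names only) of the distinction: the only structural difference is that cross-attention projects its queries from one input and its keys/values from another.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query_states, kv_states, w_q, w_k, w_v):
    # Q comes from query_states; K and V come from kv_states.
    q = query_states @ w_q              # (tgt_len, d)
    k = kv_states @ w_k                 # (src_len, d)
    v = kv_states @ w_v                 # (src_len, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v          # (tgt_len, d)

d = 64
rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
decoder_states = rng.standard_normal((5, d))
encoder_states = rng.standard_normal((12, d))

self_attn  = attention(decoder_states, decoder_states, w_q, w_k, w_v)  # self-attention
cross_attn = attention(decoder_states, encoder_states, w_q, w_k, w_v)  # cross-attention
```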

Describe alternatives you've considered

For generative models I think the self-attention kernel can be used after a first pass, as we can simply reuse past keys and values. However, that is not the case more generally, when you only perform a single inference on a given pair of inputs.
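To illustrate the "first pass" idea, here's a minimal sketch (hypothetical names, single head, no masking): the encoder keys and values never change during decoding, so they can be projected once and then fed back as the past key/value state on every subsequent step.

```python
import numpy as np

def project_kv(encoder_states, w_k, w_v):
    # done once, before decoding starts
    return encoder_states @ w_k, encoder_states @ w_v

def decode_step(token_state, past_k, past_v, w_q):
    # reuses the cached keys/values; nothing is re-projected per step
    q = token_state @ w_q
    scores = q @ past_k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ past_v

d = 64
rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
encoder_states = rng.standard_normal((12, d))

past_k, past_v = project_kv(encoder_states, w_k, w_v)    # first pass only
for _ in range(4):
    token_state = rng.standard_normal(d)
    out = decode_step(token_state, past_k, past_v, w_q)  # later passes reuse the cache
```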

romain-keramitas-prl commented 2 years ago

Actually, my bad! This is apparently already implemented under the name DecoderAttention, but the documentation file I was pointed to referenced a previous version of the repository instead of the most recent one :S

ytaous commented 2 years ago

Thanks, can you please point out which part of the doc should be corrected?

@wangyems

wangyems commented 2 years ago

Hi @rom1K, the link to the doc is based on rel-1.9 for stability reasons. Please refer to the master branch or rel-1.10 for the most up-to-date features.

Please also note that we only have a DecoderAttention CUDA op; there is no corresponding CPU op or graph transformation code for it at the moment.

romain-keramitas-prl commented 2 years ago

Hey @wangyems! Okay, thanks for the explanation :) Since I was on the master branch when reading the doc, I was surprised when it pointed me right back to the previous release, but I guess it makes sense.

> Please also note that we only have a DecoderAttention CUDA op; there is no corresponding CPU op or graph transformation code for it at the moment.

In my case that's not a problem: I'm using a GPU, and I've already had to modify the graph transformation code to adapt it to my model, so it shouldn't be too much of a hassle to adapt it some more.
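For anyone attempting the same thing, here's a rough sketch of how one might splice a com.microsoft DecoderAttention node into an exported graph with the onnx helper API. The input names, attribute, and file paths below are illustrative assumptions rather than the actual schema; check ContribOperators.md for the ONNX Runtime build you target, and in practice you would rewire the original cross-attention subgraph rather than just appending a node.

```python
import onnx
from onnx import helper

model = onnx.load("decoder_layer.onnx")  # hypothetical path

cross_attn = helper.make_node(
    "DecoderAttention",
    inputs=[
        "decoder_hidden_states",   # query input (illustrative name)
        "encoder_hidden_states",   # key/value input (illustrative name)
        "q_weight",                # query projection weight
        "kv_weight",               # fused key/value projection weight
        "qkv_bias",                # fused bias
    ],
    outputs=["cross_attn_output"],
    name="cross_attention_0",
    domain="com.microsoft",        # contrib op domain
    num_heads=8,
)

# make sure the contrib-op domain is declared in the model
if not any(op.domain == "com.microsoft" for op in model.opset_import):
    model.opset_import.append(helper.make_opsetid("com.microsoft", 1))

model.graph.node.append(cross_attn)
onnx.save(model, "decoder_layer_modified.onnx")
```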

romain-keramitas-prl commented 2 years ago

I think this issue can be closed, but I've got a couple of observations in case anyone else lands on this issue, which could also perhaps orient future development of this kernel.

I've been able to modify the ONNX graph of my decoder layer without too many problems; however, the way the kernel is designed, compared to the regular (self) Attention kernel, makes the process a bit tedious. Specifically:

I think if @wangyems or another maintainer has some time it would be cool to address some of these points, be it simply through documentation or by modifying the kernel. Finally, I must say that I haven't yet been able to see any performance increase when converting the model to FP16. I don't know if it's due to unnecessary casting back and forth between FP16 and FP32 between inputs and outputs, or if it's due to something else, like the kernel itself. It's pretty strange, since for the encoder counterpart I've seen a pretty significant speedup.
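One quick way to test the casting hypothesis (a small diagnostic sketch, plain onnx, hypothetical file name) is to count the Cast nodes in the converted graph and look at which ops feed them; if they cluster around the attention nodes, each call is paying for an FP16/FP32 round trip at the op boundary.

```python
import onnx
from collections import Counter

model = onnx.load("decoder_layer_fp16.onnx")  # hypothetical path

casts = [n for n in model.graph.node if n.op_type == "Cast"]
print(f"{len(casts)} Cast nodes in the graph")

# Which op produces each tensor being cast? This gives a rough picture of
# where the FP16/FP32 boundaries sit in the graph.
producer = {out: n.op_type for n in model.graph.node for out in n.output}
print(Counter(producer.get(c.input[0], "graph input / initializer") for c in casts))
```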

Anyway, although I may seem somewhat critical here, I'd like to thank you very much for your effort and work. All of this is awesome and I'm really having fun with it :100:

ytaous commented 2 years ago

will leave it open as an enhancement for review, thx

wangyems commented 2 years ago

@rom1K Thank you very much for the comments!