microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

CUDA Cross-attention kernel #10047

Open romain-keramitas-prl opened 2 years ago

romain-keramitas-prl commented 2 years ago

Is your feature request related to a problem? Please describe.

I'm able to use the onnxruntime.transformers codebase to optimize Transformer-based models that use self-attention; however, it's not possible to use the self-attention kernel for cross-attention.

System information

Describe the solution you'd like

I would like to know if the implementation of a CUDA kernel for cross-attention is something you've considered adding to ONNX Runtime, or simply a modification of the current self-attention kernel to take one input for queries and one input for keys and values.
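To make the request concrete, here's a minimal NumPy sketch (single head, no masking, illustrative names only) of the distinction: the only structural difference is that cross-attention projects its queries from one input and its keys/values from another.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query_states, kv_states, w_q, w_k, w_v):
    # Q comes from query_states; K and V come from kv_states.
    q = query_states @ w_q              # (tgt_len, d)
    k = kv_states @ w_k                 # (src_len, d)
    v = kv_states @ w_v                 # (src_len, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v          # (tgt_len, d)

d = 64
rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
decoder_states = rng.standard_normal((5, d))
encoder_states = rng.standard_normal((12, d))

self_attn  = attention(decoder_states, decoder_states, w_q, w_k, w_v)  # self-attention
cross_attn = attention(decoder_states, encoder_states, w_q, w_k, w_v)  # cross-attention
```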

Describe alternatives you've considered

For generative models I think the self-attention kernel can be used after a first pass, as we can simply reuse past keys and values. However, that is not the case more generally, when you only perform a single inference on a given pair of inputs.
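To illustrate the "first pass" idea, here's a minimal sketch (hypothetical names, single head, no masking): the encoder keys and values never change during decoding, so they can be projected once and then fed back as the past key/value state on every subsequent step.

```python
import numpy as np

def project_kv(encoder_states, w_k, w_v):
    # done once, before decoding starts
    return encoder_states @ w_k, encoder_states @ w_v

def decode_step(token_state, past_k, past_v, w_q):
    # reuses the cached keys/values; nothing is re-projected per step
    q = token_state @ w_q
    scores = q @ past_k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ past_v

d = 64
rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
encoder_states = rng.standard_normal((12, d))

past_k, past_v = project_kv(encoder_states, w_k, w_v)    # first pass only
for _ in range(4):
    token_state = rng.standard_normal(d)
    out = decode_step(token_state, past_k, past_v, w_q)  # later passes reuse the cache
```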

romain-keramitas-prl commented 2 years ago

Actually, my bad! This is apparently already implemented under the name DecoderAttention, but the documentation file I was pointed to referenced a previous version of the repository instead of the most recent one :S

ytaous commented 2 years ago

Thanks, can you please point out which part of the doc should be corrected?

@wangyems

wangyems commented 2 years ago

Hi @rom1K, the link to the doc is based on rel-1.9 for stability reasons. Please refer to the master branch or rel-1.10 for the most up-to-date features.

Please also note that we only have a DecoderAttention CUDA op; there is no corresponding CPU op or graph transformation code for it at the moment.

romain-keramitas-prl commented 2 years ago

Hey @wangyems! Okay, thanks for the explanation :) Since I was on the master branch when reading the doc, I was surprised when it pointed me right back to the previous release, but I guess it makes sense.

> Please also note that we only have a DecoderAttention CUDA op; there is no corresponding CPU op or graph transformation code for it at the moment.

In my case that's not a problem: I'm using a GPU, and I've already had to modify the graph transformation code to adapt it to my model, so it shouldn't be too much of a hassle to adapt it some more.
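For anyone attempting the same thing, here's a rough sketch of how one might splice a com.microsoft DecoderAttention node into an exported graph with the onnx helper API. The input names, attribute, and file paths below are illustrative assumptions rather than the actual schema; check ContribOperators.md for the ONNX Runtime build you target, and in practice you would rewire the original cross-attention subgraph rather than just appending a node.

```python
import onnx
from onnx import helper

model = onnx.load("decoder_layer.onnx")  # hypothetical path

cross_attn = helper.make_node(
    "DecoderAttention",
    inputs=[
        "decoder_hidden_states",   # query input (illustrative name)
        "encoder_hidden_states",   # key/value input (illustrative name)
        "q_weight",                # query projection weight
        "kv_weight",               # fused key/value projection weight
        "qkv_bias",                # fused bias
    ],
    outputs=["cross_attn_output"],
    name="cross_attention_0",
    domain="com.microsoft",        # contrib op domain
    num_heads=8,
)

# make sure the contrib-op domain is declared in the model
if not any(op.domain == "com.microsoft" for op in model.opset_import):
    model.opset_import.append(helper.make_opsetid("com.microsoft", 1))

model.graph.node.append(cross_attn)
onnx.save(model, "decoder_layer_modified.onnx")
```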

romain-keramitas-prl commented 2 years ago

I think this issue can be closed, but I've got a couple of observations in case anyone else lands on this issue, which could also perhaps orient future development of this kernel.

I've been able to modify the ONNX graph of my decoder layer without too many problems; however, the way the kernel is designed, compared to the regular (self) Attention kernel, makes the process a bit tedious. Specifically:

I think if @wangyems or another maintainer has some time it would be cool to address some of these points, be it simply through documentation or by modifying the kernel. Finally, I must say that I haven't yet been able to see any performance increase when converting the model to FP16. I don't know if it's due to unnecessary casting back and forth between FP16 and FP32 between inputs and outputs, or if it's due to something else, like the kernel itself. It's pretty strange, since for the encoder counterpart I've seen a pretty significant speedup.
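One quick way to test the casting hypothesis (a small diagnostic sketch, plain onnx, hypothetical file name) is to count the Cast nodes in the converted graph and look at which ops feed them; if they cluster around the attention nodes, each call is paying for an FP16/FP32 round trip at the op boundary.

```python
import onnx
from collections import Counter

model = onnx.load("decoder_layer_fp16.onnx")  # hypothetical path

casts = [n for n in model.graph.node if n.op_type == "Cast"]
print(f"{len(casts)} Cast nodes in the graph")

# Which op produces each tensor being cast? This gives a rough picture of
# where the FP16/FP32 boundaries sit in the graph.
producer = {out: n.op_type for n in model.graph.node for out in n.output}
print(Counter(producer.get(c.input[0], "graph input / initializer") for c in casts))
```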

Anyway, although I may seem somewhat critical here, I'd like to thank you very much for your effort and work. All of this is awesome and I'm really having fun with it :100:

ytaous commented 2 years ago

will leave it open as an enhancement for review, thx

wangyems commented 2 years ago

@rom1K Thank you very much for the comments!