justheuristic opened this issue 2 years ago
@krunt found a fused monolithic implementation of multi-head attention in NVIDIA Apex: https://github.com/NVIDIA/apex/tree/master/apex/contrib/csrc/fmha — the main pieces are ./apex/contrib/fmha/fmha.py and ./apex/contrib/csrc/fmha.
Let's compare how it performs vs. vanilla PyTorch MHA with sequence length 512, hidden size 4096, and 64 heads; a timing sketch is below.
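A minimal micro-benchmark sketch for the vanilla PyTorch side, using the configuration stated above (seq len 512, hidden size 4096, 64 heads). The batch size, fp16 dtype, and the timing helper are assumptions, not from the issue; the Apex FMHA path would be timed the same way once its exact entry point in `apex/contrib/fmha/fmha.py` is confirmed.

```python
import torch

def bench(fn, warmup=10, iters=50):
    # Simple CUDA-event timing helper (assumes a CUDA device is available).
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per forward pass

seq_len, hidden, heads, batch = 512, 4096, 64, 8  # batch size is an assumption

x = torch.randn(batch, seq_len, hidden, device="cuda", dtype=torch.float16)
mha = torch.nn.MultiheadAttention(hidden, heads, batch_first=True,
                                  device="cuda", dtype=torch.float16)

ms = bench(lambda: mha(x, x, x, need_weights=False))
print(f"vanilla nn.MultiheadAttention: {ms:.2f} ms / forward")

# TODO: time the Apex FMHA kernel here with the same inputs for a
# like-for-like comparison (its Python wrapper lives in apex/contrib/fmha).
```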