Hello, I would like to know why you use Multi-Head Conv Attention (MHCA) instead of standard Multi-Head Attention. With the current parameter configuration (n_qx_stride=1, n_kv_stride=1), the convolutional attention does not seem to have any effect.
While the default parameters for this class are n_qx_stride=1 and n_kv_stride=1, they are overwritten based on the model config file. See the code here and here.
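To make the role of the strides concrete, here is a minimal NumPy sketch of a strided 1-D convolutional projection, the kind MHCA typically applies to the query (and key/value) inputs before attention. All names and shapes below are illustrative assumptions, not the repository's actual implementation: with stride 1 the projection preserves the sequence length (behaving like a standard linear projection with local context), while a stride greater than 1 downsamples the sequence, which is where MHCA differs from plain MHA.

```python
import numpy as np

def strided_conv1d(x, w, stride):
    """Hypothetical 1-D convolution over time with 'same'-style padding.

    x: (T, d_in) input sequence, w: (k, d_in, d_out) conv kernel.
    Returns a (ceil(T / stride), d_out) array: stride=1 keeps the
    sequence length, stride>1 downsamples it.
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))          # pad along time only
    t_out = (x.shape[0] + stride - 1) // stride   # ceil(T / stride)
    return np.stack([
        # each output step mixes a local window of k time steps
        np.tensordot(xp[t * stride:t * stride + k], w, axes=([0, 1], [0, 1]))
        for t in range(t_out)
    ])

rng = np.random.default_rng(0)
T, d = 8, 4
x = rng.standard_normal((T, d))
w = rng.standard_normal((3, d, d))

q_stride1 = strided_conv1d(x, w, stride=1)  # shape (8, 4): length preserved
q_stride2 = strided_conv1d(x, w, stride=2)  # shape (4, 4): queries downsampled 2x
```

So with n_qx_stride=1 and n_kv_stride=1 the convolutional projections leave the sequence length untouched, and the layer behaves much like ordinary multi-head attention; the downsampling effect of MHCA only appears when the config file sets a stride greater than 1.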