upskyy closed 9 months ago
I based this on the multi-head attention in fairseq's conformer layer. [code] I also confirmed that the model trains with it.
_relative_shift
input: B X n_head X T X 2T-1
output: B X n_head X T X T
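For readers who want to see how the shape change in `_relative_shift` can happen, here is a minimal sketch of the pad-and-reshape trick commonly used in ESPnet-style relative-position attention. It assumes the last axis of the input is ordered from relative offset -(T-1) up to +(T-1), so the score for query `i` and key `j` sits at column `j - i + T - 1`. The function name `relative_shift` and that ordering are assumptions for illustration; the code in this PR may differ in detail.

```python
import torch


def relative_shift(pos_score: torch.Tensor) -> torch.Tensor:
    """Map relative-position scores (B, n_head, T, 2T-1) to (B, n_head, T, T).

    Hypothetical helper: assumes column k of the input holds the score for
    relative offset k - (T - 1), i.e. out[i, j] = in[i, j - i + T - 1].
    """
    b, h, t, n = pos_score.size()
    assert n == 2 * t - 1, "last dim must be 2T-1"

    # Prepend one zero column so each row has 2T entries.
    zero_pad = pos_score.new_zeros(b, h, t, 1)
    padded = torch.cat([zero_pad, pos_score], dim=-1)   # (B, H, T, 2T)

    # Reinterpret the same memory as (2T, T); this skews each row by one step.
    padded = padded.view(b, h, n + 1, t)                # (B, H, 2T, T)

    # Drop the first row, restore the (T, 2T-1) layout, keep the first T columns.
    shifted = padded[:, :, 1:].reshape(b, h, t, n)      # (B, H, T, 2T-1)
    return shifted[:, :, :, :t]                         # (B, H, T, T)


# Usage: B=2, n_head=4, T=5
scores = torch.randn(2, 4, 5, 2 * 5 - 1)
out = relative_shift(scores)
print(out.shape)  # torch.Size([2, 4, 5, 5])
```

The reshape avoids an explicit gather: row `i` ends up shifted left by `i` positions relative to row `0`, which is what aligns each query with its own window of relative offsets.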
Good job.