@xrc10
In the paper, the src-tgt attention on sentences comes after the src-tgt attention on tokens, but in the code the order is reversed. At line 1000 in MeetingNet_Transformer.py:
```python
def forward(self, y, token_enc_key, token_enc_value, sent_enc_key, sent_enc_value):
    query, key, value = self.decoder_splitter(y)
    # batch x len x n_state

    # self-attention
    a = self.attn(query, key, value, None, one_dir_visible=True)
    # batch x len x n_state

    n = self.ln_1(y + a) # residual

    if 'NO_HIERARCHY' in self.opt:
        q = y
        r = n
    else:
        # src-tgt attention on sentences
        q = self.sent_attn(n, sent_enc_key, sent_enc_value, None)
        r = self.ln_3(n + q) # residual
        # batch x len x n_state

    # src-tgt attention on tokens
    o = self.token_attn(r, token_enc_key, token_enc_value, None)
    p = self.ln_2(r + o) # residual
    # batch x len x n_state

    m = self.mlp(p)
    h = self.ln_4(p + m)
    return h
```
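For comparison, here is a minimal sketch of what the same block would look like if it followed the order described in the paper (src-tgt attention on tokens first, then on sentences). It reuses the module names from the snippet above and is only meant to illustrate the ordering difference I am asking about, not code from the repository:

```python
# Hypothetical variant following the paper's order: tokens first, then sentences.
# Module names (self.attn, self.token_attn, self.sent_attn, self.ln_*, self.mlp,
# self.decoder_splitter) are assumed from the snippet above.
def forward(self, y, token_enc_key, token_enc_value, sent_enc_key, sent_enc_value):
    query, key, value = self.decoder_splitter(y)

    # masked self-attention over the decoded prefix
    a = self.attn(query, key, value, None, one_dir_visible=True)
    n = self.ln_1(y + a)  # residual

    # src-tgt attention on tokens (first, as described in the paper)
    o = self.token_attn(n, token_enc_key, token_enc_value, None)
    p = self.ln_2(n + o)  # residual

    if 'NO_HIERARCHY' not in self.opt:
        # src-tgt attention on sentences (second, as described in the paper)
        q = self.sent_attn(p, sent_enc_key, sent_enc_value, None)
        p = self.ln_3(p + q)  # residual

    m = self.mlp(p)
    h = self.ln_4(p + m)
    return h
```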
I would like to confirm whether this is the intended behavior or not.