glassroom / heinsen_routing

Reference implementation of "An Algorithm for Routing Vectors in Sequences" (Heinsen, 2022) and "An Algorithm for Routing Capsules in All Domains" (Heinsen, 2019), for composing deep neural networks.
MIT License

Capsule network with multi-head-attention layer #2

Closed aimanmutasem closed 3 years ago

aimanmutasem commented 3 years ago

Dear Prof. Heinsen,

Good day.

First of all, I'm sorry to open a new issue again.

I have been trying to apply a capsule network to a Neural Machine Translation (NMT) task based on a multi-head attention network since June of last year. I was motivated by a previous study at AAAI 2019, in which the authors applied the old version of CapsNet to dynamically route the values (output) of the multi-head attention layer. Unfortunately, I'm facing some logical difficulties and I am not getting the expected results.

In my project, when I applied the capsule network to both the encoder and the decoder, specifically at the multi-head attention layer, I got very bad results. Moreover, applying CapsNet to the encoder only doesn't yield any improvement.

I hope you can give me some tips on how to apply CapsNet to a multi-head attention layer correctly.

Sample of code:

(Here `a` stands for the activations and `mu` for the mean vectors µ.)

import torch
import torch.nn as nn
import torch.nn.functional as F

from heinsen_routing import Routing

class DynamicAttentionAggregationLayer(nn.Module):

    def __init__(self, d_model, n_heads):
        super(DynamicAttentionAggregationLayer, self).__init__()
        assert d_model % n_heads == 0
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_size = d_model // n_heads
        self.fc_q = nn.Linear(d_model, d_model)
        self.fc_k = nn.Linear(d_model, d_model)
        self.fc_v = nn.Linear(d_model, d_model)
        self.fc_o = nn.Linear(d_model, d_model)

        # Each query position is treated as one input capsule of shape
        # [d_cov=n_heads, d_inp=head_size] (here 8 x 32, i.e., d_model = 256).
        self.routings = nn.Sequential(Routing(d_cov=8, d_inp=32, d_out=32))

    def forward(self, query, key, value, mask):
        """
        :param Tensor[batch_size, q_len, d_model] query
        :param Tensor[batch_size, k_len, d_model] key
        :param Tensor[batch_size, v_len, d_model] value
        :param Tensor[batch_size, ..., k_len] mask
        :return Tensor[batch_size, q_len, d_model] context
        :return Tensor[batch_size, n_heads, q_len, k_len] attention_weights
        """
        Q = self.fc_q(query)  # [batch_size, q_len, d_model]
        K = self.fc_k(key)    # [batch_size, k_len, d_model]
        V = self.fc_v(value)  # [batch_size, v_len, d_model]

        Q = Q.view(Q.size(0), -1, self.n_heads, self.head_size).permute(0, 2, 1, 3)  # [batch_size, n_heads, q_len, head_size]
        K = K.view(K.size(0), -1, self.n_heads, self.head_size).permute(0, 2, 1, 3)  # [batch_size, n_heads, k_len, head_size]
        V = V.view(V.size(0), -1, self.n_heads, self.head_size).permute(0, 2, 1, 3)  # [batch_size, n_heads, v_len, head_size]

        scores = torch.matmul(Q, K.transpose(-1, -2))  # [batch_size, n_heads, q_len, k_len]
        scores = scores / (self.head_size ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e18)
        attention_weights = F.softmax(scores, dim=-1)  # [batch_size, n_heads, q_len, k_len]

        # Routing ---------------------------------------------------------------------
        mu = torch.matmul(attention_weights, V)   # [batch_size, n_heads, q_len, head_size]
        mu = mu.permute(0, 2, 1, 3).contiguous()  # [batch_size, q_len, n_heads, head_size]
        a = torch.ones(mu.size(0), mu.size(1), device=mu.device)  # one activation per input capsule

        for routing in self.routings:  # lower-case name, to avoid shadowing the Routing class
            a, mu, sig2 = routing(a, mu)

        # Valid only if the routed capsules flatten back to d_model per position.
        context = mu.view(mu.size(0), -1, self.d_model)
        context = self.fc_o(context)  # [batch_size, q_len, d_model]
        # -----------------------------------------------------------------------------

        return context, attention_weights
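For what it's worth, here is a minimal smoke test of the layer above. The hyperparameters (d_model=256, n_heads=8, chosen so the routed capsules flatten back to d_model) and tensor shapes are illustrative assumptions, not values from the original project:

# Hypothetical smoke test; all values below are assumptions for illustration.
import torch

layer = DynamicAttentionAggregationLayer(d_model=256, n_heads=8)
x = torch.randn(2, 10, 256)     # [batch_size=2, seq_len=10, d_model=256]
mask = torch.ones(2, 1, 1, 10)  # broadcasts over [batch_size, n_heads, q_len, k_len]

context, attention_weights = layer(x, x, x, mask)  # self-attention
print(attention_weights.shape)  # torch.Size([2, 8, 10, 10])
# context is [2, n_out, 256]; it matches the input's [2, 10, 256] only if the
# routing module emits one output capsule per query position (n_out == q_len).
print(context.shape)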


fheinsen commented 3 years ago

Hi -- I have never reproduced Due et al.'s results, so I can't comment on how to implement their models. Moreover, my ability to study, understand, review, and debug other people's code is rather limited. In any case, it appears you're trying to do something new, so you may want to approach the exercise with an experimental mindset. I would recommend reading Andrej Karpathy's "how-to" recipe for training neural networks, which has lots of tips and tricks: http://karpathy.github.io/2019/04/25/recipe/
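For instance, one of the first sanity checks in that recipe is to make sure your model can overfit a single small batch before training on the full data. A minimal sketch of that check; the model, data, and loss below are placeholders, not parts of your project:

# Hypothetical "overfit one batch" check, in the spirit of Karpathy's recipe:
# if the loss won't approach zero on one fixed batch, there's likely a bug.
import torch
import torch.nn.functional as F

model = DynamicAttentionAggregationLayer(d_model=256, n_heads=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

batch_src = torch.randn(2, 10, 256)  # one small, fixed batch (placeholder data)
batch_tgt = torch.randn(2, 10, 256)  # placeholder regression targets

for step in range(500):
    optimizer.zero_grad()
    context, _ = model(batch_src, batch_src, batch_src, mask=None)
    # Assumes the routing preserves sequence length (n_out == q_len).
    loss = F.mse_loss(context, batch_tgt)
    loss.backward()
    optimizer.step()
# Watch the loss: it should fall toward zero; if it plateaus, inspect the routing step first.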