glassroom / heinsen_routing

Reference implementation of "An Algorithm for Routing Vectors in Sequences" (Heinsen, 2022) and "An Algorithm for Routing Capsules in All Domains" (Heinsen, 2019), for composing deep neural networks.
MIT License

Capsule network with multi-head-attention layer #2

Closed aimanmutasem closed 3 years ago

aimanmutasem commented 3 years ago

Dear Prof. Heinsen,

Good day.

First of all, I'm sorry to open a new issue again.

I have been trying to apply a capsule network to a Neural Machine Translation (NMT) task based on a multi-head attention network since June of last year. I was motivated by a previous study at AAAI 2019, in which the authors applied the old version of CapsNet to dynamically route the values (output) of the multi-head attention layer. Unfortunately, I'm facing some logical difficulties and I am not getting the expected results.

In my project, when I applied the capsule network to both the encoder and the decoder, specifically at the multi-head attention layer, I got very bad results. Moreover, applying CapsNet to the encoder only doesn't yield any improvement.

I hope you can give me some tips on how to apply CapsNet to a multi-head attention layer correctly.

Sample of code:

(Here `a` stands for the activations and `mu` for the mean vectors µ.)

import torch
import torch.nn as nn
import torch.nn.functional as F

from heinsen_routing import Routing

class DynamicAttentionAggregationLayer(nn.Module):

    def __init__(self, d_model, n_heads):
        super(DynamicAttentionAggregationLayer, self).__init__()
        assert d_model % n_heads == 0
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_size = d_model // n_heads
        self.fc_q = nn.Linear(d_model, d_model)
        self.fc_k = nn.Linear(d_model, d_model)
        self.fc_v = nn.Linear(d_model, d_model)
        self.fc_o = nn.Linear(d_model, d_model)

        # Each query position is treated as one input capsule of shape
        # [d_cov=n_heads, d_inp=head_size] (here 8 x 32, i.e., d_model = 256).
        self.routings = nn.Sequential(Routing(d_cov=8, d_inp=32, d_out=32))

    def forward(self, query, key, value, mask):
        """
        :param Tensor[batch_size, q_len, d_model] query
        :param Tensor[batch_size, k_len, d_model] key
        :param Tensor[batch_size, v_len, d_model] value
        :param Tensor[batch_size, ..., k_len] mask
        :return Tensor[batch_size, q_len, d_model] context
        :return Tensor[batch_size, n_heads, q_len, k_len] attention_weights
        """
        Q = self.fc_q(query)  # [batch_size, q_len, d_model]
        K = self.fc_k(key)    # [batch_size, k_len, d_model]
        V = self.fc_v(value)  # [batch_size, v_len, d_model]

        Q = Q.view(Q.size(0), -1, self.n_heads, self.head_size).permute(0, 2, 1, 3)  # [batch_size, n_heads, q_len, head_size]
        K = K.view(K.size(0), -1, self.n_heads, self.head_size).permute(0, 2, 1, 3)  # [batch_size, n_heads, k_len, head_size]
        V = V.view(V.size(0), -1, self.n_heads, self.head_size).permute(0, 2, 1, 3)  # [batch_size, n_heads, v_len, head_size]

        scores = torch.matmul(Q, K.transpose(-1, -2))  # [batch_size, n_heads, q_len, k_len]
        scores = scores / (self.head_size ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e18)
        attention_weights = F.softmax(scores, dim=-1)  # [batch_size, n_heads, q_len, k_len]

        # Routing ---------------------------------------------------------------------
        mu = torch.matmul(attention_weights, V)   # [batch_size, n_heads, q_len, head_size]
        mu = mu.permute(0, 2, 1, 3).contiguous()  # [batch_size, q_len, n_heads, head_size]
        a = torch.ones(mu.size(0), mu.size(1), device=mu.device)  # one activation per input capsule

        for routing in self.routings:  # lower-case name, to avoid shadowing the Routing class
            a, mu, sig2 = routing(a, mu)

        # Valid only if the routed capsules flatten back to d_model per position.
        context = mu.view(mu.size(0), -1, self.d_model)
        context = self.fc_o(context)  # [batch_size, q_len, d_model]
        # -----------------------------------------------------------------------------

        return context, attention_weights
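For what it's worth, here is a minimal smoke test of the layer above. The hyperparameters (d_model=256, n_heads=8, chosen so the routed capsules flatten back to d_model) and tensor shapes are illustrative assumptions, not values from the original project:

# Hypothetical smoke test; all values below are assumptions for illustration.
import torch

layer = DynamicAttentionAggregationLayer(d_model=256, n_heads=8)
x = torch.randn(2, 10, 256)     # [batch_size=2, seq_len=10, d_model=256]
mask = torch.ones(2, 1, 1, 10)  # broadcasts over [batch_size, n_heads, q_len, k_len]

context, attention_weights = layer(x, x, x, mask)  # self-attention
print(attention_weights.shape)  # torch.Size([2, 8, 10, 10])
# context is [2, n_out, 256]; it matches the input's [2, 10, 256] only if the
# routing module emits one output capsule per query position (n_out == q_len).
print(context.shape)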


fheinsen commented 3 years ago

Hi -- I have never reproduced Due et al.'s results, so I can't comment on how to implement their models. Moreover, my ability to study, understand, review, and debug other people's code is rather limited. In any case, it appears you're trying to do something new, so you may want to approach the exercise with an experimental mindset. I would recommend reading Andrej Karpathy's "how-to" recipe for training neural networks, which has lots of tips and tricks: http://karpathy.github.io/2019/04/25/recipe/
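For instance, one of the first sanity checks in that recipe is to make sure your model can overfit a single small batch before training on the full data. A minimal sketch of that check; the model, data, and loss below are placeholders, not parts of your project:

# Hypothetical "overfit one batch" check, in the spirit of Karpathy's recipe:
# if the loss won't approach zero on one fixed batch, there's likely a bug.
import torch
import torch.nn.functional as F

model = DynamicAttentionAggregationLayer(d_model=256, n_heads=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

batch_src = torch.randn(2, 10, 256)  # one small, fixed batch (placeholder data)
batch_tgt = torch.randn(2, 10, 256)  # placeholder regression targets

for step in range(500):
    optimizer.zero_grad()
    context, _ = model(batch_src, batch_src, batch_src, mask=None)
    # Assumes the routing preserves sequence length (n_out == q_len).
    loss = F.mse_loss(context, batch_tgt)
    loss.backward()
    optimizer.step()
# Watch the loss: it should fall toward zero; if it plateaus, inspect the routing step first.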