Thanks for the kind words.
Your code is instantiating a new PyTorch nn.Module with randomly initialized weights on each forward pass. As with every other PyTorch nn.Module, you must instantiate the routing module at initialization (e.g., self.my_routing_module = Routing(...)) and then use it in the forward pass (e.g., a_out, mu_out, sig2_out = self.my_routing_module(...)) so it can learn during training.
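For example, here is a minimal sketch of that pattern (the constructor arguments and input shapes below are illustrative placeholders; see this repo's README for the real ones):

```python
import torch.nn as nn
from heinsen_routing import Routing  # the routing module from this repo

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Instantiate ONCE here, so the routing weights are registered as
        # parameters of this module and updated by the optimizer.
        # Constructor arguments are illustrative placeholders.
        self.my_routing_module = Routing(d_cov=1, d_inp=64, d_out=64,
                                         n_inp=8, n_out=4)

    def forward(self, a_in, mu_in):
        # Reuse the same instance on every forward pass; creating a new
        # Routing(...) here would give fresh random weights on each call.
        a_out, mu_out, sig2_out = self.my_routing_module(a_in, mu_in)
        return a_out, mu_out, sig2_out
```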
If you have more questions about the basic usage of PyTorch, please ask them in a PyTorch forum, not here.
Good luck!
Dear Mr. Heinsen,
Thank you so much for the magnificent article and its simple implementation. I have enjoyed reading the article several times.
I'm trying to apply your Heinsen routing to a neural machine translation (NMT) task using the Transformer-base architecture.
The idea is to use each attention head's output as an input capsule, on both the encoder and decoder sides, then concatenate the output capsules and forward them to the next layer, the feed-forward network (FFN), as sketched below.
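Roughly, something like this (a simplified sketch; module names, dimensions, and activation values are illustrative, and I'm assuming Routing broadcasts over a leading batch dimension; if not, the shapes need adapting):

```python
import torch
import torch.nn as nn
from heinsen_routing import Routing  # routing module from this repo

class RoutedHeads(nn.Module):
    """Sketch: route each head's output as an input capsule, then
    concatenate the output capsules and feed them to the FFN."""
    def __init__(self, hid_dim=512, n_heads=8, n_out=8):
        super().__init__()
        head_dim = hid_dim // n_heads
        # One routing module, instantiated at init (not inside forward).
        self.routing = Routing(d_cov=1, d_inp=head_dim, d_out=head_dim,
                               n_inp=n_heads, n_out=n_out)
        self.ffn = nn.Sequential(nn.Linear(n_out * head_dim, 4 * hid_dim),
                                 nn.ReLU(),
                                 nn.Linear(4 * hid_dim, hid_dim))

    def forward(self, heads):  # heads: [batch, seq, n_heads, head_dim]
        batch, seq, n_heads, head_dim = heads.shape
        # Treat each head's output at each position as one input capsule.
        mu_in = heads.reshape(batch * seq, n_heads, 1, head_dim)
        a_in = torch.ones(batch * seq, n_heads,
                          device=heads.device)  # placeholder activations
        a_out, mu_out, sig2_out = self.routing(a_in, mu_in)
        out = mu_out.reshape(batch, seq, -1)  # concatenate output capsules
        return self.ffn(out)
```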
The problem is that the results are poor and training time has increased by 500% (I think because the routing process runs on the CPU).
I don't know where I went wrong; I think there is a mistake in my implementation.
I hope you can help me find the best way to apply your routing to my project.
Sincerely, Aiman
Please take a look at the implementation code and the attached diagram.
```python
class MultiHeadAttentionLayer(nn.Module):