Currently there are two implementations of multi-head attention. The one in use at the moment lives in astronet.t2.attention.py; the other is in astronet.t2.multihead_attention.py.
The current one does not use masking, while the other does; it is not yet certain whether masking is required for the supernova setup we are going for.
Unit tests exist for the astronet.t2.multihead_attention.py implementation, but not for the one currently in use; this should be addressed. The astronet.t2.multihead_attention.py implementation also returns both outputs and attention_weights.
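A common pattern that matches the described behaviour (an optional mask plus a two-value return) is the scaled dot-product attention routine from the TensorFlow transformer tutorial. A minimal sketch for comparison; the function name and the mask convention here are assumptions, not necessarily the exact code in astronet.t2.multihead_attention.py:

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Scaled dot-product attention with an optional additive mask.

    q, k, v: (..., seq_len, depth) tensors. The mask broadcasts to the
    logits shape (..., seq_len_q, seq_len_k).
    """
    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)

    if mask is not None:
        # Assumed convention: mask == 1 marks positions to hide. Those
        # logits get a large negative value so the softmax assigns them
        # near-zero attention weight.
        scaled_logits += (mask * -1e9)

    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)
    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
    return output, attention_weights
```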
Next steps are to implement tests for the implementation in use, and to compare the two implementations to decide which is best.
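As a starting point for those tests, a shape check in the tf.test.TestCase style from the testing article linked below could look like the sketch that follows. The class name MultiHeadSelfAttention and the embed_dim/num_heads constructor are taken from the Keras example the in-use implementation is based on; the actual names in astronet.t2.attention.py may differ:

```python
import tensorflow as tf

# Assumed class name, following the Keras example; adjust to the module.
from astronet.t2.attention import MultiHeadSelfAttention

class MultiHeadSelfAttentionTest(tf.test.TestCase):
    def test_output_shape(self):
        # Assumed constructor arguments; the in-use layer may differ.
        layer = MultiHeadSelfAttention(embed_dim=32, num_heads=4)
        x = tf.random.uniform((2, 10, 32))  # (batch, seq_len, embed_dim)
        out = layer(x)
        # Self-attention should preserve the input shape.
        self.assertEqual(out.shape, (2, 10, 32))

if __name__ == "__main__":
    tf.test.main()
```

Running the same shape check against both implementations (unpacking the (outputs, attention_weights) tuple for the masking variant) would also be a cheap first pass at the comparison above.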
Refs:
astronet.t2.multihead_attention.py: https://medium.com/@burnmg/software-testing-in-tensorflow-2-0-33c440ca908c
astronet.t2.attention.py: https://keras.io/examples/nlp/text_classification_with_transformer/