pbloem / former

Simple transformer implementation from scratch in pytorch.
http://peterbloem.nl/blog/transformers
MIT License

Improve performance in a deeper network using your multi-head attention code #29

Closed · fatemehniknezhad closed this issue 2 years ago

fatemehniknezhad commented 2 years ago

Hi Mr. Peter, thanks for the insightful blog on how to build transformers from scratch. I am a master's student and I used your code for my thesis. In my model, after the embedding layer (pretrained embedding + position embedding), I use only a multi-head attention layer, whose output is fed into a Bi-LSTM network. But the results do not improve over the accuracy of a single Bi-LSTM model. What do you think is the reason for this, and how can I fix it? Thanks.
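For reference, a minimal PyTorch sketch of the pipeline described above; the class name, dimensions, pooling and classifier head are assumptions for illustration, not the actual thesis code:

```python
import torch
from torch import nn

class AttnBiLSTMClassifier(nn.Module):
    # Hypothetical reconstruction of the described setup:
    # (pretrained + position) embedding -> multi-head self-attention -> Bi-LSTM -> classifier.
    def __init__(self, emb_weights, max_len=256, heads=4, hidden=128, num_classes=2):
        super().__init__()
        vocab, emb_dim = emb_weights.shape      # emb_dim must be divisible by heads
        self.tok_emb = nn.Embedding.from_pretrained(emb_weights, freeze=False)
        self.pos_emb = nn.Embedding(max_len, emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, num_heads=heads, batch_first=True)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                       # x: (batch, seq) of token ids
        b, t = x.size()
        pos = torch.arange(t, device=x.device)
        h = self.tok_emb(x) + self.pos_emb(pos)[None, :, :]
        h, _ = self.attn(h, h, h)               # self-attention: queries = keys = values
        h, _ = self.bilstm(h)                   # (batch, seq, 2*hidden)
        return self.out(h.mean(dim=1))          # mean-pool over time, then classify
```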

pbloem commented 2 years ago

Why do you expect it to improve?

It may be that the benefit depends too much on hyperparameter tuning, or that it is too small to measure among all the various sources of noise you get in machine learning. It may also be that the LSTM already solves the problem as well as it can be solved.

It's important to note that self-attention is not "better" than LSTMs in all cases, and in all ways. The reason the field switched from RNNs to self-attention is that self-attention is easier to parallelize. RNNs were performing fine, but we hit a ceiling in how big we could make the models because of their serial nature. Self-attention allowed us to punch through this ceiling. However, it takes many layers of self-attention to be as powerful as one LSTM layer.
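To make the parallelism point concrete, here is a toy comparison (not code from this repo): simple self-attention produces all output vectors with a few batched matrix multiplications, while a recurrent cell has to loop over the time steps:

```python
import torch
import torch.nn.functional as F

b, t, e = 8, 100, 64
x = torch.randn(b, t, e)

# Self-attention (simplified, no heads/projections): all t outputs at once.
w = F.softmax(torch.bmm(x, x.transpose(1, 2)) / e ** 0.5, dim=2)
y_attn = torch.bmm(w, x)                        # (b, t, e) in one shot

# RNN-style recurrence: each step depends on the previous hidden state.
cell = torch.nn.RNNCell(e, e)
h = torch.zeros(b, e)
outs = []
for i in range(t):                              # inherently serial over time
    h = cell(x[:, i, :], h)
    outs.append(h)
y_rnn = torch.stack(outs, dim=1)                # (b, t, e)
```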

There were some successful early models that combined the two (before we moved to self-attention only). I think they mostly put the attention in between the LSTM layers. I suggest you look at the models cited in the introduction of Vaswani et al. (2017). That is the paper that heralded the move to attention-only, so the models they cite are the best examples of attention combined with something else.
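One plausible arrangement in that spirit, purely as an illustrative sketch and not any specific published architecture: a self-attention block with a residual connection sandwiched between two Bi-LSTM layers:

```python
import torch
from torch import nn

class LSTMWithAttention(nn.Module):
    # Illustrative only: attention placed between recurrent layers.
    def __init__(self, emb_dim=128, hidden=128, heads=4):
        super().__init__()
        self.lstm1 = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=heads, batch_first=True)
        self.norm = nn.LayerNorm(2 * hidden)
        self.lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                       # x: (batch, seq, emb_dim)
        h, _ = self.lstm1(x)
        a, _ = self.attn(h, h, h)
        h = self.norm(h + a)                    # residual connection around attention
        h, _ = self.lstm2(h)
        return h
```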

fatemehniknezhad commented 2 years ago

Thanks for the suggestion. My thesis is on sentiment analysis of objective text: the sentences are news text and state facts. I want to design a network that can capture semantic compositionality, for example conjunctions between clauses or more complex relationships within a sequence.

That is, like your example ("Mary gave roses to Susan": in a single self-attention operation all this information just gets summed together, so if Susan gave Mary the roses instead, the output vector 𝐲gave would be the same, even though the meaning has changed). Analogously, I want the polarity of the sentence "Susan attacked Mary" to be different from that of "Mary attacked Susan". Again, as you say, adding position information as a positional embedding should be necessary in this case. Is that correct? So I want the network to learn the important combinations and relationships using the multi-head attention layer after the embedding, and then the Bi-LSTM network to capture the sequence. I have tried various hyperparameter settings to get better results, but the changes are very small. Given my implementation, I am starting to doubt whether this idea works in practice. Can you help me?
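To see why the position embedding matters for the attention layer specifically, here is a toy check (an assumed setup, not the thesis code): without position embeddings, reordering the tokens only permutes the self-attention outputs, so a mean-pooled sentence vector is identical for both orderings; adding position embeddings breaks that symmetry:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
emb = torch.nn.Embedding(10, 16)                # toy vocabulary
pos = torch.nn.Embedding(5, 16)                 # toy position embedding

def pooled_attention(tokens, use_pos):
    x = emb(tokens)
    if use_pos:
        x = x + pos(torch.arange(len(tokens)))
    w = F.softmax(x @ x.t() / 16 ** 0.5, dim=1)
    y = w @ x                                   # simple self-attention
    return y.mean(dim=0)                        # pooled sentence vector

mary_attacked_susan = torch.tensor([0, 1, 2])   # "mary attacked susan"
susan_attacked_mary = torch.tensor([2, 1, 0])   # "susan attacked mary"

# Without positions the pooled vectors coincide; with positions they differ.
print(torch.allclose(pooled_attention(mary_attacked_susan, False),
                     pooled_attention(susan_attacked_mary, False)))   # True
print(torch.allclose(pooled_attention(mary_attacked_susan, True),
                     pooled_attention(susan_attacked_mary, True)))    # False
```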

pbloem commented 2 years ago

I'm not sure I can help you here. Everything you describe about word order, the LSTM can already do without position embeddings (or self attention). The position vectors are just there to benefit the self-attention. I see no reason to think that adding attention to an LSTM in this case would massively improve it, and again, such models have already been tried. I suggest you look at the literature to see what the best way is to combine RNNs and attention.
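For completeness, the same kind of toy check with a Bi-LSTM and no position embeddings (again an assumed toy setup, not from the repo): because the recurrence processes the tokens in order, the pooled representations of the two orderings already differ:

```python
import torch

torch.manual_seed(0)
emb = torch.nn.Embedding(10, 16)
lstm = torch.nn.LSTM(16, 16, batch_first=True, bidirectional=True)

def pooled_lstm(tokens):
    x = emb(tokens)[None, :, :]                 # (1, seq, 16), no position embedding
    y, _ = lstm(x)
    return y.mean(dim=1)                        # pooled sentence vector

a = pooled_lstm(torch.tensor([0, 1, 2]))        # "mary attacked susan"
b = pooled_lstm(torch.tensor([2, 1, 0]))        # "susan attacked mary"
print(torch.allclose(a, b))                     # False: the LSTM is order-sensitive
```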