rafiepour / CTran

Complete code for the proposed CNN-Transformer model for natural language understanding.
https://github.com/rafiepour/CTran
Apache License 2.0

Why the addition of Self-Attention for Intent-Detection? #3

Closed · annanfree closed this issue 2 months ago

annanfree commented 3 months ago

Why not use a simple fully connected layer for intent classification instead of self-attention? How can this choice be explained?

rafiepour commented 2 months ago

During our experiments, we found that adding an extra self-attention layer improves performance on the other co-training task. We hypothesize that distancing the gradients of the downstream tasks from the shared encoder improved performance. There was no initial intuition behind this addition; it came out of trial and error.
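For illustration only, here is a minimal sketch (not the actual CTran code; module names, dimensions, and label counts are made up) of a shared encoder with two heads, where the intent head gets its own self-attention layer before its classifier:

```python
import torch
import torch.nn as nn

class JointModel(nn.Module):
    """Sketch: shared encoder + slot head + intent head with an extra self-attention layer."""
    def __init__(self, hidden_dim=256, num_slots=120, num_intents=21, num_heads=4):
        super().__init__()
        # Shared encoder used by both tasks (stand-in for the CNN-Transformer encoder).
        self.shared_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads, batch_first=True),
            num_layers=2,
        )
        # Slot-filling head: token-level classifier directly on the encoder outputs.
        self.slot_classifier = nn.Linear(hidden_dim, num_slots)
        # Intent head: task-specific self-attention *before* the classifier,
        # so intent gradients pass through these extra parameters before
        # reaching the shared encoder.
        self.intent_attention = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.intent_classifier = nn.Linear(hidden_dim, num_intents)

    def forward(self, x):
        h = self.shared_encoder(x)                              # (batch, seq, hidden)
        slot_logits = self.slot_classifier(h)                   # per-token slot logits
        a, _ = self.intent_attention(h, h, h)                   # task-specific self-attention
        intent_logits = self.intent_classifier(a.mean(dim=1))   # pooled utterance-level logits
        return slot_logits, intent_logits
```

A plain fully connected intent head would drop `intent_attention` and classify a pooled `h` directly; the observed gain came from inserting the attention block in between.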

Here is some explanation of the term "distancing the gradients":

Distancing gradients: this refers to techniques that reduce the interference between gradients coming from different tasks or components of a model. In multi-task learning, several tasks often share a common encoder (the part of the network that turns the input into a representation useful for all tasks). If the gradients from different tasks interfere with each other during backpropagation, overall performance suffers. "Distancing the gradients" means mitigating this interference, in this case by placing task-specific layers between the shared encoder and each task's classifier. A rough way to observe such interference is shown after this paragraph.
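Continuing the sketch above (again purely illustrative, not part of the repository), one can backpropagate each task's loss separately and compare the gradients that actually reach the shared encoder; a negative cosine similarity indicates the two tasks are pulling the shared parameters in conflicting directions:

```python
import torch
import torch.nn.functional as F

# Dummy batch: (batch, seq, hidden) inputs and random labels for both tasks.
model = JointModel()
x = torch.randn(8, 32, 256)
slot_targets = torch.randint(0, 120, (8, 32))
intent_targets = torch.randint(0, 21, (8,))

slot_logits, intent_logits = model(x)
slot_loss = F.cross_entropy(slot_logits.reshape(-1, 120), slot_targets.reshape(-1))
intent_loss = F.cross_entropy(intent_logits, intent_targets)

# Backpropagate each loss on its own (retain_graph keeps the graph for the second pass).
slot_loss.backward(retain_graph=True)
slot_grads = {n: p.grad.clone() for n, p in model.shared_encoder.named_parameters()}
model.zero_grad()
intent_loss.backward()
intent_grads = {n: p.grad.clone() for n, p in model.shared_encoder.named_parameters()}

# Compare gradient directions on the shared encoder parameters.
for name in slot_grads:
    cos = F.cosine_similarity(slot_grads[name].flatten(),
                              intent_grads[name].flatten(), dim=0)
    print(f"{name}: cosine(slot_grad, intent_grad) = {cos.item():.3f}")
```

Extra task-specific layers (such as the intent head's self-attention) absorb and reshape part of each task's gradient before it reaches the shared encoder, which is one plausible reading of why the addition helped.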