rafiepour / CTran

Complete code for the proposed CNN-Transformer model for natural language understanding.
https://github.com/rafiepour/CTran
Apache License 2.0

Why the addition of Self-Attention for Intent-Detection? #3

Closed · annanfree closed this issue 2 months ago

annanfree commented 3 months ago

Why not use a simple fully connected layer for intent classification instead of self-attention? How can this choice be explained?

rafiepour commented 2 months ago

During our experiments, we found that adding an extra self-attention layer improves performance on the other co-training task. We hypothesize that distancing the gradients of the downstream tasks from the shared encoder improved performance. There was no initial intuition behind this addition; it came out of trial and error.
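For illustration only, here is a minimal sketch (not the actual CTran code; module names, dimensions, and label counts are made up) of a shared encoder with two heads, where the intent head gets its own self-attention layer before its classifier:

```python
import torch
import torch.nn as nn

class JointModel(nn.Module):
    """Sketch: shared encoder + slot head + intent head with an extra self-attention layer."""
    def __init__(self, hidden_dim=256, num_slots=120, num_intents=21, num_heads=4):
        super().__init__()
        # Shared encoder used by both tasks (stand-in for the CNN-Transformer encoder).
        self.shared_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads, batch_first=True),
            num_layers=2,
        )
        # Slot-filling head: token-level classifier directly on the encoder outputs.
        self.slot_classifier = nn.Linear(hidden_dim, num_slots)
        # Intent head: task-specific self-attention *before* the classifier,
        # so intent gradients pass through these extra parameters before
        # reaching the shared encoder.
        self.intent_attention = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.intent_classifier = nn.Linear(hidden_dim, num_intents)

    def forward(self, x):
        h = self.shared_encoder(x)                              # (batch, seq, hidden)
        slot_logits = self.slot_classifier(h)                   # per-token slot logits
        a, _ = self.intent_attention(h, h, h)                   # task-specific self-attention
        intent_logits = self.intent_classifier(a.mean(dim=1))   # pooled utterance-level logits
        return slot_logits, intent_logits
```

A plain fully connected intent head would drop `intent_attention` and classify a pooled `h` directly; the observed gain came from inserting the attention block in between.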

Here is some explanation of the term "distancing the gradients":

Distancing gradients: this refers to techniques that reduce the interference between gradients coming from different tasks or components of a model. In multi-task learning, several tasks often share a common encoder (the part of the network that turns the input into a representation useful for all tasks). If the gradients from different tasks interfere with each other during backpropagation, overall performance suffers. "Distancing the gradients" means mitigating this interference, in this case by placing task-specific layers between the shared encoder and each task's classifier. A rough way to observe such interference is shown after this paragraph.
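Continuing the sketch above (again purely illustrative, not part of the repository), one can backpropagate each task's loss separately and compare the gradients that actually reach the shared encoder; a negative cosine similarity indicates the two tasks are pulling the shared parameters in conflicting directions:

```python
import torch
import torch.nn.functional as F

# Dummy batch: (batch, seq, hidden) inputs and random labels for both tasks.
model = JointModel()
x = torch.randn(8, 32, 256)
slot_targets = torch.randint(0, 120, (8, 32))
intent_targets = torch.randint(0, 21, (8,))

slot_logits, intent_logits = model(x)
slot_loss = F.cross_entropy(slot_logits.reshape(-1, 120), slot_targets.reshape(-1))
intent_loss = F.cross_entropy(intent_logits, intent_targets)

# Backpropagate each loss on its own (retain_graph keeps the graph for the second pass).
slot_loss.backward(retain_graph=True)
slot_grads = {n: p.grad.clone() for n, p in model.shared_encoder.named_parameters()}
model.zero_grad()
intent_loss.backward()
intent_grads = {n: p.grad.clone() for n, p in model.shared_encoder.named_parameters()}

# Compare gradient directions on the shared encoder parameters.
for name in slot_grads:
    cos = F.cosine_similarity(slot_grads[name].flatten(),
                              intent_grads[name].flatten(), dim=0)
    print(f"{name}: cosine(slot_grad, intent_grad) = {cos.item():.3f}")
```

Extra task-specific layers (such as the intent head's self-attention) absorb and reshape part of each task's gradient before it reaches the shared encoder, which is one plausible reading of why the addition helped.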