As discussed in A.3, training a and b does not seem to influence performance. One intuition is, as you mentioned, that "the b_j values do not deviate significantly from their initial values" (what about a_j?). Do you have any theoretical evidence for why this is true? To my knowledge, in language-processing tasks the token embeddings are also learned, and this improves performance.
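As a concrete way to probe this, one could log how far a_j and b_j drift from their initialization during training. Below is a minimal sketch of that check; the model and loss are placeholders, not the architecture from A.3, and the parameter shapes are assumptions for illustration only.

```python
import torch

# Hypothetical sketch: measure how far trainable a_j, b_j move from their initial values.
# The model/loss here are placeholders, not the setup from the paper (A.3 / Figure 8).
torch.manual_seed(0)
a = torch.nn.Parameter(torch.randn(64))
b = torch.nn.Parameter(torch.randn(64))
a_init, b_init = a.detach().clone(), b.detach().clone()

x = torch.randn(256, 64)   # dummy inputs
y = torch.randn(256)       # dummy targets
opt = torch.optim.Adam([a, b], lr=1e-3)

for _ in range(200):       # short toy training loop
    pred = torch.tanh(x * a + b).mean(dim=1)   # placeholder model using a, b
    loss = ((pred - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Relative drift of each parameter vector from its initialization.
print("a drift:", ((a.detach() - a_init).norm() / a_init.norm()).item())
print("b drift:", ((b.detach() - b_init).norm() / b_init.norm()).item())
```

If both drifts stay small under the actual training setup, that would support the intuition quoted above; a large drift in a_j but not b_j would suggest the two parameters behave differently.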
Is there a notebook to experiment with this (Figure 8)?