Open kimborgen opened 1 year ago
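Because of the parallel attention/MLP layout, the final output contains an "unprocessed" attention output: the last block's attention result is added to the residual stream without ever passing through an MLP. Could we improve performance by adding a final MLP layer?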
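To make the question concrete, here is a minimal NumPy sketch of what I mean. It is not this repository's code: the names (`parallel_block`, `forward`, `final_mlp`), the width `D`, the ReLU MLP, and the single-head causal attention are all assumptions chosen for illustration. Each block computes `x + attn(ln(x)) + mlp(ln(x))`, so the attention output of the last block reaches the output directly. The optional `final_mlp` adds one more residual MLP on top.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hypothetical model width, chosen only for this sketch


def layer_norm(x, eps=1e-5):
    # Normalization without learned scale/shift, to keep the sketch short.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def make_mlp(d, hidden):
    w1 = rng.normal(0.0, 0.02, (d, hidden))
    w2 = rng.normal(0.0, 0.02, (hidden, d))

    def mlp(x):
        # ReLU is a stand-in for whatever activation the real model uses.
        return np.maximum(x @ w1, 0.0) @ w2

    return mlp


def make_attention(d):
    wq, wk, wv, wo = (rng.normal(0.0, 0.02, (d, d)) for _ in range(4))

    def attn(x):
        # Single-head causal self-attention over a (seq_len, d) input.
        q, k, v = x @ wq, x @ wk, x @ wv
        scores = q @ k.T / np.sqrt(d)
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return (weights @ v) @ wo

    return attn


def parallel_block(x, attn, mlp):
    # Attention and MLP both read the same normalized input; the MLP in
    # this block never sees this block's attention output.
    h = layer_norm(x)
    return x + attn(h) + mlp(h)


def forward(x, blocks, final_mlp=None):
    for attn, mlp in blocks:
        x = parallel_block(x, attn, mlp)
    if final_mlp is not None:
        # Proposed extra step: let one more MLP process the last
        # block's attention output before the model's output.
        x = x + final_mlp(layer_norm(x))
    return x


x = rng.normal(size=(4, D))  # 4 tokens of width D
blocks = [(make_attention(D), make_mlp(D, 4 * D)) for _ in range(2)]
final_mlp = make_mlp(D, 4 * D)

baseline = forward(x, blocks)
extended = forward(x, blocks, final_mlp)
```

Here `baseline` is the current behaviour and `extended` is the variant I am asking about. Whether the extra layer actually helps would have to be measured by training both variants.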