Open LexaNagiBator228 opened 3 years ago
Only the first token is used for classification. Please refer to the paper.
I believe the paper uses average pooling. Of course, using only the first token might still give you great results, but using information from all 16 tokens should be better.
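To make the difference concrete, here is a minimal sketch of the two options in plain Python. It is not the repository's code: the function names, the toy `tokens` values, and the embedding size of 4 are all made up for illustration, and a real model would do this with tensor operations on the encoder output.

```python
from statistics import fmean


def first_token_pool(tokens):
    """Return the embedding of the first token only."""
    return list(tokens[0])


def mean_pool(tokens):
    """Average each embedding dimension across all tokens."""
    return [fmean(dim) for dim in zip(*tokens)]


# Hypothetical encoder output: 16 tokens, 4 dimensions each (toy values).
tokens = [[float(i + j) for j in range(4)] for i in range(16)]

cls_features = first_token_pool(tokens)  # uses 1 of the 16 tokens
avg_features = mean_pool(tokens)         # uses all 16 tokens
```

Either result has the same shape, so the classification head does not need to change when switching between the two.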
In 228, why do you use only the first token for classification?