Excuse me, in your code the sentence embedding is computed by averaging the word embeddings over the word_num dimension. However, in BERT the encoding corresponding to the '[CLS]' token is meant to represent the whole sentence. Why not use that as the sentence embedding? Which one performs better?
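
For context, a minimal sketch of the two pooling strategies, assuming a HuggingFace `BertModel` (the checkpoint name and variable names here are illustrative, not taken from this repo):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("An example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
hidden = outputs.last_hidden_state  # shape: (batch, word_num, hidden_size)

# Strategy 1: mean-pool over the word_num dimension, masking out padding
mask = inputs["attention_mask"].unsqueeze(-1)           # (batch, word_num, 1)
mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, hidden_size)

# Strategy 2: take the '[CLS]' token encoding (always at position 0)
cls_emb = hidden[:, 0]                                   # (batch, hidden_size)
```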