Closed steveguang closed 3 years ago
Hi @steveguang, sorry for the delay. So, if you have a sequence of word embeddings and want to compute a sentence embedding, there are a few things you can do.
Generally, I would advise using the simple averaging strategy, as it costs nothing and is usually good enough. But if you have enough training data, you may try adding LSTM or CNN layers on top, keeping in mind that these would need to be trained from scratch.
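The averaging strategy can be sketched in a few lines; here is a minimal numpy version (the function name `mean_pool` is just illustrative). Note that padding positions should be excluded via the attention mask, otherwise pad tokens drag the average toward zero:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Masked average of token embeddings -> one sentence vector.

    token_embeddings: (seq_len, dim) array of per-token vectors
    attention_mask:   (seq_len,) array with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(float)      # (seq_len, 1), broadcasts over dim
    summed = (token_embeddings * mask).sum(axis=0)    # sum only the real tokens
    count = np.maximum(mask.sum(), 1e-9)              # avoid division by zero
    return summed / count

# Toy usage: 3 tokens of dimension 2, last position is padding
emb = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
sentence_vec = mean_pool(emb, np.array([1, 1, 0]))
```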
However, since you are using a variant of BERT, if you format your input as [CLS] token_1 token_2 ... [SEP],
then you can use the pooler layer's output (which transforms the final embedding of the [CLS] token) as a feature vector for your entire input sequence.
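For intuition, the BERT pooler is just a dense layer plus tanh applied to the final hidden state of the [CLS] token; a toy numpy sketch of that transformation (weights here are illustrative stand-ins for the trained pooler weights):

```python
import numpy as np

def pooler(cls_embedding, W, b):
    """BERT-style pooler: dense projection + tanh over the [CLS] hidden state.

    cls_embedding: (dim,) final hidden state of the [CLS] token
    W, b:          (dim, dim) weight matrix and (dim,) bias of the pooler layer
    """
    return np.tanh(W @ cls_embedding + b)

# Toy usage with random (untrained) weights, dim = 4
rng = np.random.default_rng(0)
dim = 4
feature_vec = pooler(rng.normal(size=dim), rng.normal(size=(dim, dim)), np.zeros(dim))
```

In the Transformers library this is what you get back as `pooler_output` from a `BertModel` forward pass, so in practice you would read it off the model outputs rather than recompute it.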
My general intuition would be that using the pooler output would work better for classification tasks and using the average of the token embeddings would work better for similarity tasks. But you'll need to check and see 😊
Hi @helboukkouri, from the example I get embeddings for each token. I would like to get a representation of a whole sentence for downstream tasks such as sentence similarity, classification, etc. My idea is to feed the word embeddings into an LSTM layer to represent the whole sentence. Do you think this is the right way? Thanks!
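For reference, the LSTM-over-embeddings idea amounts to running a recurrent cell across the token vectors and keeping the final hidden state as the sentence vector. Below is a from-scratch numpy sketch of a single LSTM layer (parameter names and shapes are illustrative; in practice you would use a trained `torch.nn.LSTM` rather than random weights):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_sentence_embedding(token_embeddings, params):
    """Run one LSTM layer over token embeddings; return the last hidden state.

    token_embeddings: (seq_len, dim) array of per-token vectors
    params: dict with input weights W*, recurrent weights U*, biases b*
            for the input (i), forget (f), output (o) and candidate (g) gates
    """
    hidden = params["Wi"].shape[0]
    h = np.zeros(hidden)  # hidden state
    c = np.zeros(hidden)  # cell state
    for x in token_embeddings:
        i = sigmoid(params["Wi"] @ x + params["Ui"] @ h + params["bi"])  # input gate
        f = sigmoid(params["Wf"] @ x + params["Uf"] @ h + params["bf"])  # forget gate
        o = sigmoid(params["Wo"] @ x + params["Uo"] @ h + params["bo"])  # output gate
        g = np.tanh(params["Wg"] @ x + params["Ug"] @ h + params["bg"])  # candidate cell
        c = f * c + i * g
        h = o * np.tanh(c)
    return h  # fixed-size sentence representation

# Random (untrained) parameters, just to show the shapes involved
rng = np.random.default_rng(0)
dim, hidden = 8, 4
params = {}
for gate in "ifog":
    params[f"W{gate}"] = rng.normal(size=(hidden, dim)) * 0.1
    params[f"U{gate}"] = rng.normal(size=(hidden, hidden)) * 0.1
    params[f"b{gate}"] = np.zeros(hidden)
sentence_vec = lstm_sentence_embedding(rng.normal(size=(5, dim)), params)
```

As the reply above notes, this only pays off when there is enough labeled data to train the recurrent weights; with little data, plain averaging is the safer baseline.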