PedroMLF opened this issue 6 years ago
Note, without fine-tuning, [CLS] is not a good representation of the sentence. Please check out my repo: https://github.com/hanxiao/bert-as-service which offers a fast and scalable way to extract sentence features.
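For reference, a minimal sketch of client-side usage of bert-as-service (assuming a server has been started separately; the model path and example sentences below are placeholders):

```python
# Client-side usage of bert-as-service; assumes the server was started
# separately, e.g.:
#   bert-serving-start -model_dir /path/to/uncased_L-24_H-1024_A-16 -num_worker 2
# By default the server returns pooled sentence vectors (mean over a late
# layer) rather than the raw [CLS] vector.
from bert_serving.client import BertClient

bc = BertClient()  # connects to a local server on the default ports
vecs = bc.encode(['A man is playing a guitar.',
                  'Someone is playing an instrument.'])
print(vecs.shape)  # (2, 1024) for BERT-Large
```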
Ok, I'll look into it. Thanks!
Edit: Even though using your suggested representation helped, the pre-trained BERT model used out of the box still ended up not outperforming the other approaches (it trailed them by a significant margin).
@PedroMLF can you share the scores you got for STS 2012? Did BERT perform better than InferSent or Google USE for any particular choice?
These are the scores I obtained using SentEval:
In my experiments, BERT did not outperform any of the approaches shown above on any of STS12/13/14/15/16. Results shown below.
Thanks @PedroMLF - I checked again as well, using mean pooling over layer -2 and the mean of layers -2,-3,-4,-5, and I get results similar to yours.
@jacobdevlin-google any ideas why we are seeing such low numbers? Would you have expected the performance to be significantly better?
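For anyone trying to reproduce the mean-pooling variants mentioned above, here is a rough sketch of computing them from the JSON produced by `extract_features.py` (field names follow the reference script's output format; the file path is a placeholder, and the script must have been run with the corresponding `--layers` setting):

```python
import json
import numpy as np

def sentence_vectors(path, layer_indices=(-2,)):
    """Mean-pool token vectors from extract_features.py JSON output.

    Averages the requested layers per token, then averages over tokens,
    giving one vector per input sentence.
    """
    vectors = []
    with open(path) as f:
        for line in f:
            sent = json.loads(line)
            token_vecs = []
            for feat in sent["features"]:
                layers = {l["index"]: np.array(l["values"]) for l in feat["layers"]}
                token_vecs.append(np.mean([layers[i] for i in layer_indices], axis=0))
            vectors.append(np.mean(token_vecs, axis=0))
    return np.vstack(vectors)

# Mean pooling of layer -2, and of layers -2..-5
# (requires extract_features.py to have been run with --layers=-2,-3,-4,-5).
emb_l2 = sentence_vectors("output.jsonl", layer_indices=(-2,))
emb_l2345 = sentence_vectors("output.jsonl", layer_indices=(-2, -3, -4, -5))
```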
@PedroMLF Hi, have you found the reason why using BERT embeddings gives poor performance? I also used BERT sentence embeddings for a binary classification task, and the performance is significantly lower than with other approaches.
Hi!
I've been running some experiments with sentence embeddings, using SentEval to obtain results on several tasks. In particular, I've been using the STS 2012 task. I'm opening this issue because using BERT yields low scores:
ALL (weighted average): Pearson: 0.2364, Spearman: 0.3241
ALL (average): Pearson: 0.2863, Spearman: 0.3503
(For comparison, both InferSent and the Google Universal Sentence Encoder yield between 0.60 and 0.65 on these metrics.)
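For context, scores like the ones above come out of SentEval's evaluation engine; a minimal sketch of how STS12 is typically run (the `task_path` and the trivial placeholder batcher below are assumptions, not the setup used here):

```python
import numpy as np
import senteval

def prepare(params, samples):
    # Nothing to pre-compute when using a fixed, pre-trained encoder.
    return

def batcher(params, batch):
    # SentEval passes a list of tokenized sentences and expects back an
    # (n_sentences, dim) matrix of sentence embeddings.
    sentences = [' '.join(tokens) if tokens else '.' for tokens in batch]
    # Placeholder: plug the actual encoder (BERT, InferSent, USE, ...) in here.
    return np.zeros((len(sentences), 1024))

params = {'task_path': 'SentEval/data', 'usepytorch': False, 'kfold': 10}
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(['STS12'])
print(results['STS12'])
```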
My approach:
I'm using `extract_features.py` to obtain the layer values for the top layer (the one denoted -1). Then, I use the vector obtained for the `[CLS]` token as the sentence embedding (following what is said in the paper, namely "In order to obtain a fixed-dimensional pooled representation of the input sequence, we take the final hidden state (i.e., the output of the Transformer) for the first token in the input, which by construction corresponds to the special [CLS] word embedding."). I'm using the BERT-Large Uncased model and thus lowercase all the sentences in the `batcher` function of SentEval. The corresponding code is shown below; the goal of this function is simply to return a matrix with the sentence embeddings for every sentence.
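(The original snippet is not reproduced here; the following is a minimal sketch of a batcher along the lines described above. The temp-file names and the `run_bert.sh` command-line interface are assumptions.)

```python
# Sketch of a SentEval batcher that uses the top-layer [CLS] vector from
# extract_features.py as the sentence embedding.
import json
import subprocess
import numpy as np

def batcher(params, batch):
    # SentEval hands over a list of tokenized sentences; BERT-Large Uncased
    # expects lowercased text, so lowercase before writing to disk.
    sentences = [' '.join(tokens).lower() if tokens else '.' for tokens in batch]
    with open('input.txt', 'w') as f:
        f.write('\n'.join(sentences))

    # run_bert.sh is assumed to wrap extract_features.py, e.g.:
    #   python extract_features.py --input_file=input.txt --output_file=output.jsonl \
    #     --vocab_file=$BERT_DIR/vocab.txt --bert_config_file=$BERT_DIR/bert_config.json \
    #     --init_checkpoint=$BERT_DIR/bert_model.ckpt --layers=-1
    subprocess.run(['bash', 'run_bert.sh', 'input.txt', 'output.jsonl'], check=True)

    embeddings = []
    with open('output.jsonl') as f:
        for line in f:
            sent = json.loads(line)
            # The first feature is the [CLS] token; take its layer -1 vector.
            cls_layers = sent['features'][0]['layers']
            embeddings.append(np.array(cls_layers[0]['values']))
    return np.vstack(embeddings)
```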
The `run_bert.sh` script is just a way of easily calling the `extract_features.py` script. I would like to know if I'm making some logical mistake and not using BERT as intended, or if anyone can give me an intuition on why the scores might be so low. Thanks in advance.