google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Obtaining low scores in the STS 2012 task #128

Open PedroMLF opened 6 years ago

PedroMLF commented 6 years ago

Hi!

I've been running some experiments with sentence embeddings, using SentEval to obtain results on several tasks; in particular, the STS 2012 task. I'm opening this issue because BERT is yielding low scores:

ALL (weighted average): Pearson: 0.2364, Spearman: 0.3241
ALL (average): Pearson: 0.2863, Spearman: 0.3503

(For comparison, both InferSent and the Google Universal Sentence Encoder yield between 0.60 and 0.65 on all of these metrics.)


My approach:

I'm using extract_features.py to obtain the activations of the top layer (the one denoted -1). Then I use the vector obtained for the [CLS] token as the sentence embedding, following what's said in the paper, namely: "In order to obtain a fixed-dimensional pooled representation of the input sequence, we take the final hidden state (i.e., the output of the Transformer) for the first token in the input, which by construction corresponds to the special [CLS] word embedding." I'm using the BERT-Large Uncased model, so I lowercase all the sentences in the batcher function of SentEval. The corresponding code is:

import json
import subprocess
import time

import numpy as np


def batcher(params, batch):

    # Translate empty sentences into something else ([.])
    batch = [sent if sent != [] else ['.'] for sent in batch]

    # Lowercase (the model is uncased) and join tokens into sentences
    batch_sents = [" ".join([w.lower() for w in sent]) for sent in batch]

    # Write one sentence per line as input for extract_features.py
    with open("temp_bert_in.txt", 'w') as f_in:
        for line in batch_sents:
            f_in.write(line + "\n")

    init_time = time.time()
    subprocess.call(['bash', 'run_bert.sh'])
    print("Creating sent embeds took {:.2f} s".format(time.time() - init_time))

    # Parse the output json with the BERT pre-trained embeddings
    with open("temp_bert_out.json", "r") as f_out:
        json_list = [json.loads(line) for line in f_out]

    # Create the embedding matrix, one row per sentence
    embed_dim = len(json_list[0]["features"][0]["layers"][0]["values"])
    embeddings = np.zeros((len(batch), embed_dim))

    # features[0] is the [CLS] token; layers[0] is the requested top layer (-1)
    for ix, sentence_json in enumerate(json_list):
        cls_emb = sentence_json["features"][0]["layers"][0]["values"]
        embeddings[ix] = cls_emb

    return embeddings.astype('float32')

The goal of this function is simply to return a matrix with one sentence embedding per sentence in the batch. The run_bert.sh script is just a thin wrapper around extract_features.py, as sketched below.
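For reference, a minimal sketch of what such a wrapper looks like — the flags are the ones extract_features.py in this repo accepts, the input/output file names match the batcher above, and $BERT_LARGE_DIR (the directory holding the uncased BERT-Large checkpoint) is a placeholder for my local path:

# run_bert.sh -- thin wrapper around extract_features.py
# $BERT_LARGE_DIR is assumed to contain vocab.txt, bert_config.json,
# and bert_model.ckpt for the uncased BERT-Large model.
python extract_features.py \
  --input_file=temp_bert_in.txt \
  --output_file=temp_bert_out.json \
  --vocab_file=$BERT_LARGE_DIR/vocab.txt \
  --bert_config_file=$BERT_LARGE_DIR/bert_config.json \
  --init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
  --layers=-1 \
  --max_seq_length=128 \
  --batch_size=8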


I would like to know if I'm making a logical mistake and not using BERT as intended, or if anyone can give me an intuition for why the scores might be so low. Thanks in advance.

hanxiao commented 5 years ago

Note that without fine-tuning, [CLS] is not a good representation of the sentence. Please check out my repo, https://github.com/hanxiao/bert-as-service, which offers a fast and scalable way to extract sentence features.
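By default, bert-as-service mean-pools the token vectors of the second-to-last layer (REDUCE_MEAN over layer -2) rather than taking [CLS]. A minimal sketch of that pooling, assuming the same extract_features.py JSON format as in the batcher above and a run with --layers=-2 (the mean_pool helper name is just for illustration):

import numpy as np

def mean_pool(sentence_json):
    # Average the layer -2 vectors over real tokens; the special
    # [CLS]/[SEP] tokens are masked out before averaging.
    token_vecs = np.array([feat["layers"][0]["values"]
                           for feat in sentence_json["features"]
                           if feat["token"] not in ("[CLS]", "[SEP]")])
    return token_vecs.mean(axis=0)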

PedroMLF commented 5 years ago

Ok, I'll look into it. Thanks!

Edit: Even though your suggested representation helped, using the pre-trained BERT model out of the box still failed to outperform the other approaches (by a significant margin).

mvss80 commented 5 years ago

@PedroMLF can you share the scores you got for STS 2012? Did BERT perform better than InferSent or Google USE for any particular choice?

PedroMLF commented 5 years ago

These are the scores I obtained using SentEval:

[screenshot: table of SentEval scores]

In my experiments, BERT did not outperform any of the approaches shown above across STS12/13/14/15/16. Results are shown below.

[screenshot: table of STS12-16 results per approach]

mvss80 commented 5 years ago

Thanks @PedroMLF - I checked again to confirm, using mean pooling over layer -2 as well as means of layers -2,-3,-4,-5, and I get results similar to yours. A sketch of what I computed is below.
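Concretely, a sketch of the pooling I tried, assuming extract_features.py was run with --layers=-2,-3,-4,-5 so that layers[0] through layers[3] hold those layers (the pooled_embedding helper name is just for illustration):

import numpy as np

def pooled_embedding(sentence_json, layer_ixs=(0, 1, 2, 3)):
    # Mean over the requested layers for each token, then mean over tokens.
    # Pass layer_ixs=(0,) for plain mean pooling of layer -2.
    token_vecs = []
    for feat in sentence_json["features"]:
        layer_vals = [feat["layers"][i]["values"] for i in layer_ixs]
        token_vecs.append(np.mean(layer_vals, axis=0))
    return np.mean(token_vecs, axis=0)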

@jacobdevlin-google any ideas why we are seeing such low numbers? Would you have expected the performance to be significantly better?

xinsu626 commented 5 years ago

@PedroMLF Hi, have you found the reason why using BERT embeddings gives poor performance? I also used BERT sentence embeddings for a binary classification task, and the performance is significantly lower than other approaches.