jina-ai / clip-as-service

🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
https://clip-as-service.jina.ai

Conceptual question about similarity ranking on example8.py #405

Open boxabirds opened 5 years ago

boxabirds commented 5 years ago

Hi, I have this demo running really well and I want to thank @hanxiao so much for creating this marvellous project. My question is more conceptual and concerns the scoring method employed, so it's really a Python / NumPy / vector-arithmetic question; apologies for asking such a basic thing here. The real question is: with sentence similarity, what's the best way to detect that:

  1. the query sentence is not similar to anything in the model?
  2. the query sentence is an exact match?

The algorithm used in example8.py is:

np.sum(query_vec * doc_vecs, axis=1) / np.linalg.norm(doc_vecs, axis=1)   

I'm trying to figure out how to interpret this calculation with respect to the two objectives above. Thoughts most welcome!
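
To make the formula above concrete, here is a minimal sketch of my own (random vectors standing in for the real embeddings): the example8.py score is the dot product normalised only by the document norms, i.e. cosine similarity scaled by the constant query norm, so the ranking order is the same but the values are not bounded by 1.

    import numpy as np

    # Toy stand-ins for the real embeddings (assumed shapes: 1-D query, 2-D docs).
    rng = np.random.default_rng(0)
    query_vec = rng.normal(size=512)
    doc_vecs = rng.normal(size=(5, 512))

    # Score as used in example8.py: dot product divided by the document norm only.
    example8_score = np.sum(query_vec * doc_vecs, axis=1) / np.linalg.norm(doc_vecs, axis=1)

    # Full cosine similarity: also divide by the query norm.
    cosine = np.sum(query_vec * doc_vecs, axis=1) / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )

    # They differ only by the (constant) query norm, so the ranking order is identical.
    assert np.allclose(example8_score, cosine * np.linalg.norm(query_vec))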

I tried it with a different approach (apologies, I'm really a n00b here and haven't yet figured out how to vectorise this algo):

    score = []
    # TODO vectorise this yeah I suck sorry
    for doc in doc_vecs:
        cosine_similarity = np.dot(query_vec, doc) / (np.linalg.norm(query_vec) * np.linalg.norm(doc))
        score.append(cosine_similarity)

This solved (2), because the result was 1.0000 in that case, but it did not solve (1). The incredibly helpful FAQ has a comment on cosine-similarity comparisons, and yes, the scores are very high all the time (though I managed to get queries as low as 0.7 by typing random emojis and such), so the question becomes whether there is a better way to detect (1), i.e. that the query does not match any results.
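
For reference, a vectorised equivalent of the loop above; just a sketch, assuming query_vec is a 1-D NumPy array and doc_vecs is a 2-D array of shape (n_docs, dim):

    import numpy as np

    # Dot product of the query with every document row, divided by the product
    # of the two norms: the same cosine similarity as the loop, computed in one shot.
    score = (doc_vecs @ query_vec) / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )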

chikubee commented 5 years ago

There is a difference in the calculation and hence in the scores. You can change the normalization in the query to np.sum(query_vec * doc_vecs, axis=1) / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)). This essentially does the same thing as what you are doing; the only difference is the normalization.

Both of your cases [1, 2] can be inferred from either formula. I do not quite understand what you mean by similar query sentences, so I am adding an attachment for your reference. Hope this helps.

(attached screenshots: check2, check3)
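
An illustrative way to turn the fully normalised scores into the two checks from the original question; the 0.8 cut-off below is an arbitrary example value, not a recommendation from this thread:

    import numpy as np

    # Cosine similarity with both norms, as in the comment above.
    scores = np.sum(query_vec * doc_vecs, axis=1) / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    best = int(np.argmax(scores))

    if np.isclose(scores[best], 1.0):
        print(f"doc {best} is an (almost) exact match")   # case (2)
    elif scores[best] < 0.8:  # illustrative threshold, tune per corpus
        print("query does not match anything well")        # case (1)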
jobergum commented 4 years ago

According to https://arxiv.org/abs/1904.07531, using cosine similarity of the sentence-embedding vectors is not great for ranking similarity. Quoting from the paper:

BERT (Rep) uses BERT to represent q and d: BERT (Rep)(q,d) = cos(q,d)

It first uses the last layers’ “[CLS]” embeddings as the query and document representations, and then calculates the ranking score via their cosine similarity (cos). Thus it is a representation-based ranker.

BERT (Rep) applies BERT on the query and document individually and discards these cross-sequence interactions, and its performance is close to random. BERT is an interaction-based matching model and is not suggested to be used as a representation model.

See also https://github.com/google-research/bert/issues/164#issuecomment-441324222
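
For concreteness, a minimal sketch of the representation-based scoring the paper calls BERT (Rep), using the Hugging Face transformers library; the model name and pooling choice here are only illustrative and are not taken from this thread:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def cls_embedding(text):
        # Encode a single sentence and take the last layer's [CLS] vector.
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            return model(**inputs).last_hidden_state[:, 0]

    q = cls_embedding("what is the capital of France")
    d = cls_embedding("Paris is the capital and most populous city of France.")

    # Representation-based ranking score: cosine of independently encoded vectors.
    score = torch.nn.functional.cosine_similarity(q, d).item()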

boxabirds commented 4 years ago

@jobergum thanks! Very insightful. Though I don't know where this leaves my thinking with respect to BERT getting SOTA on semantic textual similarity (STS) leaderboards. It's quite a claim to say it's an interaction-based model and not a representation model; what do you think about that? Every example I've seen assumes it does have some kind of representation of words and their context, and there's no temporal element I'm aware of.

Devlin's comment is really interesting; it challenges some assumptions I was personally making about the utility of these embeddings, and in fact about bert-as-service too, I think:

"I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally)."

jobergum commented 4 years ago

See https://arxiv.org/abs/1910.14424 for details on how to achieve a top position on e.g. MS MARCO with BERT by using it as an interaction model, meaning you need to encode both the query and the potentially relevant text sentence/passage at the same time, not independently.
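
And a matching sketch of that interaction-style scoring, again with the Hugging Face transformers library; the checkpoint named below is just one publicly available MS MARCO cross-encoder, an example of mine rather than something used in this thread:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Cross-encoder / interaction model: query and candidate passage are fed through
    # BERT together, so the attention layers can model cross-sequence interactions.
    name = "cross-encoder/ms-marco-MiniLM-L-6-v2"   # example checkpoint (assumption)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)

    query = "what is the capital of France"
    passage = "Paris is the capital and most populous city of France."

    inputs = tokenizer(query, passage, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # This checkpoint outputs a single relevance logit per (query, passage) pair.
        relevance = model(**inputs).logits.squeeze().item()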