Hey @pommedeterresautee, I'm just a regular user, not an author of any of the amazing tools :)
Lol right, I made an error in my copy-paste! (@nreimers) Your participation in the discussion is still interesting :-)
Hi @pommedeterresautee Getting semantic search (i.e. mapping docs & queries to a dense vector and indexing those) right can be extremely challenging. You must ensure that the mapping doc/query -> vectors fits perfectly.
Also, you often run into an issue with a too-high false positive rate, i.e., semantic search retrieves mostly irrelevant matches.
Elasticsearch (BM25) on the other hand is great with a small false positive rate: if it finds a match, it is very likely relevant to the query. However, BM25 can have an issue with recall, especially when the documents and queries are short.
So in summary:
Semantic search: great recall, bad precision.
BM25: great precision, sometimes bad recall.
In order to get semantic search with good precision, you have to put in a lot of work so that you get nicely designed vectors that work well for your use case.
The approach NBoost takes is, on the other hand, much easier (and in many cases leads to more success): you increase the number of docs retrieved by BM25, then you filter them with a model.
So with this approach you get BM25 + model (the NBoost approach): great precision, better recall.
How important recall is, and how bad it gets, depends on your task. If you index sentences and you search for words that have many synonyms, and you expect all relevant sentences to be retrieved, then BM25 + model would not yield any significant improvement. It still relies on word overlap to be able to find matching docs.
Semantic search does not have this word-overlap requirement. However, there you can have a big issue with retrieving too many non-relevant documents (false positives).
In conclusion: which approach is more promising depends completely on your task and data. For typical use cases, BM25 + model is currently the easier and better approach, I would say.
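A minimal sketch of the retrieve-then-rerank idea described above, assuming the rank_bm25 package and a sentence-transformers CrossEncoder; the corpus, query and model name are placeholders, not anyone's actual setup:

```python
# Sketch: BM25 candidate retrieval followed by cross-encoder re-ranking.
# Requires `pip install rank_bm25 sentence-transformers`; corpus and query are toy data.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = [
    "Elasticsearch uses BM25 as its default ranking function.",
    "Sentence-BERT maps sentences to dense vectors for semantic search.",
    "BM25 relies on exact word overlap between query and document.",
]
query = "how does lexical ranking work"

# 1) Lexical retrieval: score all documents with BM25 and keep the top-k candidates.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(corpus)), key=lambda i: bm25_scores[i], reverse=True)[:2]

# 2) Re-rank the candidates with a cross-encoder that sees (query, doc) jointly.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = reranker.predict([(query, corpus[i]) for i in top_k])

for i, score in sorted(zip(top_k, rerank_scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {corpus[i]}")
```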
Best Nils Reimers
The task I am working on is similarity search between documents with a length between 100 and 10000 characters (1 to a few paragraphs, most of the time more than one sentence). I have pairs of semantically related documents; most of the time they have a large vocabulary gap (but not always). For each positive pair I generate a random negative pair. Positive pairs have a similarity score of 1 and negative pairs a similarity score of 0. The dataset contains 30K positive examples with very little noise (manual random check).
I tried 2 strategies: S-BERT and simple TF-IDF (no search engine, I just built my own sparse matrix and do neighbourhood search on it).
For S-BERT I use the CosineSimilarityLoss. I tried triplet loss in the past but the results were disappointing: after the very first few batches the triplet loss already produced perfect results... which makes sense as the data are easy to guess. With CosineSimilarityLoss it is harder as it's a kind of regression, but I rapidly tend to reach the max Pearson correlation (0.99). Spearman tops out at 0.86.
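For reference, a minimal sketch of a CosineSimilarityLoss setup like the one described, using the sentence-transformers training API; the base model, hyperparameters and toy pairs are placeholders, not the exact configuration used here:

```python
# Sketch: fine-tuning a bi-encoder on labelled pairs with CosineSimilarityLoss
# (sentence-transformers training API). The two examples stand in for the 30K
# positive pairs plus randomly sampled negatives described above.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

train_examples = [
    InputExample(texts=["doc A", "semantically related doc A'"], label=1.0),  # positive pair
    InputExample(texts=["doc A", "random unrelated doc"], label=0.0),         # sampled negative
]

model = SentenceTransformer("distilbert-base-nli-mean-tokens")  # placeholder base model
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# The loss regresses cosine(u, v) of the two sentence embeddings towards the 0/1 label.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```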
My finding so far is that the TF-IDF approach provides much better results most of the time. S-BERT results are very bad when doing a search using just the generated vectors (cosine distance).
I also tried TF-IDF plus re-ranking the top 100 with vectors and found that it didn't bring any improvement (qualitative appreciation, no measurement on this specific setup).
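The TF-IDF baseline (own sparse matrix plus neighbourhood search) can be sketched roughly like this with scikit-learn; the toy corpus and default vectorizer settings are assumptions, not the exact setup used here:

```python
# Sketch: TF-IDF sparse vectors plus cosine nearest-neighbour search,
# roughly the "own sparse matrix" baseline mentioned above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = ["first document ...", "second document ...", "third document ..."]

vectorizer = TfidfVectorizer()             # defaults; tune ngram_range, min_df, etc.
doc_matrix = vectorizer.fit_transform(corpus)

# Brute-force cosine search works directly on the sparse matrix.
nn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute").fit(doc_matrix)

query_vec = vectorizer.transform(["a query document"])
distances, indices = nn.kneighbors(query_vec)
print(indices[0], 1 - distances[0])        # nearest doc indices and cosine similarities
```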
Regarding measures... the TF-IDF score is lower on the test set than S-BERT's:
Any idea why? How is it possible? I am sure I'm missing something obvious... That's why I am wondering if nboost might be a good choice.
Closing the discussion here.
@pommedeterresautee Hi, I am doing the same thing with the difference being that I am using a bigger dataset and BM25+S-BERT for ranking. I am retrieving the top 1000 documents using BM25 and then using Cosine Similarity to re-rank them using S-BERT embedding vectors. I found this re-ranking approach to be better than most of the approaches I had tried earlier including pairwise and listwise learning to rank. I will be experimenting with the objective functions and noting the results, but I was interested to know if your results improved using 'nboost'? Thanks.
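A compact sketch of that embedding-based re-ranking step, assuming the BM25 candidates have already been retrieved and using sentence-transformers' util.cos_sim; the model name and documents are placeholders:

```python
# Sketch: re-ranking BM25 candidates by cosine similarity of S-BERT embeddings.
# `candidates` stands in for the top-1000 documents returned by BM25.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder bi-encoder

query = "example query"
candidates = ["candidate doc 1", "candidate doc 2", "candidate doc 3"]

query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
cand_embs = model.encode(candidates, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between the query and every candidate, sorted descending.
sims = util.cos_sim(query_emb, cand_embs)[0]
for i in sims.argsort(descending=True):
    print(f"{sims[i]:.3f}  {candidates[i]}")
```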
I finally used transformers directly and the results are very good, much better than anything I tried before. The only pain point is that it is slow: 1000 docs reranked with a 256-token limit takes 5-6 seconds on a 2080.
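For illustration, one plausible reading of "transformers directly" is a sequence-classification cross-encoder scored in batches; the checkpoint, batch size and device handling below are assumptions based on the description above, not the actual setup:

```python
# Sketch: scoring (query, document) pairs with a sequence-classification model
# loaded through transformers, truncating to 256 tokens and batching the ~1000
# BM25 candidates. Model name and batch size are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"   # placeholder reranker checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval().to(device)

query = "example query"
docs = ["doc one ...", "doc two ..."]                 # stands in for the BM25 candidates

scores = []
with torch.no_grad():
    for start in range(0, len(docs), 32):             # batch the (query, doc) pairs
        batch = docs[start:start + 32]
        enc = tokenizer([query] * len(batch), batch,
                        padding=True, truncation=True,
                        max_length=256, return_tensors="pt").to(device)
        logits = model(**enc).logits                  # one relevance score per pair
        scores.extend(logits.squeeze(-1).tolist())

ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
```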
For a speed boost take a look at ONNX Runtime. I've implemented it in nboost (although I haven't had time to update the readme). Use --model_dir onnx-bert-base-uncased-msmarco
Got a 2x+ speed improvement
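As a generic sketch only (not how nboost wires this up internally): export the reranker once with torch.onnx.export, then run it through onnxruntime, which is where this kind of speed-up typically comes from. The model name and opset version are placeholders:

```python
# Sketch: exporting a transformers reranker to ONNX and running it with ONNX Runtime.
import torch
import onnxruntime as ort
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"          # placeholder; swap in the msmarco checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False).eval()

# 1) Export once, with dynamic batch and sequence axes.
dummy = tokenizer("a query", "a document", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"], dummy["token_type_ids"]),
    "reranker.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["logits"],
    dynamic_axes={name: {0: "batch", 1: "seq"}
                  for name in ["input_ids", "attention_mask", "token_type_ids"]},
    opset_version=14,
)

# 2) Inference through ONNX Runtime (CPU here; use CUDAExecutionProvider on GPU).
session = ort.InferenceSession("reranker.onnx", providers=["CPUExecutionProvider"])
enc = tokenizer("a query", "a document", return_tensors="np")
logits = session.run(["logits"], {k: enc[k] for k in
                                  ["input_ids", "attention_mask", "token_type_ids"]})[0]
print(logits)
```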
Hi @pommedeterresautee, when you say you used transformers directly, did you calculate the embeddings of all your documents, index them and then search? Or did you use S-BERT/nboost only to re-rank the results returned by BM25?
@pommedeterresautee What was your final setup? BertForSequenceClassification?
Hi,
I have read with great interest the discussion between @realsergii (one of the authors of https://github.com/UKPLab/sentence-transformers) and @pertschuk (author of nboost) here.
I gather that the S-BERT task is harder because:
(from https://arxiv.org/pdf/1908.10084.pdf)
What I want to know is: how big is the difference?
My understanding is that @pertschuk ran quite a lot of tests before starting this project, and I am wondering whether we are talking about a 5 / 10 / 20 / more point difference in a relevance measure (for instance)?
Thank you for all the info you can bring.
Kind regards, Michael