sebastian-hofstaetter / matchmaker

Training & evaluation library for text-based neural re-ranking and dense retrieval models built with PyTorch
https://neural-ir-explorer.ec.tuwien.ac.at/
Apache License 2.0

Pointers? #1

Closed: pommedeterresautee closed this issue 4 years ago

pommedeterresautee commented 5 years ago

Hi,

Just discovered your code. Seems super interesting, in particular the TK model.

I am wondering: is there a paper on the TK model? Maybe some scores and info on speed? (I have found nothing on the MS MARCO leaderboard.)

My main question: is it much more powerful than CKNRM, and light enough to be usable in a real scenario (i.e., not taking minutes to re-rank candidates)?

sebastian-hofstaetter commented 5 years ago

Hi, thank you for the quick interest!

I will present a tech report at TREC next week and make it available then (including updating the readme of this repo).

As a teaser: TK is 50 times faster than BERT-base and substantially better than CONV-KNRM across multiple collections. And TK is analyzable by design.

Best, Sebastian

pommedeterresautee commented 5 years ago

Great teaser indeed! Can't wait to check it out. Just one last question before waiting a week for more info: 50 times faster than BERT still means 1-2 seconds per query, right?

Just a piece of advice, FWIW: CONV-KNRM is supposed to be on par with BERT on ad hoc search (a task different by nature from MS MARCO, where contextual understanding of words makes sense). Maybe in the future you want to add a benchmark on ad hoc search too. See here for more info on BERT vs. CKNRM: https://arxiv.org/pdf/1904.07531.pdf

pommedeterresautee commented 5 years ago

Hi, just letting you know that I am very impatient: I tried TK v1 on my ad hoc search logs (I work for a legal publisher). Results are similar to those I get from CKNRM (almost the same score). I trained for 2 epochs (no further improvements). I have seen in your config file that you use TK v6, so I will wait to check that one :-) (the code of TK v6 is not yet available in the repo). However, inference performance on a recent GPU is much better than what I was expecting.

sebastian-hofstaetter commented 5 years ago

That's very interesting - are the neural models better than BM25? In general there are many things we can do to tune the models; for example, how long are the documents/passages you are using?

pommedeterresautee commented 5 years ago

Yep, they are both 10 points better than a simple BM25 on P@5. However, I re-rank only the top 20 results (1 page of the SERP), and we think our in-production system is far from optimized (lots of Solr boosters on some words that are not needed, etc.). So the BM25 score here is BM25 applied to the 20 docs of each SERP (it gives better results than our prod system). I can also say that other re-rankers don't reach those results; they are not even on par with BM25. I train on raw clicks. I added a small modification to both models to model position bias. It gives a little boost to both models but is not needed on MS MARCO (AFAIK it's already debiased by a click model).
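For illustration, a minimal sketch of what such a position-bias term could look like in PyTorch (the `base_ranker` call signature, `PositionDebiasedRanker` name, and `max_rank` value are assumptions for this sketch, not the repo's actual interface):

```python
import torch
import torch.nn as nn

class PositionDebiasedRanker(nn.Module):
    """Hypothetical sketch: wrap any re-ranker (e.g. TK or CONV-KNRM) and add a
    learned per-rank bias during training on raw clicks, so the relevance score
    is not forced to also explain position bias. Inference uses only relevance."""

    def __init__(self, base_ranker: nn.Module, max_rank: int = 20):
        super().__init__()
        self.base_ranker = base_ranker
        # one learnable bias per SERP position (rank 0 .. max_rank-1)
        self.position_bias = nn.Parameter(torch.zeros(max_rank))

    def forward(self, query, document, rank=None):
        relevance = self.base_ranker(query, document)  # shape: (batch,)
        if self.training and rank is not None:
            # during click-based training, add the bias of the position
            # the document was shown at; drop it at inference time
            return relevance + self.position_bias[rank]
        return relevance
```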

I tried different snippet sizes. Best results with 5 words to the left and 5 to the right of each matching word. Scores are slightly lower for other values, and very low when using only the matching words (important to check that the expected behavior happens). The full title is always used (I use 2 text fields; keeping the signals separate gives the best performance, which of course required a small change to your model).
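As a rough illustration, a minimal sketch of that snippet construction (plain whitespace tokenization; `build_snippet` is a hypothetical helper, not part of the repo):

```python
def build_snippet(query: str, doc: str, window: int = 5) -> str:
    """Keep a +/- `window`-word context around every document word that
    matches a query term, and join the resulting spans."""
    q_terms = {t.lower() for t in query.split()}
    tokens = doc.split()
    keep = set()
    for i, tok in enumerate(tokens):
        if tok.lower() in q_terms:
            keep.update(range(max(0, i - window), min(len(tokens), i + window + 1)))
    return " ".join(tokens[i] for i in sorted(keep))
```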

sebastian-hofstaetter commented 4 years ago

Hi, sorry for answering so late. I have now uploaded our TREC tech report to arXiv (https://arxiv.org/abs/1912.01385). I would try a bigger re-ranking depth than 20 (in core_metrics.py there are methods to automatically evaluate all possible re-ranking depths at once with numpy), and maybe you could also try full documents instead of snippets (there is a tk_v2 in tk.py which uses windowed kernel pooling for longer document inputs; it did pretty well in the TREC document ranking task).
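For reference, a minimal sketch of sweeping re-ranking depths (this is not the actual core_metrics.py API; the function name and its arguments are hypothetical): for each depth d, the top-d BM25 candidates are re-ordered by the neural score, the rest keep their BM25 order, and P@k is reported.

```python
import numpy as np

def precision_at_k_per_depth(bm25_order, neural_scores, relevance,
                             k=5, depths=(10, 20, 50, 100)):
    """Hypothetical sketch: bm25_order is a list of doc ids in BM25 order,
    neural_scores maps doc id -> neural score, relevance maps doc id -> 0/1."""
    results = {}
    for d in depths:
        # re-rank only the top-d candidates, keep the tail in BM25 order
        top = sorted(bm25_order[:d], key=lambda doc: neural_scores[doc], reverse=True)
        reranked = top + bm25_order[d:]
        labels = np.array([relevance.get(doc, 0) for doc in reranked[:k]])
        results[d] = labels.mean()
    return results
```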