Open RDoting opened 8 months ago
From the slides:
Practical tips for creating an LTR pipeline
Pairwise ranking is empirically more robust and efficient
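To make the pairwise idea from the slides concrete, here is a minimal sketch of how graded relevance labels for a single query could be turned into preference pairs that a binary classifier can learn from. The function name `pairwise_examples` and the toy features are hypothetical and not taken from the slides; real LTR libraries do this internally and differ in the details.

```python
from itertools import combinations


def pairwise_examples(features, labels):
    """Turn per-document features and graded labels for ONE query
    into (feature difference, preference) training pairs.

    A target of 1 means the first document of the pair should rank higher,
    0 means the second one should. Pairs with equal labels carry no
    preference and are skipped.
    """
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue
        diff = [a - b for a, b in zip(features[i], features[j])]
        target = 1 if labels[i] > labels[j] else 0
        pairs.append((diff, target))
    return pairs


# Toy data (assumed): three documents, two features each, graded labels.
toy_features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
toy_labels = [2, 1, 1]
toy_pairs = pairwise_examples(toy_features, toy_labels)
```

Pairs are only built within one query, since relevance labels from different queries are not comparable.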
Given a query q and a collection D of documents that match the query, the problem is to rank, that is, sort, the documents in D according to some criterion so that the "best" results appear early in the result list displayed to the user.
Calculating a score for each document
First, a small number of potentially relevant documents are identified using simpler retrieval models which permit fast query evaluation, such as the vector space model, Boolean model, weighted AND,[6] or BM25. This phase is called top-k document retrieval and many heuristics were proposed in the literature to accelerate it, such as using a document's static quality score and tiered indexes.[7] In the second phase, a more accurate but computationally expensive machine-learned model is used to re-rank these documents.
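The two phases described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the BM25 variant, the whitespace tokenizer, the parameter defaults (`k1=1.2`, `b=0.75`) and the `toy_reranker` stand-in for the learned model are all my own choices, not PyTerrier's implementation or the setup used in this issue.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score every tokenized document against the query with a basic BM25."""
    n_docs = len(docs_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs_tokens) / n_docs or 1.0
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def first_stage(query, docs, k):
    """Phase 1: cheap top-k retrieval; returns indices of matching documents."""
    q = query.lower().split()
    toks = [d.lower().split() for d in docs]
    scores = bm25_scores(q, toks)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:k] if scores[i] > 0]


def rerank(query, docs, candidates, expensive_score):
    """Phase 2: re-order only the candidates with a costlier scoring function."""
    return sorted(candidates,
                  key=lambda i: expensive_score(query, docs[i]),
                  reverse=True)


def toy_reranker(query, doc):
    """Hypothetical stand-in for a learned model: Jaccard token overlap."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)


docs = [
    "bm25 is a ranking function used by search engines",
    "learning to rank uses machine learning for ranking",
    "cats sleep most of the day",
    "a ranking model can re-rank bm25 candidates",
]
candidates = first_stage("bm25 ranking", docs, k=3)
final_order = rerank("bm25 ranking", docs, candidates, toy_reranker)
```

The value of `k` trades recall against re-ranking cost: the second-phase model only ever sees the `k` candidates the first phase returns, so a relevant document missed there cannot be recovered.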
By default, PyTerrier is configured for indexing and retrieval in English. See our notebook (colab) for details on how to configure PyTerrier in other languages.
Maybe look at the BM25 score on certain languages and check whether it changes once PyTerrier is configured for that language.
Compare:
Original MIRACL paper uses k=1000
Depends on
2
3
4