philippedeb / IN4325-project-corelR-6

IN4325 Group Project - corelR 6

https://brightspace.tudelft.nl/d2l/home/596319

Implement learning to rank #5

Open RDoting opened 8 months ago

RDoting commented 8 months ago

Depends on #2, #3 and #4

Retrieve with BM25 and cut off after the first 100 results
Implement a neural network to learn to rank

levichy commented 8 months ago

https://pyterrier.readthedocs.io/en/latest/ltr.html

philippedeb commented 8 months ago

http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html

philippedeb commented 8 months ago

https://pyterrier.readthedocs.io/en/latest/terrier-retrieval.html

RDoting commented 8 months ago

From the slides:

Practical tips for creating a LTR pipeline

Download an IR dataset. You can use a library to do this for you, for instance ir-datasets.
Calculate the model scores φ(q,d) for your IR collection (remember, these will be your input X). Use a library for this step, like PyTerrier.
Use another library to train the L2R model (the machine learning model), like sklearn for pointwise learning and RankLib for pairwise and listwise learning. Split the data into at least train and test sets.
Use the trained model to predict on the test set. Calculate IR evaluation metrics with a library like PyTerrier or pytrec_eval.
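
The four steps above can be sketched end to end. This is a minimal, dependency-free illustration and not the tooling the slides recommend: the rows in `ROWS` are made-up toy values standing in for features from PyTerrier and labels from qrels, and a hand-written least-squares fit stands in for an sklearn pointwise model. All function names are hypothetical.

```python
# Steps 1-2 stand-in: hypothetical toy rows of (query_id, doc_id, features phi(q, d), label).
ROWS = [
    ("q1", "d1", [2.1, 0.9], 1), ("q1", "d2", [0.4, 0.1], 0), ("q1", "d3", [1.5, 0.7], 1),
    ("q2", "d4", [0.3, 0.2], 0), ("q2", "d5", [1.9, 0.8], 1), ("q2", "d6", [0.6, 0.3], 0),
    ("q3", "d7", [1.8, 0.6], 1), ("q3", "d8", [0.2, 0.1], 0), ("q3", "d9", [0.9, 0.2], 0),
]

def split_by_query(rows, test_queries):
    """Split on query id so that no query appears in both train and test."""
    train = [r for r in rows if r[0] not in test_queries]
    test = [r for r in rows if r[0] in test_queries]
    return train, test

def predict(w, b, x):
    """Linear score w.x + b for one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def fit_pointwise(rows, epochs=500, lr=0.05):
    """Step 3: pointwise LTR, fitting w and b by SGD on the squared error to the label."""
    w = [0.0] * len(rows[0][2])
    b = 0.0
    for _ in range(epochs):
        for _, _, x, y in rows:
            err = predict(w, b, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def rank_by_query(rows, w, b):
    """Step 4: score each document and sort per query, best first."""
    scored = {}
    for qid, doc_id, x, _ in rows:
        scored.setdefault(qid, []).append((predict(w, b, x), doc_id))
    return {qid: [d for _, d in sorted(s, reverse=True)] for qid, s in scored.items()}

train_rows, test_rows = split_by_query(ROWS, {"q3"})
w, b = fit_pointwise(train_rows)
rankings = rank_by_query(test_rows, w, b)

# A very small evaluation: precision@1 averaged over the test queries.
labels = {doc_id: y for _, doc_id, _, y in test_rows}
p_at_1 = sum(labels[r[0]] for r in rankings.values()) / len(rankings)
print(rankings["q3"], p_at_1)
```

Splitting by query id rather than by row keeps documents of the same query out of both sets at once, which would otherwise leak information into the test score.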

RDoting commented 8 months ago

Pairwise ranking is empirically more robust and efficient
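
For context, a pairwise approach trains on pairs of documents from the same query instead of on single documents. A minimal sketch of how such pairs could be built from labelled feature vectors, using made-up toy values and a hypothetical helper name:

```python
from itertools import combinations

def pairwise_examples(docs):
    """Turn one query's labelled documents into pairwise training examples.

    docs: list of (features, relevance_label) for a single query.
    Returns (feature_difference, target) tuples, where target is +1 if the
    first document is more relevant and -1 otherwise. Pairs with equal
    labels express no preference and are skipped.
    """
    examples = []
    for (xa, ya), (xb, yb) in combinations(docs, 2):
        if ya == yb:
            continue
        diff = [a - b for a, b in zip(xa, xb)]
        examples.append((diff, 1 if ya > yb else -1))
    return examples

# Toy data for one query: (features, graded relevance).
query_docs = [([1.0, 0.25], 1), ([2.0, 0.5], 2), ([0.5, 0.0], 0), ([0.25, 0.5], 0)]
pairs = pairwise_examples(query_docs)
for diff, target in pairs:
    print(diff, target)
```

A binary classifier trained on these difference vectors learns which of two documents should rank higher, which is the idea behind pairwise methods such as RankNet or RankSVM.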

RDoting commented 8 months ago

Given a query q and a collection D of documents that match the query, the problem is to rank, that is, sort, the documents in D according to some criterion so that the "best" results appear early in the result list displayed to the user.

Calculating a score for each document

RDoting commented 8 months ago

First, a small number of potentially relevant documents are identified using simpler retrieval models which permit fast query evaluation, such as the vector space model, boolean model, weighted AND,[6] or BM25. This phase is called top-k document retrieval and many heuristics were proposed in the literature to accelerate it, such as using a document's static quality score and tiered indexes.[7] In the second phase, a more accurate but computationally expensive machine-learned model is used to re-rank these documents.
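
The two phases can be illustrated with a small sketch. It assumes a toy in-memory collection and a hand-written BM25 (in the project PyTerrier would do this step); `second_stage_scores` is a hypothetical stand-in for the output of a learned model, not a real one.

```python
import math
from collections import Counter

def bm25_top_k(query, docs, k=100, k1=1.2, b=0.75):
    """Phase 1: score every document with BM25 and keep the k best.

    docs maps doc_id -> list of tokens; query is a list of tokens.
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]

def rerank(candidates, score_fn):
    """Phase 2: re-order only the candidates with a more expensive scorer."""
    return sorted(candidates, key=score_fn, reverse=True)

docs = {
    "d1": "neural ranking with bm25".split(),
    "d2": "cooking pasta at home".split(),
    "d3": "learning to rank documents".split(),
    "d4": "bm25 ranking baseline".split(),
}
candidates = bm25_top_k(["bm25", "ranking"], docs, k=2)

# Hypothetical scores a learned re-ranker might assign to the candidates.
second_stage_scores = {"d1": 0.9, "d4": 0.4}
reranked = rerank(candidates, lambda d: second_stage_scores.get(d, 0.0))
print(candidates, reranked)
```

The expensive scorer only ever sees `k` documents, so `k` trades first-stage recall against second-stage cost.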

RDoting commented 8 months ago

By default, PyTerrier is configured for indexing and retrieval in English. See our notebook (colab) for details on how to configure PyTerrier in other languages.

Maybe look at the BM25 scores for certain languages, and check whether they change once PyTerrier is configured for that language

RDoting commented 8 months ago

https://colab.research.google.com/github/terrier-org/pyterrier/blob/master/examples/notebooks/non_en_retrieval.ipynb

RDoting commented 8 months ago

https://github.com/wis-delft/in4325-information-retrieval

RDoting commented 8 months ago

https://colab.research.google.com/github/terrier-org/pyterrier/blob/master/examples/notebooks/ltr.ipynb#scrollTo=5gCHuDiJMNJZ

RDoting commented 8 months ago

https://cli.github.com/manual/gh_repo_sync

RDoting commented 8 months ago

https://pyterrier.readthedocs.io/en/latest/datasets.html

RDoting commented 8 months ago

https://github.com/castorini/duobert

RDoting commented 8 months ago

Compare:

Scores across the different languages
Scores of BM25 alone versus BM25 followed by ML re-ranking (learning to rank)
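
One way to make these comparisons concrete is to compute the same metric for both runs. Below is a minimal sketch of nDCG@k with linear gain. The qrels and the two rankings are made-up toy values, not results from the project, and in practice PyTerrier or pytrec_eval would compute this.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranking, qrels, k=10):
    """nDCG@k for one query; qrels maps doc_id -> relevance grade."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy judgments and two hypothetical runs for the same query.
qrels = {"d1": 1, "d2": 0, "d3": 1}
bm25_run = ["d2", "d1", "d3"]
ltr_run = ["d1", "d3", "d2"]

bm25_ndcg = ndcg_at_k(bm25_run, qrels)
ltr_ndcg = ndcg_at_k(ltr_run, qrels)
print(f"BM25 only: {bm25_ndcg:.3f}  BM25 + LTR: {ltr_ndcg:.3f}")
```

Running the same function per language and per system yields the numbers for both comparisons in the list above.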

RDoting commented 8 months ago

The original MIRACL paper uses k=1000