terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

RM3 does not add additional terms to the query for very small corpora #407

Closed: mam10eks closed this issue 1 month ago

mam10eks commented 11 months ago

Dear all,

Thank you very much for your efforts with PyTerrier; it is a really awesome framework! I am currently creating small hello-world examples to showcase some concepts for our IR course, and RM3 does not behave as I would expect in my example: it does not expand the query (maybe I used it wrong, or there are some caveats because my example is so small). This is not urgent, as the same setup works with Bo1QueryExpansion and KLQueryExpansion.

Describe the bug

RM3 does not add additional terms to the query in the hello-world example described below. E.g., for the same relevance feedback document, RM3 expands the query dog to applypipeline:off dog^0.600000024, whereas Bo1 expands it to applypipeline:off dog^2.000000000 colli^1.000000000 3^1.000000000 border^1.000000000 shepherd^1.000000000 1^1.000000000 type^1.000000000 german^1.000000000 2^1.000000000 poodl^0.805050646.

To Reproduce

I have prepared a Colab notebook that showcases the problem.

Steps to reproduce the behavior:

  1. Which index
import pandas as pd
import pyterrier as pt

documents = [
    {'docno': 'd1', 'text': 'The Golden Retriever is a Scottish breed of medium size.'},
    {'docno': 'd2', 'text': 'Intelligent types of dogs are: (1) Border Collies, (2) Poodles, and (3) German Shepherds.'},
    {'docno': 'd3', 'text': 'Poodles are a highly intelligent, energetic, and sociable.'},
    {'docno': 'd4', 'text': 'The European Shorthair is medium-sized to large cat with a well-muscled chest.'},
    {'docno': 'd5', 'text': 'The domestic canary is a small songbird.'}
]

if not pt.started():
    pt.init(boot_packages=["com.github.terrierteam:terrier-prf:-SNAPSHOT"])

indexer = pt.IterDictIndexer("/tmp/index", overwrite=True, blocks=True)
index_ref = indexer.index(documents)
index = pt.IndexFactory.of(index_ref)
  2. Which topics
topic = pd.DataFrame([{'qid': '1', 'query': 'dog'}])
  3. Which pipeline
explicit_relevance_feedback = pt.Transformer.from_df(pd.DataFrame([{'query': 'dog', 'qid': '1', 'docno': 'd2'}]))

bo1_expansion = explicit_relevance_feedback >> pt.rewrite.Bo1QueryExpansion(index)
rm3_expansion = explicit_relevance_feedback >> pt.rewrite.RM3(index)
  4. Output
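Running both pipelines on the topic yields the rewritten queries; this minimal sketch is how the output below was produced:

# Apply both expansion pipelines to the single-query topics frame;
# the 'query' column of the result holds the rewritten query.
bo1_expansion.transform(topic)
rm3_expansion.transform(topic)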

[Screenshot_20231022_104422: Bo1 expands the query with additional terms, while RM3 returns only the original term]

Expected behavior

I would expect that RM3 also adds terms like colli, shepherd, poodl, etc to the query.


Thanks in advance!

Best regards,

Maik

cmacdonald commented 11 months ago

Hi @mam10eks. Thanks for the report. I had time today to figure this out.

Firstly, adding pt.logging('INFO') makes RM3 a little more verbose. It wasn't quite verbose enough for me, so I added some more logging.

With that, I got the output:

17:14:05.015 [main] WARN org.terrier.querying.RM1 - Did not identify any usable candidate expansion terms from docid 1
17:14:05.020 [main] INFO org.terrier.querying.RM1 - Found 0 terms after feedback document analysis

If I add the following properties:

pt.set_property("prf.mindf", "1")
pt.set_property("prf.maxdp", "1")

I get:

17:14:45.730 [main] INFO org.terrier.querying.RM1 - Analysing 1 feedback documents
17:14:45.749 [main] INFO org.terrier.querying.RM1 - Found 11 terms after feedback document analysis
17:14:45.755 [main] INFO org.terrier.querying.RM3 - Reformulated query q100 @ lambda=0.6: intellig^0.03999999538064003 german^0.03999999538064003 colli^0.03999999538064003 2^0.03999999538064003 1^0.03999999538064003 border^0.03999999538064003 dog^0.64000004529953 3^0.03999999538064003 type^0.03999999538064003 poodl^0.03999999538064003

I think the overall lesson here is that sometimes small corpora do not exhibit the necessary statistics that tuned models expect.
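Putting this together for the example above, the workaround looks as follows (a sketch; my reading of the two properties is given in the comments):

# Relax terrier-prf's candidate-term filters before constructing RM3.
# My reading: prf.mindf is the minimum document frequency a candidate
# expansion term needs, and prf.maxdp caps the fraction of documents a
# candidate term may appear in.
pt.set_property("prf.mindf", "1")
pt.set_property("prf.maxdp", "1")

rm3_expansion = explicit_relevance_feedback >> pt.rewrite.RM3(index)
rm3_expansion.transform(topic)  # now expands with colli, poodl, border, ...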

My dev notebook is at: https://colab.research.google.com/drive/1BKLNZUTc9DidgHXM8L4o7vPXNEW9PpOO?usp=sharing

HTH

cmacdonald commented 11 months ago

On a related point, RM3 (unlike Bo1) is intended to examine the scores of the retrieved documents. I don't think the score is appropriately passed to RM3 by PyTerrier:

See https://github.com/terrier-org/pyterrier/blob/master/pyterrier/rewrite.py#L223 and https://github.com/terrier-org/pyterrier/blob/master/pyterrier/rewrite.py#L239. Would you be able to verify this, and see the impact on Robust04 performance when fixing it?
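For context, in a typical pseudo-relevance feedback pipeline the first-stage retriever produces the scores that RM3 should use to weight the feedback documents (a minimal sketch using the index and topic from the example above; the BM25 pipeline is illustrative):

# The first-stage BM25 pass attaches a 'score' column to each feedback
# document; RM3 should take these scores into account when weighting
# the feedback documents.
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
rm3_pipeline = bm25 >> pt.rewrite.RM3(index) >> bm25
rm3_pipeline.transform(topic)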

mam10eks commented 11 months ago

Dear Craig,

Thanks for your response! Indeed, I was not aware of the prf.mindf and prf.maxdp properties, and setting them resolves the problem in my tiny hello-world example.

I will have a look into forwarding the retrieval scores to RM3 and will report back the impact on Robust04.

Best regards,

Maik

cmacdonald commented 10 months ago

@mam10eks any news?

mam10eks commented 10 months ago

Hi, I have not yet started on this, but at the beginning of next week I will have time to look into it :)

cmacdonald commented 7 months ago

@mam10eks any news?

cmacdonald commented 5 months ago

@mam10eks any news? I'd like to fix this for the next PyTerrier release.

mam10eks commented 5 months ago

Dear @cmacdonald, sorry again for the delay!

I have things on my agenda for the upcoming CLEF deadline until around Friday, which means I could have first results here on Monday or Tuesday. Would that still suffice?

Best regards,

Maik

mam10eks commented 1 month ago

Dear all (cc @cmacdonald, @seanmacavaney),

Sorry for the super long delay!

I now have a pull request in preparation. Reporting retrieval scores on Robust04 as discussed above is still in progress, but I would already like to share my current solution so that we can check whether it looks plausible.

The commits above should provide a solution: only three lines in the core code change; the rest are unit tests. I.e., the main commit is this one, but I prepared some unit tests before it to ensure that I do not break other things and that the changes in the expansion weights look reasonable.

Overall, what I did was:

  1. As the code that we need to change uses the terrier-prf plugin, the corresponding unit test was commented out because it cannot run during the main test suite. Therefore, I created a new file tests/test_rewrite_rm3.py that contains those tests but skips all of them if terrier-prf is not on the boot classpath. I.e., when the complete test suite is executed, those tests are skipped, but when the file is executed in isolation via pytest tests/test_rewrite_rm3.py, they run, so the cases that I did test are at least preserved (a sketch of the skip guard follows this list). This is captured in commit c541814.
  2. I added a unit test showing that the RM3 expansions do not incorporate the retrieval scores, i.e., for two pipelines, one with TF-IDF and one with BM25, the expansion was the same; this was in commit 2dceb1d.
  3. The code that we need to change is in the QueryExpansion class of PyTerrier; hence, I added approval tests for the other query expansion methods to ensure that they do not change, in commit c5e9d03.
  4. Finally, a possible solution was implemented in commit 2701e28, which shows the changes in the production code and, in the corresponding unit tests, that the expanded queries now differ between TF-IDF and BM25.
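The skip guard looks roughly like this (a sketch, not the exact PR code; the probe for terrier-prf is an assumption on my side):

import unittest
import pyterrier as pt

def _terrier_prf_available():
    # Hypothetical probe: try to resolve a terrier-prf class from the running
    # JVM; this only succeeds if the plugin is on the boot classpath.
    if not pt.started():
        return False
    try:
        pt.autoclass("org.terrier.querying.RM1")
        return True
    except Exception:
        return False

@unittest.skipUnless(_terrier_prf_available(), "terrier-prf is not on the boot classpath")
class TestRewriteRM3(unittest.TestCase):
    ...  # the RM3 tests moved out of the main suite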

Tomorrow I will report the effectiveness scores of the RM3 variants before and after my change on Robust04, but I think the code is already final enough that someone could provide feedback on that end. This is the in-progress pull request.

Best regards,

Maik

cmacdonald commented 1 month ago

Thanks @mam10eks, we'll take a look at the code in the PR

mam10eks commented 1 month ago

Ah, I see, one unit test fails; I had only tested the new test class. I will modify the code so that the unit test passes :)

mam10eks commented 1 month ago

The problem is in this test, which does not prepare a score for the Bo1 expansion.

I think RM3 is the only query expansion method that incorporates the retrieval score, right?

In that case, one solution might be to modify the code in _populate_resultset to use 0 as the score if no scores are provided, and to throw an error in RM3 if no scores are provided, right?

cmacdonald commented 1 month ago

Can we have a requires_scores boolean flag in the super.__init__(), defaulting to false, and set to true by RM3?

(this would be read by _populate_resultset)
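Something like this (a sketch of the proposed flag, not the merged code; the class and method names follow the thread, the bodies are illustrative):

import pandas as pd
import pyterrier as pt

class QueryExpansion(pt.Transformer):  # stand-in for the class in pyterrier/rewrite.py
    def __init__(self, index_like, requires_scores=False):
        # Subclasses whose expansion model weights feedback documents by
        # their retrieval score (i.e. RM3) pass requires_scores=True.
        self.index_like = index_like
        self.requires_scores = requires_scores

    def _populate_resultset(self, topics_and_res: pd.DataFrame, qid: str):
        if self.requires_scores and "score" not in topics_and_res.columns:
            raise ValueError("%s requires a score column in its input" % type(self).__name__)
        # ... copy docids (and, if required, scores) into the Terrier result set ...

class RM3(QueryExpansion):
    def __init__(self, index_like):
        super().__init__(index_like, requires_scores=True)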

mam10eks commented 1 month ago

Yeah, makes perfect sense, I will add this.

mam10eks commented 1 month ago

I added the requires_scores flag; all checks pass now.

I also ran the experiments on Robust04: the impact on Recall@100 and Recall@1000 is negligible, but nDCG@3 and nDCG@10 improve slightly.
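The evaluation is of the following kind (a sketch only; the dataset id, the BM25 pipeline, the pre-built Robust04 index, and the measure names are my illustrative assumptions, not copied from the actual notebook):

# Compare plain BM25 against BM25 >> RM3 >> BM25 on Robust04.
dataset = pt.get_dataset("trec-robust-2004")
index = pt.IndexFactory.of("/path/to/robust04-index")  # assumed pre-built index
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
rm3 = bm25 >> pt.rewrite.RM3(index) >> bm25

pt.Experiment(
    [bm25, rm3],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["recall_100", "recall_1000", "ndcg_cut_3", "ndcg_cut_10"],
)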

Here is a screenshot of the evaluation before my changes (commit by Craig): [screenshot rm3_effectiveness_before_fix]

Here is a screenshot of the evaluation after my changes (commit by myself): [screenshot rm3_effectiveness_after_fix]

mam10eks commented 1 month ago

I marked the corresponding pull request as ready for review, please let me know if something is still missing or should be changed :)

cmacdonald commented 1 month ago

Great to see improvements in performance!

cmacdonald commented 1 month ago

Now fixed, thanks Maik!