terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

Delete baseline pvalue from correction method input #440

Closed JorgeGabin closed 5 months ago

JorgeGabin commented 5 months ago

Issue: When performing the significance test correction, the baseline's p-value (NaN) is included in the list of p-values passed to the correction method, so the corrected p-values are wrong.

Code to reproduce the error:

import pyterrier as pt 

from pyterrier.measures import *
from pyterrier_pisa import PisaIndex

pt.init()

dataset = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')

index = PisaIndex.from_dataset('msmarco_passage')
bm25 = index.bm25(num_results=1000)
dph = index.dph()
pl2 = index.pl2(c=1.0)

res = pt.Experiment(
    [bm25, dph, pl2],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=[RR(rel=2)@10], # type: ignore
    batch_size=1024,
    drop_unused=True,
    names = ["bm25", "dph", "pl2"],
    round=4,
    baseline=0,
    correction="hommel",
)

Output:

   name  RR(rel=2)@10  RR(rel=2)@10 +  RR(rel=2)@10 -  RR(rel=2)@10 p-value  RR(rel=2)@10 reject  RR(rel=2)@10 p-value corrected
0  bm25        0.6780             NaN             NaN                   NaN                False                             NaN
1   dph        0.6624             6.0             8.0              0.512930                False                             NaN
2   pl2        0.6452             5.0            11.0              0.433652                False                             NaN

Expected output:

   name  RR(rel=2)@10  RR(rel=2)@10 +  RR(rel=2)@10 -  RR(rel=2)@10 p-value  RR(rel=2)@10 reject  RR(rel=2)@10 p-value corrected
0  bm25        0.6780             NaN             NaN                   NaN                False                             NaN
1   dph        0.6624             6.0             8.0              0.512930                False                         0.51293
2   pl2        0.6452             5.0            11.0              0.433652                False                         0.51293

This example is illustrative because we are using Hommel as the correction method, which does not allow NaN values; however, the issue applies to other correction methods as well.

Proposed solution: drop the baseline's p-value from the dataframe column that is used as input to the multipletests function, then insert default values for the baseline into the reject and corrected ndarrays.
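A minimal sketch of that idea (not the actual patch; `correct_pvalues` is a hypothetical helper, assuming statsmodels' `multipletests` as used by pt.Experiment):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

def correct_pvalues(pvalues, baseline, method="hommel", alpha=0.05):
    """Apply a multiple-testing correction, excluding the baseline's NaN p-value."""
    pvalues = np.asarray(pvalues, dtype=float)
    mask = np.arange(len(pvalues)) != baseline  # drop the baseline position
    reject_sub, corrected_sub, _, _ = multipletests(
        pvalues[mask], alpha=alpha, method=method)
    # Re-insert defaults for the baseline: not rejected, NaN corrected p-value
    reject = np.full(len(pvalues), False)
    corrected = np.full(len(pvalues), np.nan)
    reject[mask] = reject_sub
    corrected[mask] = corrected_sub
    return reject, corrected

# With the p-values from the example above, the two non-baseline entries
# are corrected over m = 2 tests instead of m = 3.
reject, corrected = correct_pvalues([np.nan, 0.512930, 0.433652],
                                    baseline=0, method="hommel")
```
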

Additional experiments:

Holm-Sidak Output:

   name  RR(rel=2)@10  RR(rel=2)@10 +  RR(rel=2)@10 -  RR(rel=2)@10 p-value  RR(rel=2)@10 reject  RR(rel=2)@10 p-value corrected
0  bm25        0.6780             NaN             NaN                   NaN                False                             NaN
1   dph        0.6624             6.0             8.0              0.512930                False                        0.818344
2   pl2        0.6452             5.0            11.0              0.433652                False                        0.818344

Expected output:

   name  RR(rel=2)@10  RR(rel=2)@10 +  RR(rel=2)@10 -  RR(rel=2)@10 p-value  RR(rel=2)@10 reject  RR(rel=2)@10 p-value corrected
0  bm25        0.6780             NaN             NaN                   NaN                False                             NaN
1   dph        0.6624             6.0             8.0              0.512930                False                         0.67925
2   pl2        0.6452             5.0            11.0              0.433652                False                         0.67925

Bonferroni Output:

   name  RR(rel=2)@10  RR(rel=2)@10 +  RR(rel=2)@10 -  RR(rel=2)@10 p-value  RR(rel=2)@10 reject  RR(rel=2)@10 p-value corrected
0  bm25        0.6780             NaN             NaN                   NaN                False                             NaN
1   dph        0.6624             6.0             8.0              0.512930                False                             1.0
2   pl2        0.6452             5.0            11.0              0.433652                False                             1.0

Expected output:

   name  RR(rel=2)@10  RR(rel=2)@10 +  RR(rel=2)@10 -  RR(rel=2)@10 p-value  RR(rel=2)@10 reject  RR(rel=2)@10 p-value corrected
0  bm25        0.6780             NaN             NaN                   NaN                False                             NaN
1   dph        0.6624             6.0             8.0              0.512930                False                        1.000000
2   pl2        0.6452             5.0            11.0              0.433652                False                        0.867304

Holm Output:

   name  RR(rel=2)@10  RR(rel=2)@10 +  RR(rel=2)@10 -  RR(rel=2)@10 p-value  RR(rel=2)@10 reject  RR(rel=2)@10 p-value corrected
0  bm25        0.6780             NaN             NaN                   NaN                False                             NaN
1   dph        0.6624             6.0             8.0              0.512930                False                             1.0
2   pl2        0.6452             5.0            11.0              0.433652                False                             1.0

Expected output:

   name  RR(rel=2)@10  RR(rel=2)@10 +  RR(rel=2)@10 -  RR(rel=2)@10 p-value  RR(rel=2)@10 reject  RR(rel=2)@10 p-value corrected
0  bm25        0.6780             NaN             NaN                   NaN                False                             NaN
1   dph        0.6624             6.0             8.0              0.512930                False                        0.867304
2   pl2        0.6452             5.0            11.0              0.433652                False                        0.867304
cmacdonald commented 5 months ago

Thanks @JorgeGabin, this is an excellent spot. Should we have a unit test on this, e.g. that ensures p-values for correction="hommel" are not NaN?

@seanmacavaney can double check this for sanity?

JorgeGabin commented 5 months ago

Yes, I will add a unit test for that. What do you think about updating the existing test_baseline_corrected one to something like this? @cmacdonald

  def test_baseline_corrected(self):
      dataset = pt.get_dataset("vaswani")
      res1 = pt.BatchRetrieve(dataset.get_index(), wmodel="BM25")(dataset.get_topics().head(10))
      res2 = pt.BatchRetrieve(dataset.get_index(), wmodel="DPH")(dataset.get_topics().head(10))
      baseline = 0
      for corr in ['hs', 'bonferroni', 'hommel']:            
          df = pt.Experiment(
              [res1, res2], 
              dataset.get_topics().head(10), 
              dataset.get_qrels(),
              eval_metrics=["map", "ndcg"], 
              baseline=baseline, correction=corr)
          self.assertTrue("map +" in df.columns)
          self.assertTrue("map -" in df.columns)
          self.assertTrue("map p-value" in df.columns)
          self.assertTrue("map p-value corrected" in df.columns)
          self.assertTrue("map reject" in df.columns)
          self.assertFalse(any(df["map p-value corrected"].drop(baseline).isna()))

This way several correction methods are checked, and the check you suggested is added as well.

Older code here:

  def test_baseline_corrected(self):
      dataset = pt.get_dataset("vaswani")
      res1 = pt.BatchRetrieve(dataset.get_index(), wmodel="BM25")(dataset.get_topics().head(10))
      res2 = pt.BatchRetrieve(dataset.get_index(), wmodel="DPH")(dataset.get_topics().head(10))
      for corr in ['hs', 'bonferroni', 'holm-sidak']:            
          df = pt.Experiment(
              [res1, res2], 
              dataset.get_topics().head(10), 
              dataset.get_qrels(),
              eval_metrics=["map", "ndcg"], 
              baseline=0, correction='hs')
          self.assertTrue("map +" in df.columns)
          self.assertTrue("map -" in df.columns)
          self.assertTrue("map p-value" in df.columns)
          self.assertTrue("map p-value corrected" in df.columns)
          self.assertTrue("map reject" in df.columns)
cmacdonald commented 5 months ago

Unit test: Sounds good to me.

Are you sure whether df.drop(int) drops by position (equivalent to iloc) or by index label? They might be equivalent here anyway, i.e. the dataframe has the default integer labels.

JorgeGabin commented 5 months ago

df.drop(int) drops by index label; in fact, it is equivalent here because the results dataframe uses the default index. Shall we change it so that we drop the row by position instead?

cmacdonald commented 5 months ago

Yes, I'd prefer dropping by iloc, for safety. baseline=0 is an iloc expression.

JorgeGabin commented 5 months ago

As iloc is deprecated (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html), I implemented this by getting the index label using baseline as the position; I think it should be equivalent to what we were talking about.
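For illustration (a minimal sketch, not the actual patch), resolving the positional baseline to its index label before calling drop looks like this:

```python
import pandas as pd

# Toy frame with non-default integer labels, to show why drop-by-label differs
df = pd.DataFrame({"p-value": [float("nan"), 0.51, 0.43]}, index=[10, 11, 12])
baseline = 0  # positional, as in pt.Experiment(baseline=0)

# df["p-value"].drop(baseline) would raise a KeyError here,
# because drop() matches index labels, not positions.
label = df.index[baseline]  # resolve position -> label (here: 10)
without_baseline = df["p-value"].drop(label)
```

With the default RangeIndex the label and the position coincide, which is why the original code happened to work.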

cmacdonald commented 5 months ago

> I think it should be equivalent to what we were talking about.

Perfect.

Finally, can we document the impact here for more common correction methods like HB and B? Just rerunning your code and reporting the old vs new tables with corrected p-values

Craig

JorgeGabin commented 5 months ago

> Finally, can we document the impact here for more common correction methods like HB and B? Just rerunning your code and reporting the old vs new tables with corrected p-values

Yes, I will add these extra examples in the first comment so that they are more visible

cmacdonald commented 5 months ago

Impact assessment, e.g. for Bonferroni: it was harder to reach significance when the denominator was (n+1) instead of n, for n comparisons to the baseline. So some results that previously were not significant after correction may now be significant.
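The Bonferroni case can be checked by hand against the tables above: the correction multiplies each raw p-value by the number of tests m (equivalently, compares against alpha/m), and the fix changes m from n+1 (baseline NaN counted) to n:

```python
p_pl2 = 0.433652  # raw p-value for pl2 from the tables above

# Before the fix: the baseline's NaN p-value was counted, so m = n + 1 = 3
old = min(p_pl2 * 3, 1.0)  # capped at 1.0
# After the fix: only the n = 2 real comparisons are counted
new = min(p_pl2 * 2, 1.0)  # matches the expected 0.867304 in the table
```
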

cmacdonald commented 5 months ago

Merged and new release made, thank you @JorgeGabin