Closed JorgeGabin closed 5 months ago
Thanks @JorgeGabin, this is an excellent spot. Should we have a unit test on this, e.g. that ensures p-values for correction="hommel"
are not NaN?
@seanmacavaney can double check this for sanity?
Yes, I will add a unit test for that. What do you think about updating the existing test_baseline_corrected
one to something like this? @cmacdonald
def test_baseline_corrected(self):
dataset = pt.get_dataset("vaswani")
res1 = pt.BatchRetrieve(dataset.get_index(), wmodel="BM25")(dataset.get_topics().head(10))
res2 = pt.BatchRetrieve(dataset.get_index(), wmodel="DPH")(dataset.get_topics().head(10))
baseline = 0
for corr in ['hs', 'bonferroni', 'hommel']:
df = pt.Experiment(
[res1, res2],
dataset.get_topics().head(10),
dataset.get_qrels(),
eval_metrics=["map", "ndcg"],
baseline=baseline, correction=corr)
self.assertTrue("map +" in df.columns)
self.assertTrue("map -" in df.columns)
self.assertTrue("map p-value" in df.columns)
self.assertTrue("map p-value corrected" in df.columns)
self.assertTrue("map reject" in df.columns)
self.assertFalse(any(df["map p-value corrected"].drop(baseline).isna()))
This way now several methods are checked and also the check you suggested is added.
Older code here:
def test_baseline_corrected(self):
dataset = pt.get_dataset("vaswani")
res1 = pt.BatchRetrieve(dataset.get_index(), wmodel="BM25")(dataset.get_topics().head(10))
res2 = pt.BatchRetrieve(dataset.get_index(), wmodel="DPH")(dataset.get_topics().head(10))
for corr in ['hs', 'bonferroni', 'holm-sidak']:
df = pt.Experiment(
[res1, res2],
dataset.get_topics().head(10),
dataset.get_qrels(),
eval_metrics=["map", "ndcg"],
baseline=0, correction='hs')
self.assertTrue("map +" in df.columns)
self.assertTrue("map -" in df.columns)
self.assertTrue("map p-value" in df.columns)
self.assertTrue("map p-value corrected" in df.columns)
self.assertTrue("map reject" in df.columns)
Unit test: Sounds good to me.
Are you sure df.drop(int) drops equivalent to iloc or equivalent to index label? They might be equivalent here anyway, ie the dataframe has integer labels.
df.drop(int)
drops by index label, in fact, it is equivalent here because the results dataframe uses the default index, but may we change it so that we drop the row based on the iloc?
Yes, I'd prefer dropping by iloc, for safety. baseline=0 is an iloc expression.
As iloc
is deprecated (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) I implemented this by getting the index label using baseline
as the position, I think it should be equivalent to what we were talking about.
I think it should be equivalent to what we were talking about.
Perfect.
Finally, can we document the impact here for more common correction methods like HB and B? Just rerunning your code and reporting the old vs new tables with corrected p-values
Craig
I think it should be equivalent to what we were talking about.
Perfect.
Finally, can we document the impact here for more common correction methods like HB and B? Just rerunning your code and reporting the old vs new tables with corrected p-values
Craig
Yes, I will add these extra examples in the first comment so that they are more visible
Impact assessment, e.g. for bonferroni: it was harder to be significant when denominator is (n+1) instead of n, for n comparisons to the baseline. So some things that used to not be significant after correction would now be significant.
Merged and new release made, thank you @JorgeGabin
Issue: When performing the significance test correction the p-value for the baseline (NaN) is included in the p-values list, therefore corrected p-values are not right.
Code to reproduce the error:
Output:
Expected output:
This example is clarifying as we are using Hommel as the correction method which does not allow NaN values, however, it applies to other corrections as well.
Proposed solution: I propose dropping the baseline p-value from the dataframe column which is used as the input for the
multipletests
function and then inserting default values on thereject
andcorrected
ndarrays for it.Additional experiments: Holm-Sidak Output:
Expected output:
Bonferroni Output:
Expected output:
Holm Output:
Expected output: