immunogenomics / presto

Fast Wilcoxon and auROC
167 stars 35 forks source link

I digged into the code to see why I cannot reproduce stats::wilcox.test() result #29

Open mvfki opened 9 months ago

mvfki commented 9 months ago

Hi team,

Thanks for the incredibly fast implementation. I have been trying to clone the repo and learn from the code. When I was trying to extend it from a default two-sided hypothesis to a one-sided hypothesis, I found something potentially mistaken.

For the hypothesis extension, it's rather simple by switching a correction value and pnorm() call in compute_pval(), referring to R's implementation. However, when I'm testing it with a sparse matrix where one of the features/genes is all zero, I found the result pretty odd -- It got a <1 p-value! So I digged into the code and suspected the ties returned from Rcpp side lacks the counting on zeros, which I believe is intentional for sparse data. I can see that the ties on zeros are easily addressed when calculating the ustats, but I never found it even mentioned when you calculate p-values. With suspecting that this is the cause, I modified my the Rcpp code in my copy:

src/fast_wilcox.cpp function cpp_rank_matrix_dgc

std::vector<std::list<float> > cpp_rank_matrix_dgc(
        arma::vec& x, const arma::vec& p, int nrow, int ncol) {
    vector<list<float> > ties(ncol);
    int n_zero;
    for (int i = 0; i < ncol; i++) {
        n_zero = nrow - (p[i+1] - p[i]);
        if (p[i+1] == p[i]) {
            ties[i].push_back(n_zero);
            continue;
        }
        ties[i] = cpp_in_place_rank_mean(x, p[i], p[i + 1] - 1);
        ties[i].push_back(n_zero);
        x.rows(p[i], p[i + 1] - 1) += n_zero;
    }
    return ties;
}

and also in function cpp_in_place_rank_mean, it seems the lastly counted ties are also not pushed back as well.

// code originally between line 125-143
    for (i = 1U; i < v_sort.size(); i++) {
        if (v_sort[i].first != v_sort[i - 1].first) {
            // .....
        } else {
            // .....
        }
    }
    // Things look good until now but I guess we need to add the following line 
    // right after the loop ends
    if (n > 1) ties.push_back(n);

It's very easy to test with something like

vec <- rep(0:2, 5)
y <- factor(rep(c(1,2), c(7,8)))
sparse <- as(t(vec), "CsparseMatrix")
presto::wilcoxauc(sparse, y)
   feature group   avgExpr      logFC statistic       auc      pval      padj   pct_in  pct_out
1 Feature1     1 0.8571429 -0.2678571        23 0.4107143 0.5888988 0.5888988 57.14286 75.00000
2 Feature1     2 1.1250000  0.2678571        33 0.5892857 0.5888988 0.5888988 75.00000 57.14286

wilcox.test(vec[1:7], vec[8:15])$p.value
[1] 0.581541

And if I apply the modification I described above, we can then have

wilcoxauc(sparse, y)
   feature group   avgExpr      logFC statistic       auc     pval     padj   pct_in  pct_out
1 Feature1     1 0.8571429 -0.2678571        23 0.4107143 0.581541 0.581541 57.14286 75.00000
2 Feature1     2 1.1250000  0.2678571        33 0.5892857 0.581541 0.581541 75.00000 57.14286

OK, all of these are based on that I want the function to be able to replicate what stats::wilcox.test() can produce. If it was on purpose to ignore the ties on zeros and the last happening ties when calculating p-values, I would really love to learn about the reason. But overall I appreciate it so much that you have made this package!

-Yichen