Thanks for the incredibly fast implementation. I have been trying to clone the repo and learn from the code. When I was trying to extend it from a default two-sided hypothesis to a one-sided hypothesis, I found something potentially mistaken.
For the hypothesis extension, it's rather simple by switching a correction value and pnorm() call in compute_pval(), referring to R's implementation. However, when I'm testing it with a sparse matrix where one of the features/genes is all zero, I found the result pretty odd -- It got a <1 p-value! So I digged into the code and suspected the ties returned from Rcpp side lacks the counting on zeros, which I believe is intentional for sparse data. I can see that the ties on zeros are easily addressed when calculating the ustats, but I never found it even mentioned when you calculate p-values. With suspecting that this is the cause, I modified my the Rcpp code in my copy:
src/fast_wilcox.cpp function cpp_rank_matrix_dgc
std::vector<std::list<float> > cpp_rank_matrix_dgc(
arma::vec& x, const arma::vec& p, int nrow, int ncol) {
vector<list<float> > ties(ncol);
int n_zero;
for (int i = 0; i < ncol; i++) {
n_zero = nrow - (p[i+1] - p[i]);
if (p[i+1] == p[i]) {
ties[i].push_back(n_zero);
continue;
}
ties[i] = cpp_in_place_rank_mean(x, p[i], p[i + 1] - 1);
ties[i].push_back(n_zero);
x.rows(p[i], p[i + 1] - 1) += n_zero;
}
return ties;
}
and also in function cpp_in_place_rank_mean, it seems the lastly counted ties are also not pushed back as well.
// code originally between line 125-143
for (i = 1U; i < v_sort.size(); i++) {
if (v_sort[i].first != v_sort[i - 1].first) {
// .....
} else {
// .....
}
}
// Things look good until now but I guess we need to add the following line
// right after the loop ends
if (n > 1) ties.push_back(n);
OK, all of these are based on that I want the function to be able to replicate what stats::wilcox.test() can produce. If it was on purpose to ignore the ties on zeros and the last happening ties when calculating p-values, I would really love to learn about the reason. But overall I appreciate it so much that you have made this package!
Hi team,
Thanks for the incredibly fast implementation. I have been trying to clone the repo and learn from the code. When I was trying to extend it from a default two-sided hypothesis to a one-sided hypothesis, I found something potentially mistaken.
For the hypothesis extension, it's rather simple by switching a correction value and
pnorm()
call incompute_pval()
, referring to R's implementation. However, when I'm testing it with a sparse matrix where one of the features/genes is all zero, I found the result pretty odd -- It got a <1 p-value! So I digged into the code and suspected theties
returned from Rcpp side lacks the counting on zeros, which I believe is intentional for sparse data. I can see that the ties on zeros are easily addressed when calculating theustats
, but I never found it even mentioned when you calculate p-values. With suspecting that this is the cause, I modified my the Rcpp code in my copy:src/fast_wilcox.cpp function
cpp_rank_matrix_dgc
and also in function
cpp_in_place_rank_mean
, it seems the lastly counted ties are also not pushed back as well.It's very easy to test with something like
And if I apply the modification I described above, we can then have
OK, all of these are based on that I want the function to be able to replicate what
stats::wilcox.test()
can produce. If it was on purpose to ignore the ties on zeros and the last happening ties when calculating p-values, I would really love to learn about the reason. But overall I appreciate it so much that you have made this package!-Yichen