immunogenomics / presto

Fast Wilcoxon and auROC
144 stars 33 forks source link

Incorrect auxiliary stats on wilcoxauc with dense matrix (AvgExpr and LogFC) #7

Open bnprks opened 4 years ago

bnprks commented 4 years ago

Thanks for publishing this package! I've found it very useful to get fast wilcoxon tests, however I've noticed some of the auxiliary stats returned are incorrect which could cause issues for users.

Specifically, wilcoxauc returns incorrect AvgExpr values when called on a dense matrix. This appears to be because the function cpp_rank_matrix_dense directly modifies the matrix X, thus altering its value for later computations.

Furthermore, the results column listed as logFC is not actually returning the log fold change, but the difference in means. I think either the logFC column should be named differently to reflect what it does return (e.g. meanDiff), or its calculation should be changed to actually return the logFC between group means.

I've included a brief example to reproduce below:

X <- matrix(c(0,1,2,15,16,17), nrow=1)
y <- factor(c("A", "A", "A", "B", "B", "B"))
res <- presto::wilcoxauc(X, y)

X_sparse <- sparseMatrix(i=rep(1, 6), j=1:6, x=X[,])
res2 <- presto::wilcoxauc(X_sparse, y)

The values of res and res2 are now:

> res
   feature group avgExpr logFC statistic auc      pval      padj pct_in pct_out
1 Feature1     A       2    -3         0   0 0.0808556 0.0808556    100     100
2 Feature1     B       5     3         9   1 0.0808556 0.0808556    100     100
> res2
   feature group avgExpr logFC statistic auc      pval      padj pct_in pct_out
1 Feature1     A       1   -15         0   0 0.0808556 0.0808556    100     100
2 Feature1     B      16    15         9   1 0.0808556 0.0808556    100     100

Here, the avgExpr in res is the average rank of the values, rather than the actual average. Additionally, we see that logFC in both cases is equal to the difference in means rather than the logFC (which should be calculated as +/-4, assuming we are taking log2)