immunogenomics / presto

Fast Wilcoxon and auROC
167 stars 35 forks source link

LogFC Clarification #6

Open christopher-hardy opened 4 years ago

christopher-hardy commented 4 years ago

Hi,

Thank you for your work- it is really remarkable.

Not sure if this is a bug or a feature, but when doing a comparison between presto::wilcoxauc and Seurat::FindMarkers I noticed a difference in the LogFC calculation. When calculating the LogFC from a Seurat Data Object, the values are untransformed prior to calculating the mean, then re-log-transformed on the mean values- plus a pseudo count of default 1 to avoid Inf LogFCs (Line 551 @ https://github.com/satijalab/seurat/blob/master/R/differential_expression.R). This difference causes changes in the avgExpr and logFC when compared to the Seurat output. I was wondering if you could clarify whether you believe your approach is correct, or if this a potential issue with an edge case when the data are previously log transformed?

Note: If helpful I am able to replicate the values in Seurat with a couple slight modifications (see lines 1, 7 and 11 below).

  X <- expm1(X)
  group_sums <- sumGroups(X, y, 1)
  group_nnz <- nnzeroGroups(X, y, 1)
  group_pct <- sweep(group_nnz, 1, as.numeric(table(y)), "/") %>% t()
  group_pct_out <- -group_nnz %>% sweep(2, colSums(group_nnz), 
                                        "+") %>% sweep(1, as.numeric(length(y) - table(y)), "/") %>% t()
  group_means <- log(sweep(group_sums, 1, as.numeric(table(y)), "/") + 1) %>% t()
  cs <- colSums(group_sums)
  gs <- as.numeric(table(y))
  lfc <- Reduce(cbind, lapply(seq_len(length(levels(y))), function(g) {
    group_means[, g] - (log((cs - group_sums[g, ]) / (length(y) - gs[g]) + 1))
  }))

Thanks!

Chris

ireen-kal commented 1 year ago

Hi! I also noticed differences between presto::wilcoxauc and Seurat::FindMarkers in the reported logFC and avg_log2FC respectively. I was wondering if you could clarify this a bit more, and whether one or the other calculation would be better?

slowkow commented 11 months ago

Hey @ilyakorsunsky I think there is a - where there should be a / on this line:

https://github.com/immunogenomics/presto/blob/31dc97fed5e2e7fc323ae4af62f72181cc51d9a3/R/wilcoxauc.R#L192

See the example below for details.

Example

Here is a complete example you can try running on your own:

Generate random data:

library(magrittr)
set.seed(42)
x1 <- rnorm(10) + 3
x2 <- rnorm(10) + 3

This is the fold-change:

mean(x1) / mean(x2)
#> [1] 1.25057

And the log fold-change:

log(mean(x1) / mean(x2))
#> [1] 0.2235997

OK so now let’s take a closer look at the presto::wilcoxauc() function to see if we are getting the correct value for log fold-change:

X <- matrix(c(x1, x2), nrow = 1)
y <- factor(rep(c(1, 2), each = 10))
presto::wilcoxauc(X, y)
#>    feature group  avgExpr      logFC statistic  auc      pval      padj pct_in
#> 1 Feature1     1 3.547297  0.7107535        66 0.66 0.2413216 0.2413216    100
#> 2 Feature1     2 2.836543 -0.7107535        34 0.34 0.2413216 0.2413216    100
#>   pct_out
#> 1     100
#> 2     100

OK, we wanted 0.22 but we got 0.71, so maybe something is not right.

Let's go through each line of the presto::wilcoxauc() function, one line at a time, and see where we might be going wrong:

### Auxiliary Statistics (AvgExpr, PctIn, LFC, etc)
group_sums <- presto::sumGroups(X, y, 1)
group_sums
#>          [,1]
#> [1,] 35.47297
#> [2,] 28.36543
group_nnz <- presto::nnzeroGroups(X, y, 1)
group_nnz
#>      [,1]
#> [1,]   10
#> [2,]   10
group_pct <- sweep(group_nnz, 1, as.numeric(table(y)), "/") %>% t()
group_pct
#>      [,1] [,2]
#> [1,]    1    1
group_pct_out <- -group_nnz %>%
  sweep(2, colSums(group_nnz) , "+") %>% 
  sweep(1, as.numeric(length(y) - table(y)), "/") %>% t()
group_pct_out
#>      [,1] [,2]
#> [1,]    1    1
group_means <- sweep(group_sums, 1, as.numeric(table(y)), "/") %>% t()
group_means
#>          [,1]     [,2]
#> [1,] 3.547297 2.836543
cs <- colSums(group_sums)
cs
#> [1] 63.8384
gs <- as.numeric(table(y))
gs
#> [1] 10 10

So far so good.

But this is giving us 0.71 instead of 0.22:

lfc <- Reduce(cbind, lapply(seq_len(length(levels(y))), function(g) {
  group_means[, g] - ((cs - group_sums[g, ]) / (length(y) - gs[g]))
}))
lfc
#>           init           
#> [1,] 0.7107535 -0.7107535

After we change - to / and add a %>% log() at the end, we get the correct numbers:

lfc <- Reduce(cbind, lapply(seq_len(length(levels(y))), function(g) {
  group_means[, g] / ((cs - group_sums[g, ]) / (length(y) - gs[g]))
})) %>% log()
lfc
#>           init           
#> [1,] 0.2235997 -0.2235997
TeodoraTockovska commented 9 months ago

I also am getting different results from FindAllMarkers and presto's function (presto:::wilcoxauc.Seurat). I am running presto as shown below:

markers_rna <- presto:::wilcoxauc.Seurat(X = seur_obj, 
                                         group_by = 'wsnn_res.0.8', assay = 'data', 
                                         seurat_assay = 'SCT')
WT215 commented 8 months ago

Hey @ilyakorsunsky I think there is a - where there should be a / on this line:

https://github.com/immunogenomics/presto/blob/31dc97fed5e2e7fc323ae4af62f72181cc51d9a3/R/wilcoxauc.R#L192

See the example below for details.

Example

Here is a complete example you can try running on your own:

Generate random data:

library(magrittr)
set.seed(42)
x1 <- rnorm(10) + 3
x2 <- rnorm(10) + 3

This is the fold-change:

mean(x1) / mean(x2)
#> [1] 1.25057

And the log fold-change:

log(mean(x1) / mean(x2))
#> [1] 0.2235997

OK so now let’s take a closer look at the presto::wilcoxauc() function to see if we are getting the correct value for log fold-change:

X <- matrix(c(x1, x2), nrow = 1)
y <- factor(rep(c(1, 2), each = 10))
presto::wilcoxauc(X, y)
#>    feature group  avgExpr      logFC statistic  auc      pval      padj pct_in
#> 1 Feature1     1 3.547297  0.7107535        66 0.66 0.2413216 0.2413216    100
#> 2 Feature1     2 2.836543 -0.7107535        34 0.34 0.2413216 0.2413216    100
#>   pct_out
#> 1     100
#> 2     100

OK, we wanted 0.22 but we got 0.71, so maybe something is not right.

Let's go through each line of the presto::wilcoxauc() function, one line at a time, and see where we might be going wrong:

### Auxiliary Statistics (AvgExpr, PctIn, LFC, etc)
group_sums <- presto::sumGroups(X, y, 1)
group_sums
#>          [,1]
#> [1,] 35.47297
#> [2,] 28.36543
group_nnz <- presto::nnzeroGroups(X, y, 1)
group_nnz
#>      [,1]
#> [1,]   10
#> [2,]   10
group_pct <- sweep(group_nnz, 1, as.numeric(table(y)), "/") %>% t()
group_pct
#>      [,1] [,2]
#> [1,]    1    1
group_pct_out <- -group_nnz %>%
  sweep(2, colSums(group_nnz) , "+") %>% 
  sweep(1, as.numeric(length(y) - table(y)), "/") %>% t()
group_pct_out
#>      [,1] [,2]
#> [1,]    1    1
group_means <- sweep(group_sums, 1, as.numeric(table(y)), "/") %>% t()
group_means
#>          [,1]     [,2]
#> [1,] 3.547297 2.836543
cs <- colSums(group_sums)
cs
#> [1] 63.8384
gs <- as.numeric(table(y))
gs
#> [1] 10 10

So far so good.

But this is giving us 0.71 instead of 0.22:

lfc <- Reduce(cbind, lapply(seq_len(length(levels(y))), function(g) {
  group_means[, g] - ((cs - group_sums[g, ]) / (length(y) - gs[g]))
}))
lfc
#>           init           
#> [1,] 0.7107535 -0.7107535

After we change - to / and add a %>% log() at the end, we get the correct numbers:

lfc <- Reduce(cbind, lapply(seq_len(length(levels(y))), function(g) {
  group_means[, g] / ((cs - group_sums[g, ]) / (length(y) - gs[g]))
})) %>% log()
lfc
#>           init           
#> [1,] 0.2235997 -0.2235997

I tried this modification, in addition to that, I added 0.1 to the count matrix to avoid Inf. Then the returned logFC seems to be consistent with the default logFC. So I guess the purpose of original logFC was to avoid Inf?