normalizeEdaseq results in no changes to read counts

zkstewart commented 2 years ago

I'm attempting to run a number of normalisation methods implemented in metaseqR2, but am having issues using EDASeq normalisation.

With my own data, or following the example given with the help message (?normalizeEdaseq), the resulting data matrix has no changes from the input counts table.

I initially installed metaseqR2 from Bioconductor, and then installed from Github just in case it'd been fixed in a more recent commit. Neither version works.

I will try to run EDASeq independent of metaseqR2 and see what happens.

zkstewart commented 2 years ago

After going through a bit of a process, I have been able to normalise my own data and, I believe, the example data. Both cases give different counts, so I have to conclude that metaseqR2 is doing something wrong.

I ran the following to get the example data normalised

dataMatrix <- metaseqR2:::exampleCountData(2000)
lengths <- round(1000*runif(nrow(dataMatrix)))
gc=runif(nrow(dataMatrix))
test.exp.set <- newSeqExpressionSet(dataMatrix,
                                       phenoData = data.frame(
                                         conditions = factor(c("A", "A", "B", "B", "B")),
                                         row.names = colnames(dataMatrix)
                                       ),
                                       featureData = data.frame(
                                         gc = gc,
                                         length = lengths,
                                         row.names = rownames(dataMatrix)
                                       )
)
test.within <- withinLaneNormalization(test.exp.set,"gc", which="full")
test.norm <- betweenLaneNormalization(test.within, which="full")
test.norm.counts.eda = normCounts(test.norm)

I'd expect normalizeEdaseq to provide a more-or-less equivalent result. Does anyone know why this isn't the case?

pmoulos commented 2 years ago

Hi @zkstewart, Thanks for reporting, I will look at it.

dfanidis commented 2 years ago

Hello @zkstewart,

I have looked into your report and I noticed that in your example you are using full method for withinLaneNormalization. metaseqR2 on the other hand uses loess which is the default method of the EDASeq function. Changing loess to full in normalizeEdaseq returns the same results as with your example (and vice versa).

# =================
# zkstewart  example
# =================

set.seed(21)
dataMatrix <- metaseqR2:::exampleCountData(2000)
lengths <- round(1000*runif(nrow(dataMatrix)))
gc=runif(nrow(dataMatrix))
test.exp.set <- newSeqExpressionSet(dataMatrix,
    phenoData = data.frame(
        conditions = factor(c("A", "A", "B", "B", "B")),
        row.names = colnames(dataMatrix)
    ),
    featureData = data.frame(
        gc = gc,
        length = lengths,
        row.names = rownames(dataMatrix)
    )
)
test.within <- withinLaneNormalization(test.exp.set,"gc", which="full")
test.norm <- betweenLaneNormalization(test.within, which="full")
test.norm.counts.eda = normCounts(test.norm)
head(test.norm.counts.eda)
## head(test.norm.counts.eda)
##           A1  A2  B1  B2  B3
## gene_1_T 202 284  94  91 188
## gene_2_F  98 150 150 133 144
## gene_3_T 128 194  93  64  89
## gene_4_F 418 202 440 716 251
## gene_5_F  26  26   2  13  17
## gene_6_F 320 425 246 284 418

#======================================
# From normalizeEDASeq
#======================================

geneData <- as.data.frame(dataMatrix)
geneCounts <- dataMatrix
classes <- factor(c("A", "A", "B", "B", "B"))
geneData$gc_content <- gc

seqGenes <- newSeqExpressionSet(
    geneCounts,
    phenoData=AnnotatedDataFrame(
        data.frame(
            conditions=classes,
            row.names=colnames(geneCounts)
        )
    ),
    featureData=AnnotatedDataFrame(
        data.frame(
            gc=geneData$gc_content,
            length=lengths,
            row.names=if (is.data.frame(geneData)) rownames(geneData)
                else names(geneData)
        )
    )
)

# ---------------------------------------
# With the metaseqR2 defaults
# ---------------------------------------
normArgs <- metaseqR2:::getDefaults("normalization", "edaseq")
normArgs
## $within.which
## [1] "loess"
#
## $between.which
## [1] "full"

seqGenes <- betweenLaneNormalization(withinLaneNormalization(seqGenes,
    "gc",which=normArgs$within.which),which=normArgs$between.which)
head(normCounts(seqGenes))
##           A1  A2  B1  B2  B3
## gene_1_T 212 277  87 102 180
## gene_2_F 125 204 199 178 189
## gene_3_T 124 187  85  70  91
## gene_4_F 549 231 548 865 302
## gene_5_F  33  33   5  15  18
## gene_6_F 339 460 274 305 408

# ----------------------------------
# Using full instead of loess 
# ----------------------------------
normArgs$within.which <- "full"

seqGenes <- betweenLaneNormalization(withinLaneNormalization(seqGenes,
    "gc",which=normArgs$within.which),which=normArgs$between.which)
head(normCounts(seqGenes))
##           A1  A2  B1  B2  B3
## gene_1_T 202 284  94  91 188
## gene_2_F  98 150 150 133 144
## gene_3_T 128 194  93  64  89
## gene_4_F 418 202 440 716 251
## gene_5_F  26  26   2  13  17
## gene_6_F 320 425 246 284 418

Hope this addresses the issue. Best, Dionysis

zkstewart commented 2 years ago

From what I can tell the script you've provided doesn't trigger the problem that I encountered since you're still calling the EDASeq methods directly. The problem arises when we use the metaseqR2 methods. The example script below demonstrates this.

# my testing

library(metaseqR2)
library(EDASeq)
set.seed(21)

# =================
# setup example data
# =================

dataMatrix <- metaseqR2:::exampleCountData(2000)
lengths <- round(1000*runif(nrow(dataMatrix)))
gc=runif(nrow(dataMatrix))
test.exp.set <- newSeqExpressionSet(dataMatrix,
                                    phenoData = data.frame(
                                      conditions = factor(c("A", "A", "B", "B", "B")),
                                      row.names = colnames(dataMatrix)
                                    ),
                                    featureData = data.frame(
                                      gc = gc,
                                      length = lengths,
                                      row.names = rownames(dataMatrix)
                                    )
)
head(dataMatrix)
## head(dataMatrix)
##           A1  A2  B1  B2  B3
## gene_1_T 169 288  53  81 246
## gene_2_F 100 215 115 138 275
## gene_3_T 100 193  52  50 124
## gene_4_F 449 254 315 648 396
## gene_5_F  29  39   3  12  28
## gene_6_F 304 529 172 237 590

# =================
# full norm with EDASeq directly
# =================

test.within <- withinLaneNormalization(test.exp.set,"gc", which="full")
test.norm <- betweenLaneNormalization(test.within, which="full")
test.norm.counts.eda = normCounts(test.norm)
head(test.norm.counts.eda)
## head(test.norm.counts.eda)
##           A1  A2  B1  B2  B3
## gene_1_T 202 284  94  91 188
## gene_2_F  98 150 150 133 144
## gene_3_T 128 194  93  64  89
## gene_4_F 418 202 440 716 251
## gene_5_F  26  26   2  13  17
## gene_6_F 320 425 246 284 418

# =================
# loess norm with EDASeq directly
# =================

test.within.loess <- withinLaneNormalization(test.exp.set,"gc", which="loess")
test.norm.loess <- betweenLaneNormalization(test.within.loess, which="full")
test.norm.counts.eda.loess = normCounts(test.norm.loess)
head(test.norm.counts.eda.loess)
## head(test.norm.counts.eda.loess)
##           A1  A2  B1  B2  B3
## gene_1_T 212 277  87 102 180
## gene_2_F 125 204 199 178 189
## gene_3_T 124 187  85  70  91
## gene_4_F 549 231 548 865 302
## gene_5_F  33  33   5  15  18
## gene_6_F 339 460 274 305 408

# =================
# loess norm through metaseqR2
# =================

sampleList <- list(A=c("A1","A2"),B=c("B1","B2","B3"))
geneData <- data.frame(
  chromosome=c(rep("chr1",nrow(dataMatrix)/2),
               rep("chr2",nrow(dataMatrix)/2)),
  start=rep(1, nrow(dataMatrix)),
  end=lengths,
  gene_id=rownames(dataMatrix),
  gc_content=gc,
  row.names=rownames(dataMatrix)
)
normArgs <- metaseqR2:::getDefaults("normalization", "edaseq")

test.norm.counts.metar2.loess = normalizeEdaseq(dataMatrix, sampleList, geneData=geneData, normArgs=normArgs)
head(test.norm.counts.metar2.loess)
## head(test.norm.counts.metar2.loess)
##           A1  A2  B1  B2  B3
## gene_1_T 169 288  53  81 246
## gene_2_F 100 215 115 138 275
## gene_3_T 100 193  52  50 124
## gene_4_F 449 254 315 648 396
## gene_5_F  29  39   3  12  28
## gene_6_F 304 529 172 237 590

# =================
# full norm through metaseqR2
# =================

normArgs$within.which = "full"

test.norm.counts.metar2.full = normalizeEdaseq(dataMatrix, sampleList, geneData=geneData, normArgs=normArgs)
head(test.norm.counts.metar2.full)
## head(test.norm.counts.metar2.full)
##           A1  A2  B1  B2  B3
## gene_1_T 169 288  53  81 246
## gene_2_F 100 215 115 138 275
## gene_3_T 100 193  52  50 124
## gene_4_F 449 254 315 648 396
## gene_5_F  29  39   3  12  28
## gene_6_F 304 529 172 237 590

Regardless of whether normArgs is configured to use loess or full, the resulting counts table is identical to the original i.e., no normalisation has occurred to it. To the best of my knowledge I emulated the expression set correctly for use by metaseqR2 as per what I see here https://rdrr.io/bioc/metaseqR2/man/normalizeEdaseq.html.

Are you able to get the metaseqR2 method normalizeEdaseq to provide normalised output that is the same as when we call EDASeq directly through withinLaneNormalization and betweenLaneNormalization ?

Thanks, Zac.

zkstewart commented 2 years ago

Digging through the source code, I believe I know what the issue is. In the file R/norm.R on line 63 (https://github.com/pmoulos/metaseqR2/blob/34a32008a0db34a6736d1ce3f1fd3afb404df3d2/R/norm.R#L63) the code reads:

return(counts(seqGenes)) # Class: matrix

It should instead use the normCounts method like we are above, and hence read as such:

return(normCounts(seqGenes)) # Class: matrix

pmoulos commented 2 years ago

Hi @zkstewart, The last commit should fix the issue. Thanks for pointing out.

pmoulos / metaseqR2

normalizeEdaseq results in no changes to read counts #15