lgeistlinger / EnrichmentBrowser

Seamless navigation through combined results of set-based and network-based enrichment analysis
20 stars 11 forks source link

Non-integer RNASeq data pass to deAna() when de.method = "limma" #11

Closed mctseng2 closed 6 years ago

mctseng2 commented 6 years ago

Thanks for the great package! This is not a bug but my personal concern . I was just curious about how the deAna() know the data type (micro array or RNAseq) it receive and found out this small function .detectDataType() inside deAna() which tend to categorize the data to RNAseq if the matrix is integer otherwise to microarray. Since I am using salmon and tximport which generates non-interger count matrix (due to scaling to transcript length and library size). It seems that deAna() will go for the microarray procedure instead of correctly selecting rseq.

This is my count matirx


head(d.filt$counts)[,1:3]
                      S1          S2          S3
ENSMUSG00000000001.4  1290.56200  1410.50686  1921.12816
ENSMUSG00000000028.14   58.02667    41.99000    55.20249
ENSMUSG00000000031.16   14.79151    27.29881    44.81135
ENSMUSG00000000056.7   561.58500   900.47103   822.54670
ENSMUSG00000000058.6  1026.76052   969.94733  2204.41456
ENSMUSG00000000078.7  8058.69193 10324.71306 11075.19136

This is when I tried to hack in .detectDataType() to see if the function correctly tells the datatype:


ifelse(all(EnrichmentBrowser:::.isWholenumber(d.filt$counts), na.rm = TRUE), "rseq", "ma")
[1] "ma"

I am wondering if the data.type argument can be added to deAna() because some users might not aware of it and might get tripped. Thanks!

lgeistlinger commented 6 years ago

Correct, in this case the data type would be detected as "ma".

Note that in effect this distinction between data types is mainly done for deciding which DE and EA methods to apply.

The main concern when providing raw RNA-seq read count data (for which transcript length and library size has not been taken into account), is that DE and EA microarray methods are not per se applicable.

I think what you are having here are TPMs, which typically allow, except for maybe an additional variance stabilizing transformation, the application of
methods as developed for microarray data.

Note also that you can always set the data type for your SummarizedExperiment named se by:

metadata(se)$dataType <- "rseq"

However, depending on what you are intending to do, "ma" would be indeed the more appropriate choice here. At least concerning the enrichment analysis.