thelovelab / tximport

Transcript quantification import for modular pipelines
136 stars 33 forks source link

Allow missing-ness in individual transcriptomes #14

Closed mdshw5 closed 8 years ago

mdshw5 commented 8 years ago

I have use cases for building matrices of TPM/counts where the length(txId) != length(raw[[txIdCol]]):

  1. Combining data from human/mouse xenografts with measure of the same human or mouse genetic background. Obviously you'll want to account for all human/mouse reads during mapping and quantitation. Maybe you could filter unwanted transcript measurements from your input files before tximport, but this doesn't scale nicely.
  2. Newer experiments where quantitation included a small number of "extra" transcripts such as virus or noncoding RNAs.
  3. Newer experiments where quantitation was performed using the same transcripts but sorted differently.

I've created a patch that allows the user to combine input files with a subset of common features. Maybe this is of general interest? Sorry about the random edits in the diff - if you update the repository and Github I could do a proper PR.

From 1cf014290593e9fceec4decfc364af9c34ecf905 Mon Sep 17 00:00:00 2001
From: Matt Shirley <mdshw5@gmail.com>
Date: Sun, 10 Apr 2016 14:30:36 -0400
Subject: [PATCH 1/4] Match rows during Salmon import, supporting import of
 sub/superset transcriptomes.

---
 R/tximport.R | 75 ++++++++++++++++++++++++++++++------------------------------
 1 file changed, 38 insertions(+), 37 deletions(-)

diff --git a/R/tximport.R b/R/tximport.R
index 93c1dc4..a80a4bc 100644
--- a/R/tximport.R
+++ b/R/tximport.R
@@ -4,7 +4,7 @@
 #' external software and optionally summarizes abundances, counts, and transcript lengths
 #' to the gene-level (default) or outputs transcript-level matrices
 #' (see \code{txOut} argument).
-#' While \code{tximport} summarizes to the gene-level by default, 
+#' While \code{tximport} summarizes to the gene-level by default,
 #' the user can also perform the import and summarization steps manually,
 #' by specifing \code{txOut=TRUE} and then using the function \code{summarizeToGene}.
 #' Note however that this is equivalent to \code{tximport} with
@@ -17,7 +17,7 @@
 #'   \item avoid gene-level summarization by specifying \code{txOut=TRUE}
 #'   \item set \code{geneIdCol} to an appropriate column in the files
 #' }
-#' 
+#'
 #' See \code{vignette('tximport')} for example code for generating a
 #' \code{tx2gene} data.frame from a \code{TxDb} object.
 #' Note that the \code{keys} and \code{select} functions used
@@ -41,16 +41,16 @@
 #' @param countsFromAbundance character, either "no" (default), "scaledTPM", or "lengthScaledTPM",
 #' for whether to generate estimated counts using abundance estimates scaled up to library size
 #' (scaledTPM) or additionally scaled using the average transcript length over samples and
-#' the library size (lengthScaledTPM). if using scaledTPM or lengthScaledTPM, 
+#' the library size (lengthScaledTPM). if using scaledTPM or lengthScaledTPM,
 #' then the counts are no longer correlated with average transcript length,
 #' and so the length offset matrix should not be used.
 #' @param tx2gene a two-column data.frame linking transcript id (column 1) to gene id (column 2).
-#' the column names are not relevant, but this column order must be used. 
+#' the column names are not relevant, but this column order must be used.
 #' this argument is required for gene-level summarization for methods
 #' that provides transcript-level estimates only
 #' (kallisto, Salmon, Sailfish)
 #' @param reader a function to replace read.delim in the pre-set importer functions,
-#' for example substituting read_tsv from the readr package will substantially 
+#' for example substituting read_tsv from the readr package will substantially
 #' speed up tximport
 #' @param geneIdCol name of column with gene id. if missing, the gene2tx argument can be used
 #' @param txIdCol name of column with tx id
@@ -66,7 +66,7 @@
 #' (default FALSE)
 #' @param txi list of matrices of trancript-level abundances, counts, and
 #' lengths produced by \code{tximport}, only used by \code{summarizeToGene}
-#' 
+#'
 #' @return a simple list with matrices: abundance, counts, length.
 #' A final element 'countsFromAbundance' carries through
 #' the character argument used in the tximport call.
@@ -77,20 +77,20 @@
 #' @describeIn tximport Import tx-level quantifications and summarize
 #' abundances, counts and lengths to gene-level (default)
 #' or simply output tx-level matrices
-#' 
+#'
 #' @references
 #'
 #' Charlotte Soneson, Michael I. Love, Mark D. Robinson (2015):
 #' Differential analyses for RNA-seq: transcript-level estimates
 #' improve gene-level inferences. F1000Research.
 #' \url{http://dx.doi.org/10.12688/f1000research.7563.1}
-#' 
+#'
 #' @examples
 #'
 #' # load data for demonstrating tximport
 #' # note that the vignette shows more examples
 #' # including how to read in files quickly using the readr package
-#' 
+#'
 #' library(tximportData)
 #' dir <- system.file("extdata", package="tximportData")
 #' samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
@@ -124,7 +124,7 @@ tximport <- function(files,
   countsFromAbundance <- match.arg(countsFromAbundance, c("no","scaledTPM","lengthScaledTPM"))
   stopifnot(all(file.exists(files)))
   if (!txIn & txOut) stop("txOut only an option when transcript-level data is read in (txIn=TRUE)")
-  
+
   # kallisto presets
   if (type == "kallisto") {
     geneIdCol="gene_id"
@@ -134,7 +134,7 @@ tximport <- function(files,
     lengthCol <- "eff_length"
     importer <- reader
     }
-  
+
   # salmon/sailfish presets
   if (type %in% c("salmon","sailfish")) {
     geneIdCol="gene_id"
@@ -142,9 +142,9 @@ tximport <- function(files,
     abundanceCol <- "TPM"
     countsCol <- "NumReads"
     lengthCol <- "EffectiveLength"
-    importer <- function(x) reader(x, comment='#') 
+    importer <- function(x) reader(x, comment='#')
   }
-  
+
   # rsem presets
   if (type == "rsem") {
     txIn <- FALSE
@@ -154,11 +154,11 @@ tximport <- function(files,
     lengthCol <- "effective_length"
     importer <- reader
   }
-  
+
   if (type == "cufflinks") {
     stop("reading from collated files not yet implemented")
   }
-  
+
   # if input is tx-level, need to summarize abundances, counts and lengths to gene-level
   if (txIn) {
     message("reading in files")
@@ -171,7 +171,7 @@ tximport <- function(files,
       if ((i == 1) &
           (type %in% c("salmon","sailfish")) &
           !("EffectiveLength" %in% names(raw))) {
-        lengthCol <- "Length" 
+        lengthCol <- "Length"
         # because the comment lines have the same comment character
         # as the header, need to name the column names
         importer <- function(x) {
@@ -183,7 +183,7 @@ tximport <- function(files,
         raw <- as.data.frame(importer(files[i]))
       }
       #####################################################################
-      
+
       # does the table contain gene association or was an external tx2gene table provided?
       if (is.null(tx2gene) & !txOut) {
         # e.g. Cufflinks includes the gene ID in the table
@@ -204,11 +204,11 @@ tximport <- function(files,
         }
       } else {
         # e.g. Salmon and kallisto do not include the gene ID, need an external table
-        stopifnot(all(c(lengthCol, abundanceCol) %in% names(raw)))
+        stopifnot(any(c(lengthCol, abundanceCol) %in% names(raw)))
         if (i == 1) {
           txId <- raw[[txIdCol]]
         } else {
-          stopifnot(all(txId == raw[[txIdCol]]))
+          stopifnot(any(txId == raw[[txIdCol]]))
         }
       }
       # create empty matrices
@@ -220,9 +220,11 @@ tximport <- function(files,
         countsMatTx <- mat
         lengthMatTx <- mat
       }
-      abundanceMatTx[,i] <- raw[[abundanceCol]]
-      countsMatTx[,i] <- raw[[countsCol]]
-      lengthMatTx[,i] <- raw[[lengthCol]]
+      idx <- match(txId, raw[[txIdCol]])
+      abundanceMatTx[,i] <- raw[idx, abundanceCol]
+      countsMatTx[,i] <- raw[idx, countsCol]
+      lengthMatTx[,i] <- raw[idx, lengthCol]
+      lengthMatTx[,i] <- raw[idx, lengthCol]
     }
     message("")

@@ -236,12 +238,12 @@ tximport <- function(files,

     txi[["countsFromAbundance"]] <- NULL
     txiGene <- summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance)
-    return(txiGene)  
-    
+    return(txiGene)
+
   # e.g. RSEM already has gene-level summaries
   # just combine the gene-level summaries across files
   } else {
-  
+
     message("reading in files")
     for (i in seq_along(files)) {
       message(i," ",appendLF=FALSE)
@@ -259,7 +261,7 @@ tximport <- function(files,
       countsMat[,i] <- raw[[countsCol]]
       lengthMat[,i] <- raw[[lengthCol]]
     }
-  } 
+  }
   message("")
   return(list(abundance=abundanceMat, counts=countsMat, length=lengthMat,
               countsFromAbundance="no"))
@@ -283,11 +285,11 @@ summarizeToGene <- function(txi,
   abundanceMatTx <- txi$abundance
   countsMatTx <- txi$counts
   lengthMatTx <- txi$length
-  
+
   txId <- rownames(abundanceMatTx)
   stopifnot(all(txId == rownames(countsMatTx)))
   stopifnot(all(txId == rownames(lengthMatTx)))
-  
+
   # need to associate tx to genes
   # potentially remove unassociated transcript rows and warn user
   if (!is.null(tx2gene)) {
@@ -309,20 +311,20 @@ summarizeToGene <- function(txi,
     txId <- txId[sub.idx]
     geneId <- tx2gene$gene[match(txId, tx2gene$tx)]
   }
-  
+
   # summarize abundance and counts
   message("summarizing abundance")
   abundanceMat <- rowsum(abundanceMatTx, geneId)
   message("summarizing counts")
   countsMat <- rowsum(countsMatTx, geneId)
   message("summarizing length")
-  
-  # the next lines calculate a weighted average of transcript length, 
+
+  # the next lines calculate a weighted average of transcript length,
   # weighting by transcript abundance.
   # this can be used as an offset / normalization factor which removes length bias
   # for the differential analysis of estimated counts summarized at the gene level.
   weightedLength <- rowsum(abundanceMatTx * lengthMatTx, geneId)
-  lengthMat <- weightedLength / abundanceMat   
+  lengthMat <- weightedLength / abundanceMat

   # pre-calculate a simple average transcript length
   # for the case the abundances are all zero for all samples.
@@ -332,14 +334,14 @@ summarizeToGene <- function(txi,
   aveLengthSampGene <- tapply(aveLengthSamp, geneId, mean)

   stopifnot(all(names(aveLengthSampGene) == rownames(lengthMat)))
-  
+
   # check for NaN and if possible replace these values with geometric mean of other samples.
   # (the geometic mean here implies an offset of 0 on the log scale)
-  # NaN come from samples which have abundance of 0 for all isoforms of a gene, and 
+  # NaN come from samples which have abundance of 0 for all isoforms of a gene, and
   # so we cannot calculate the weighted average. our best guess is to use the average
   # transcript length from the other samples.
   lengthMat <- replaceMissingLength(lengthMat, aveLengthSampGene)
-  
+
   if (countsFromAbundance != "no") {
     countsSum <- colSums(countsMat)
     if (countsFromAbundance == "lengthScaledTPM") {
@@ -350,7 +352,7 @@ summarizeToGene <- function(txi,
     newSum <- colSums(newCounts)
     countsMat <- t(t(newCounts) * (countsSum/newSum))
   }
-  
+
   return(list(abundance=abundanceMat, counts=countsMat, length=lengthMat,
               countsFromAbundance=countsFromAbundance))
 }
@@ -383,4 +385,3 @@ replaceMissingLength <- function(lengthMat, aveLengthSampGene) {
 ##            dimnames=list(levels(f), colnames(m)))
 ##   }
 ## }
-
-- 
2.5.4 (Apple Git-61)

From 26ce2bd0d703df9d210aff0115e0888c5343b1d2 Mon Sep 17 00:00:00 2001
From: Matt Shirley <mdshw5@gmail.com>
Date: Sun, 10 Apr 2016 14:41:25 -0400
Subject: [PATCH 2/4] Test for any membership.

---
 R/tximport.R | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/R/tximport.R b/R/tximport.R
index a80a4bc..d01bd68 100644
--- a/R/tximport.R
+++ b/R/tximport.R
@@ -208,7 +208,7 @@ tximport <- function(files,
         if (i == 1) {
           txId <- raw[[txIdCol]]
         } else {
-          stopifnot(any(txId == raw[[txIdCol]]))
+          stopifnot(any(txId %in% raw[[txIdCol]]))
         }
       }
       # create empty matrices
-- 
2.5.4 (Apple Git-61)

From 005267ccf9c4df054c2b92a034b7bea12a54c6ee Mon Sep 17 00:00:00 2001
From: Matt Shirley <mdshw5@gmail.com>
Date: Sun, 10 Apr 2016 15:13:19 -0400
Subject: [PATCH 3/4] Add argument exposing allowMissingIds.

---
 R/tximport.R | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/R/tximport.R b/R/tximport.R
index d01bd68..2fc53c6 100644
--- a/R/tximport.R
+++ b/R/tximport.R
@@ -118,7 +118,8 @@ tximport <- function(files,
                      lengthCol,
                      importer,
                      collatedFiles,
-                     ignoreTxVersion=FALSE) {
+                     ignoreTxVersion=FALSE,
+                     allowMissingIds=FALSE) {

   type <- match.arg(type, c("none","kallisto","salmon","sailfish","rsem"))
   countsFromAbundance <- match.arg(countsFromAbundance, c("no","scaledTPM","lengthScaledTPM"))
@@ -159,6 +160,9 @@ tximport <- function(files,
     stop("reading from collated files not yet implemented")
   }

+  # choose whether to allow missing transcripts
+  if (allowMissingIds) { allAny <- any } else { allAny <- all }
+  
   # if input is tx-level, need to summarize abundances, counts and lengths to gene-level
   if (txIn) {
     message("reading in files")
@@ -196,19 +200,20 @@ tximport <- function(files,

 ")
         }
-        stopifnot(all(c(lengthCol, abundanceCol) %in% names(raw)))
+        stopifnot(allAny(c(lengthCol, abundanceCol) %in% names(raw)))
         if (i == 1) {
           geneId <- raw[[geneIdCol]]
         } else {
-          stopifnot(all(geneId == raw[[geneIdCol]]))
+          stopifnot(allAny(geneId %in% raw[[geneIdCol]]))
         }
       } else {
         # e.g. Salmon and kallisto do not include the gene ID, need an external table
-        stopifnot(any(c(lengthCol, abundanceCol) %in% names(raw)))
+        stopifnot(allAny(c(lengthCol, abundanceCol) %in% names(raw)))
         if (i == 1) {
           txId <- raw[[txIdCol]]
         } else {
-          stopifnot(any(txId %in% raw[[txIdCol]]))
+          stopifnot(allAny(txId %in% raw[[txIdCol]]))
+          idx <- match(txId, raw[[txIdCol]])
         }
       }
       # create empty matrices
@@ -220,7 +225,6 @@ tximport <- function(files,
         countsMatTx <- mat
         lengthMatTx <- mat
       }
-      idx <- match(txId, raw[[txIdCol]])
       abundanceMatTx[,i] <- raw[idx, abundanceCol]
       countsMatTx[,i] <- raw[idx, countsCol]
       lengthMatTx[,i] <- raw[idx, lengthCol]
-- 
2.5.4 (Apple Git-61)

From 2870a1d82d1ffc4a3a576d1fe8bcb19e65cff2d0 Mon Sep 17 00:00:00 2001
From: Matt Shirley <mdshw5@gmail.com>
Date: Sun, 10 Apr 2016 15:20:04 -0400
Subject: [PATCH 4/4] Fix index generation logic.

---
 R/tximport.R | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/R/tximport.R b/R/tximport.R
index 2fc53c6..7816f74 100644
--- a/R/tximport.R
+++ b/R/tximport.R
@@ -162,7 +162,7 @@ tximport <- function(files,

   # choose whether to allow missing transcripts
   if (allowMissingIds) { allAny <- any } else { allAny <- all }
-  
+
   # if input is tx-level, need to summarize abundances, counts and lengths to gene-level
   if (txIn) {
     message("reading in files")
@@ -213,7 +213,6 @@ tximport <- function(files,
           txId <- raw[[txIdCol]]
         } else {
           stopifnot(allAny(txId %in% raw[[txIdCol]]))
-          idx <- match(txId, raw[[txIdCol]])
         }
       }
       # create empty matrices
@@ -225,6 +224,7 @@ tximport <- function(files,
         countsMatTx <- mat
         lengthMatTx <- mat
       }
+      idx <- match(txId, raw[[txIdCol]])
       abundanceMatTx[,i] <- raw[idx, abundanceCol]
       countsMatTx[,i] <- raw[idx, countsCol]
       lengthMatTx[,i] <- raw[idx, lengthCol]
-- 
2.5.4 (Apple Git-61)
mikelove commented 8 years ago

hi Matt,

Thanks for filing this issue and your contributions to tximport.

I think I'm going to keep the codebase as is (not incorporate this change) because for me this bug is really an important feature. I think in 99% of the cases, users should not be running DE tools on quantifications produced using different indices. I think, if you add transcripts to the index, you really need to re-quantify the old files using the same index, as a new index adds potential technical artifacts for comparison (changing the optimum reached by the EM). Luckily these quantifiers only take a few minute per sample, so even for massive projects, this takes 1-2 days. I suppose there could be an edge case where the files are quantified using the exact same index, but an index which was produced by resorting the transcript order. But this sound fishy and unlikely. I think the current check that stopifnot(all(txId == raw[[txIdCol]])) is basically a realistic check that you're not accidentally comparing files which were quantified with a different index.

All this being said, the solution if you really want to combine tximport-ed matrices across different indices is much simpler in my opinion if you do this after running tximport:

idx <- intersect(rownames(txi1$counts), rownames(txi2$counts))
counts.new <- cbind(txi1$counts[idx,], txi2$counts[idx,])

I'd prefer this post-hoc approach, so that users don't expect tximport will take care of files processed using a different index automatically.

mdshw5 commented 8 years ago

Yeah, I think this is reasonable. Thanks for your time, and the great tool!