hartleys / JunctionSeq

The JunctionSeq R package is a powerful tool for testing and visualizing differential usage of exons and splice junctions in next-generation, high-throughput RNA-Seq experiments.
28 stars 15 forks source link

Error in estimateSizeFactorsForMatrix(counts.byGenes) #50

Open HeenaBioinfo opened 4 years ago

HeenaBioinfo commented 4 years ago

jscs.noNovel <- runJunctionSeqAnalyses(sample.files = countFiles.noNovel,

  • sample.names = sample_small$sample,
  • condition=factor(sample_small$clinGroup_PILOT),
  • flat.gff.file = gff.file,
  • nCores = 1,
  • analysis.type = "junctionsAndExons") STARTING runJunctionSeqAnalyses (v1.14.0) (Wed Jul 29 16:13:26 2020) rJSA: sample.files: /space/Junction_data/rawCts/S000084.l.r.m.c.lib.g.k2.a/QC.spliceJunctionAndExonCounts.forJunctionSeq.txt.gz, /space/Junction_data/rawCts/S000556.l.r.m3.c.lib.g.k2.a/QC.spliceJunctionAndExonCounts.forJunctionSeq.txt.gz, /space/Junction_data/rawCts/S000556.l.r.m4.c.lib.g.k2.a/QC.spliceJunctionAndExonCounts.forJunctionSeq.txt.gz, /space/Junction_data/rawCts/S000556.l.r.m8.c.lib.g.k2.a/QC.spliceJunctionAndExonCounts.forJunctionSeq.txt.gz rJSA: sample.names: S000084.l.r.m.c.lib.g.k2.a, S000556.l.r.m3.c.lib.g.k2.a, S000556.l.r.m4.c.lib.g.k2.a, S000556.l.r.m8.c.lib.g.k2.a rJSA: condition: ERpHER2p, ERpHER2nLNn, ERpHER2nLNn, ERpHER2nLNn rJSA: analysis.type: junctionsAndExons rJSA: use.junctions: TRUE rJSA: use.novel.junctions: TRUE rJSA: use.exons: TRUE rJSA: nCores: 1 rJSA: use.covars:
    rJSA: test.formula0: ~ sample + countbin rJSA: test.formula1: ~ sample + countbin + condition:countbin rJSA: use.multigene.aggregates: FALSE rJSA: Reading Count files... Wed Jul 29 16:13:26 2020. -> STARTING readJunctionSeqCounts (Wed Jul 29 16:13:26 2020) ---> RJSC; (v1.14.0) ---> RJSC: samplenames: S000084.l.r.m.c.lib.g.k2.a,S000556.l.r.m3.c.lib.g.k2.a,S000556.l.r.m4.c.lib.g.k2.a,S000556.l.r.m8.c.lib.g.k2.a ---> RJSC: flat.gff.file: /space/Junction_data/flattened.gtf.gz ---> RJSC: use.exons:TRUE ---> RJSC: use.junctions:TRUE ---> RJSC: use.novel.junctions:TRUE ---> File read complete. ---> Extracted counts. Found 669130 features so far. ---> Extracted gene-level counts. Found: 47257 genes and aggregate-genes. ---> Removed gene features. Found: 621873 features to be included so far. ---> Note: 524335 counting bins from overlapping genes ---> There are 16295 multigene aggregates. ---> There are 73216 genes that are part of an aggregate. ---> Removed multigene-aggregate features. Found: 97538 features to be included so far. ---> Final feature count: 97538 features to be included in the analysis. ---> Extracted feature counts. ---> counts complete. -----> reading annotation... -----> formatting annotation... -----> initial generation... -----> creating jscs... -----> generating count vectors... (Wed Jul 29 16:14:13 2020) Using single-core execution. getAllJunctionSeqCountVectors: dim(counts) = 97538,4 (Wed Jul 29 16:14:13 2020) getAllJunctionSeqCountVectors: dim(gct) = 47257,4 getAllJunctionSeqCountVectors: out generated. dim = 97538,8 (Wed Jul 29 16:14:16 2020) -----> count vectors generated (Wed Jul 29 16:14:16 2020) -----> generating DESeqDataSet... (Wed Jul 29 16:14:16 2020) -----> DESeqDataSet generated (Wed Jul 29 16:14:16 2020) rJSA: Count files read. Wed Jul 29 16:14:16 2020. rJSA: Estimating Size Factors... Wed Jul 29 16:14:16 2020. Error in estimateSizeFactorsForMatrix(counts.byGenes) : every gene contains at least one zero, cannot compute log geometric means

Hi Can you please guide me through this error?

HeenaBioinfo commented 4 years ago

I am not able to generate the size factors, please guide..

hartleys commented 4 years ago

This occurs commonly in large samples and/or single-cell data. It can also occur if the samples are extremely variable and very different from one another (like in single-cell data). The geometric size factor calculation used by DESeq simply fails in this situation.

I would recommend skipping the size factor calculation step and replacing it with your own. For large samples / single sample data the Upper Quartile method seems to be gaining traction as the industry standard.

I would use the step-by-step analysis pipeline from the manual, but replace the estimateJunctionSeqSizeFactors step with the following:

#replace the estimation function with a version that takes an external size factor. 
# I Should have added this long ago...
estimateJunctionSeqSizeFactors.forced <- function( jscs , sizeFactors, verbose = FALSE){
   stopifnot( is( jscs, "JunctionSeqCountSet") )
   sizeFactors(jscs) <- sizeFactors
   sizeFactors(jscs@DESeqDataSet) <- rep(sizeFactors(jscs),2)
   fData(jscs)$baseMean <- rowMeans(counts(jscs, normalized= T))
   fData(jscs)$baseVar <- rowVars(counts(jscs, normalized=T))
   jscs@altSizeFactors <- data.frame()
   return(jscs);                         
}

#Calculate size factors using your favorite method.
#I recommend the upper quartile method, shown below:
SF <- sapply(1:ncol(jscs@geneCountData),function(i){
   quantile(jscs@geneCountData[,i],p=0.75);
})
#Now copy in those size factors:
jscs <- estimateJunctionSeqSizeFactors.forced(jscs,sizeFactors=SF)

For the future I'll fold this into the codebase with the next update.