hartleys / JunctionSeq

The JunctionSeq R package is a powerful tool for testing and visualizing differential usage of exons and splice junctions in next-generation, high-throughput RNA-Seq experiments.

Error in names(res) <- nms #31

Closed · mebbert closed this 6 years ago

mebbert commented 6 years ago

Hi,

I'm running JunctionSeq on 120 RNASeq samples (4 groups, 30 samples per group). The analysis failed with the following error:

---> STARTING estimateJunctionSeqDispersions: (v1.6.0) (Mon Jan 15 09:59:51 2018)
-----> ejsd: 657434 counting bins are marked 'testable'. across 25306 genes.
       (432269 exonic regions, 221703 known junctions, 3462 novel junctions)
---------> Executing DESeq2 call: estimateUnsharedDispersions using supplied model matrix
Error in names(res) <- nms :
  'names' attribute [32] must be the same length as the vector [1]
Calls: runJunctionSeqAnalyses ... estimateUnsharedDispersions -> bplapply -> bplapply
In addition: Warning message:
stop worker failed: 'clear_cluster' receive data failed: reached elapsed time limit
Execution halted

I believe the primary error is the Error in names(res) <- nms.

I verified my job had plenty of cores and RAM: I provided 32 cores and allotted 20G per core (700G; see #30). The max_vmem for the job was only 452G.

I really appreciate any help you can provide.

hartleys commented 6 years ago

My bet would be there's something wrong with your setup. Can you give me the exact command used? And also the decoder table used?

mebbert commented 6 years ago

Sure. I should have done that to begin with.

Here's my script:


require(JunctionSeq)
require(data.table)

# Merge the sample sheet with the QoRTs decoder on the shared sample ID:
sample.sheet <- fread("9-22-17_RNAseq_sample_list.txt", header=TRUE)
decoder.file <- "../decoderFile.txt"
decoder <- read.table(decoder.file, header=TRUE)
decoder <- merge(decoder, sample.sheet, by.x="sample.ID", by.y="Shipping.ID")

# Flattened annotation and per-sample count files produced by QoRTs:
gff.file <- "../qoRTs_QC_out/withNovel.forJunctionSeq.gff.gz"
countFiles <- paste0("../qoRTs_QC_out/", decoder$sample.ID, "/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz")

jscs <- runJunctionSeqAnalyses(sample.files = countFiles,
                               sample.names = decoder$sample.ID,
                               condition = factor(decoder$Disease.status),
                               flat.gff.file = gff.file,
                               nCores = 32,
                               analysis.type = "junctionsAndExons",
                               debug.mode = TRUE
                               );

cat("\n\nWriting results...")
writeCompleteResults(jscs,
                     outfile.prefix = "./RNASeq_120-",
                     save.jscs = TRUE
                     );
cat("Done.")

decoderFile.txt

hartleys commented 6 years ago

Could you post the sample.sheet file too?

mebbert commented 6 years ago

I can't post the exact sample.sheet because it contains protected data, but I did verify that the merge works properly: the resulting data.frame has 120 rows and the appropriate number of columns, and it looks correct on visual inspection. The Disease.status column contains the following factor levels: Pos.Symptomatic, Pos.Asymptomatic, Neg.Symptomatic, and CTRL.

Is there anything else I can check? Or, I can post a subset of the sample.sheet. Let me know what's easiest for you.

Really appreciate it.

hartleys commented 6 years ago

A couple other possibilities:

Try switching nCores to 1. The problem appears to be coming from BiocParallel, and many of the Bioconductor packages, including BiocParallel, have suffered a bit from cross-version behavioral quirks. If you set nCores = 1, JunctionSeq will attempt single-core execution, bypassing BiocParallel.
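
For example, a minimal single-core sketch, reusing the countFiles, decoder, and gff.file objects from your script above:

# Same call as before, but with nCores = 1 to bypass BiocParallel entirely:
jscs <- runJunctionSeqAnalyses(sample.files = countFiles,
                               sample.names = decoder$sample.ID,
                               condition = factor(decoder$Disease.status),
                               flat.gff.file = gff.file,
                               nCores = 1,
                               analysis.type = "junctionsAndExons")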

Also: if you change nCores to some number other than 32, does the number in the error message change to that number?

Other than that, my best bet would be to check that the merge went properly and that you don't have any typos in the sample sheet. Common problems often involve whitespace or capitalization...
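
A quick sanity check along these lines (a rough sketch, run on your merged decoder) would surface either problem:

# Look for hidden whitespace or capitalization mismatches in the condition column:
ds <- as.character(decoder$Disease.status)
any(ds != trimws(ds))   # TRUE would indicate leading/trailing whitespace
table(ds)               # the four groups should each show 30 samples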

mebbert commented 6 years ago

Thank you, I really appreciate it. That suggestion goes back to issue #30. Even using 5 cores, the job died after the 5-day allotment.

Questions:

  1. Any idea how long this would take using a single core?
  2. Can I safely break up the QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz file by gene ID and submit multiple jobs?

hartleys commented 6 years ago

Hmm. You can try switching to junctions-only mode; that should reduce the runtime. I have also found that the junction-based tests seem to have lower false discovery rates, both in real data and in simulation testing, so I usually prefer them anyway.
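
That would just mean changing the analysis.type argument in your original call, e.g. (a sketch; all other arguments as before):

# Junctions-only analysis: fewer counting bins, so a much shorter runtime:
jscs <- runJunctionSeqAnalyses(sample.files = countFiles,
                               sample.names = decoder$sample.ID,
                               condition = factor(decoder$Disease.status),
                               flat.gff.file = gff.file,
                               nCores = 1,
                               analysis.type = "junctionsOnly")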

Other alternatives: split the genome into parts. The flattened GFF and the count files have 1-to-1 matching lines, so splitting them up shouldn't be too terribly difficult. Just make sure not to split genes.

The real problem is that JunctionSeq was never designed with datasets of this scale in mind. It does a bunch of dispersion-estimation processing that was only necessary to get around the small sample sizes typical of RNA-Seq datasets from that era (three years ago). With larger datasets, much of that processing is rendered moot.

But adapting the inner workings of the software to efficiently deal with large scale datasets is really a whole new project in its own right, and I don't think it's something that I have the free time for.

I also don't have a large RNA-Seq dataset on hand...

hartleys commented 6 years ago

Here's a quick-and-dirty script I threw together that pulls a subset of genes out of the dataset and runs on that subset. You would run these in parallel, setting batchID to 1:N, where N is the number of genes divided by the batch size (rounded up).

#Set a batch id for each job from 1:N
batchID <- 1;
#Set this to a reasonable number. Say 2500 genes, maybe?
batchSize <- 2500;

library("DESeq2")
library("JunctionSeq")
#Load the dataset (use your decoder's condition column; in your case, decoder$Disease.status):
design <- data.frame(condition = factor(decoder$group.ID));
jscs <- readJunctionSeqCounts(
   countfiles = countFiles,
   samplenames = decoder$sample.ID,
   design = design,
   flat.gff.file = gff.file
);
#Estimate size factors (based on full dataset):
jscs <- estimateJunctionSeqSizeFactors(jscs);

#cut the dataset down to a smaller batch of genes:
geneList <- as.character(unique(fData(jscs)$geneID));
length(geneList)
#select this batch's genes (1-based: batch 1 covers genes 1..batchSize, and the
# final batch runs through the last gene):
batchGeneList <- geneList[(batchSize*(batchID-1)+1):min(batchSize*batchID, length(geneList))];
keep <- fData(jscs)$geneID %in% batchGeneList
sum(keep)

#Mess with all the internal references to cut the dataset down:
fData(jscs) <- fData(jscs)[keep,]
counts(jscs) <- counts(jscs)[keep,]
jscs@countVectors <- jscs@countVectors[keep,]
jscs@DESeqDataSet <- DESeqDataSetFromMatrix(countData = counts(jscs@DESeqDataSet)[keep,], 
                      colData = jscs@DESeqDataSet@colData, 
                      design = jscs@DESeqDataSet@design, 
                      ignoreRank = TRUE);

#From here on out it's basically just a normal JunctionSeq run.
# All the modifications are just optional suggestions
jscs <- estimateJunctionSeqDispersions(jscs, nCores = 1);

#get the fitted dispersions, because the dispersion fit is still needed for 
# effect size estimation, etc.
jscs <- fitJunctionSeqDispersionFunction(jscs);
#use "noShare" mode for the actual hypothesis tests:
jscs <- fitJunctionSeqDispersionFunction(jscs,  method.dispFinal = "noShare");
jscs <- testForDiffUsage(jscs, nCores = 1);
jscs <- estimateEffectSizes( jscs, nCores = 1);
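
One way to set batchID per job (a hypothetical wrapper, not part of the script above; adjust to your scheduler) is to read it from the command line at the top of the script:

# Hypothetical wrapper: read the batch number from the command line, e.g.
#   Rscript runJunctionSeqBatch.R 7
# so the same script can be submitted N times with batchID set to 1..N:
args <- commandArgs(trailingOnly = TRUE)
batchID <- as.integer(args[1])
stopifnot(!is.na(batchID), batchID >= 1)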

mebbert commented 6 years ago

Super helpful, thank you. I'll report back.

mebbert commented 6 years ago

@hartleys, thanks again. Your script splitting up the genes worked great. Also, I installed a fresh R (3.4.2) and Bioconductor (3.6), and the original error associated with BiocParallel seems to be fixed; not sure what the problem was. I was originally on R 3.4.1 and Bioconductor 3.5.