My bet would be there's something wrong with your setup. Can you give me the exact command used? And also the decoder table used?
Sure. I should have done that to begin with.
Here's my script:
require(JunctionSeq)
require(data.table)
sample.sheet<-fread("9-22-17_RNAseq_sample_list.txt", header=TRUE)
decoder.file <- "../decoderFile.txt"
decoder <- read.table(decoder.file, header=TRUE)
decoder <- merge(decoder, sample.sheet, by.x="sample.ID", by.y="Shipping.ID")
gff.file <- "../qoRTs_QC_out/withNovel.forJunctionSeq.gff.gz"
countFiles <- paste0("../qoRTs_QC_out/", decoder$sample.ID, "/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz")
jscs <- runJunctionSeqAnalyses(sample.files = countFiles,
                               sample.names = decoder$sample.ID,
                               condition = factor(decoder$Disease.status),
                               flat.gff.file = gff.file,
                               nCores = 32,
                               analysis.type = "junctionsAndExons",
                               debug.mode = TRUE
);
cat("\n\nWriting results...")
writeCompleteResults(jscs,
                     outfile.prefix = "./RNASeq_120-",
                     save.jscs = TRUE
);
cat("Done.")
Could you post the sample.sheet file too?
I can't post the exact sample.sheet because it contains protected data. I did verify that the merge works properly. The resulting data.frame has 120 rows and the appropriate number of columns, and looks correct on visual inspection. The column Disease.status contains the following factors: Pos.Symptomatic, Pos.Asymptomatic, Neg.Symptomatic, and CTRL.
Is there anything else I can check? Or I can post a subset of the sample.sheet. Let me know what's easiest for you.
Really appreciate it.
A couple other possibilities:
Try switching nCores to 1. The problem appears to be coming from BiocParallel, and many Bioconductor packages, including BiocParallel, have suffered a bit from cross-version behavior weirdness. If you set nCores = 1, JunctionSeq attempts single-core execution, bypassing BiocParallel entirely (see the snippet below).
Also: if you change nCores to some number other than 32, does the number in the error message change to that number?
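For instance, a minimal single-core variant of the call from your script above (same objects: countFiles, decoder, gff.file):
jscs <- runJunctionSeqAnalyses(sample.files = countFiles,
                               sample.names = decoder$sample.ID,
                               condition = factor(decoder$Disease.status),
                               flat.gff.file = gff.file,
                               nCores = 1,            # single-core execution; bypasses BiocParallel
                               analysis.type = "junctionsAndExons",
                               debug.mode = TRUE
);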
Other than that, my bet would be to check that the merge went properly and that there are no typos in the sample sheet. Common problems often involve stray whitespace or capitalization mismatches; a couple of quick checks are sketched below.
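For example, a few quick sanity checks on the merged decoder (using the column names from your script above; adjust if yours differ):
# any stray whitespace hiding in the sample IDs?
any(as.character(decoder$sample.ID) != trimws(as.character(decoder$sample.ID)))
# are the factor levels exactly the four you expect (no case variants)?
table(decoder$Disease.status, useNA = "ifany")
# were all 120 count-file paths built from matched sample IDs?
all(file.exists(countFiles))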
Thank you, I really appreciate it. That suggestion goes back to issue #30. Even using 5 cores, the job died after the 5-day allotment.
Question: could I split the QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz file by gene ID and submit multiple jobs?

Hmm. You can try switching to Junctions-Only mode. That should reduce the runtime. I have also found that the junction-based tests seem to have lower false discovery rates, both in real data and in simulation testing, so I usually prefer them anyway (see the sketch below).
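If you want to try that, it should just be a change to the analysis.type argument in the call you already have (a sketch, same objects as before):
jscs <- runJunctionSeqAnalyses(sample.files = countFiles,
                               sample.names = decoder$sample.ID,
                               condition = factor(decoder$Disease.status),
                               flat.gff.file = gff.file,
                               analysis.type = "junctionsOnly"   # junction tests only; skips the exon-level tests
);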
Other alternatives: split the genome into parts. The flattened GFF and the count files have 1-to-1 matching lines, so splitting them up shouldn't be too terribly difficult. Just make sure not to split genes (a rough file-level sketch is below; the gene-subsetting script further down is the more complete option).
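As a rough illustration of the file-level split (this assumes the feature IDs in column 1 of the count files start with the gene ID followed by a colon; check that against your own files, and remember the flat GFF has to be filtered by the same gene list):
cnts <- read.table(gzfile(countFiles[1]), sep = "\t", stringsAsFactors = FALSE,
                   col.names = c("feature", "count"))
geneID  <- sub(":.*$", "", cnts$feature)
batches <- split(unique(geneID), ceiling(seq_along(unique(geneID)) / 2500))
keep    <- geneID %in% batches[[1]]
write.table(cnts[keep, ], "counts.batch1.txt",
            sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)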
The real problem is that JunctionSeq was never designed with datasets of this scale in mind. It does a lot of dispersion-estimation processing that was only necessary to get around the small sample sizes typical of RNA-Seq datasets from that era (three years ago). With larger datasets, much of that processing is rendered moot.
But adapting the inner workings of the software to efficiently deal with large scale datasets is really a whole new project in its own right, and I don't think it's something that I have the free time for.
I also don't have a large RNA-Seq dataset on hand...
Here's a quick-and-dirty script that I threw together that pulls a subset of genes out of the dataset and runs on just that subset. You would run N copies of it in parallel, setting batchID to 1 through N, where N is the number of genes divided by the batch size (rounded up).
#Set a batch id for each job from 1:N
batchID <- 1;
#Set this to a reasonable number. Say 2500 genes, maybe?
batchSize <- 2500;
library("DESeq2")
library("JunctionSeq")
#Load the dataset:
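# (decoder, countFiles, and gff.file are assumed to be defined the same way as in
#  the full-dataset script posted earlier in this thread)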
design <- data.frame(condition = factor(decoder$group.ID));
jscs <- readJunctionSeqCounts(
    countfiles = countFiles,
    samplenames = decoder$sample.ID,
    design = design,
    flat.gff.file = gff.file
);
#Estimate size factors (based on full dataset):
jscs <- estimateJunctionSeqSizeFactors(jscs);
#cut the dataset down to a smaller batch of genes:
geneList <- as.character(unique(fData(jscs)$geneID));
length(geneList)
# 1-based batch windows: batch 1 gets genes 1..batchSize, batch 2 the next batchSize, etc.
batchGeneList <- geneList[(batchSize*(batchID-1)+1):min(batchSize*batchID, length(geneList))];
keep <- fData(jscs)$geneID %in% batchGeneList
sum(keep)
#Mess with all the internal references to cut the dataset down:
fData(jscs) <- fData(jscs)[keep,]
counts(jscs) <- counts(jscs)[keep,]
jscs@countVectors <- jscs@countVectors[keep,]
jscs@DESeqDataSet <- DESeqDataSetFromMatrix(countData = counts(jscs@DESeqDataSet)[keep,],
                                            colData = jscs@DESeqDataSet@colData,
                                            design = jscs@DESeqDataSet@design,
                                            ignoreRank = TRUE);
#From here on out it's basically just a normal JunctionSeq run.
# All the modifications are just optional suggestions
jscs <- estimateJunctionSeqDispersions(jscs, nCores = 1);
#get the fitted dispersions, because the dispersion fit is still needed for
# effect size estimation, etc.
jscs <- fitJunctionSeqDispersionFunction(jscs);
#use "noShare" mode for the actual hypothesis tests:
jscs <- fitJunctionSeqDispersionFunction(jscs, method.dispFinal = "noShare");
jscs <- testForDiffUsage(jscs, nCores = 1);
jscs <- estimateEffectSizes( jscs, nCores = 1);
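One way to drive this on a cluster (a hypothetical wrapper, not part of JunctionSeq itself) is to read batchID from the command line in place of the hard-coded value at the top, and give each batch its own output prefix:
args    <- commandArgs(trailingOnly = TRUE)
batchID <- as.integer(args[1])   # e.g. invoked as: Rscript runBatch.R 7
# ...run the batch script above with this batchID, then:
writeCompleteResults(jscs,
                     outfile.prefix = paste0("./RNASeq_120-batch", batchID, "-"),
                     save.jscs = TRUE
);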
Super helpful, thank you. I'll report back.
@hartleys, Thanks again. Your script splitting up the genes worked great. Also, I installed a fresh R (3.4.2) and Bioconductor (3.6) and the original error associated with BiocParallel seems to be fixed. Not sure what the problem was. I was originally on R 3.4.1 and Bioconductor 3.5.
Hi,
I'm running JunctionSeq on 120 RNASeq samples (4 groups, 30 samples per group). The analysis failed with the following error:
I believe the primary error is the Error in names(res) <- nms. I verified that my job had plenty of cores and RAM: I provided 32 cores and allotted 20G per core (700G; see #30). The max_vmem for the job was only 452G.
I really appreciate any help you can provide.