hartleys / QoRTs

Quality of RNA-Seq Toolset

Possible TopHat error #38

Open LouisGil opened 7 years ago

LouisGil commented 7 years ago

Hi, I'm pretty new to the field of bioinformatics, and I'm having some issues running the program. We used the TopHat aligner to create some BAM files (I used --minMAPQ 50, because it was recommended). I don't understand whether my problem is that I'm using TopHat (which handles MAPQ differently) or whether I'm having memory problems. I'm running with -Xmx16G (and also tried 20G), and I'm only running a single BAM file of 5 GB with a GTF file of 92 MB. My current output is only a log file and QC.QORTS_RUNNING. Any help is appreciated.
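
For reference, the command was roughly the following (reconstructed from the arguments recorded in the log below, so the exact jar path and flag order may differ on my system):

    # roughly the invocation used; jar location is whatever your local QoRTs.jar is
    java -Xmx16G -jar QoRTs.jar QC \
         --stranded \
         --minMAPQ 50 \
         --generatePlots \
         --verbose \
         /root/Desktop/17-0361.alignments.bam \
         /root/Desktop/17-0361.transcripts.gtf \
         /root/Desktop/out3/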

This is the log file:

    Starting QoRTs v1.2.42 (Compiled Fri Jun 2 12:23:55 EDT 2017)
    Starting time: (Thu Jun 08 15:28:03 EDT 2017)
    INPUT_COMMAND(QC)
    INPUT_ARG(infile)=/root/Desktop/17-0361.alignments.bam
    INPUT_ARG(gtffile)=/root/Desktop/17-0361.transcripts.gtf
    INPUT_ARG(outdir)=/root/Desktop/out3
    INPUT_ARG(stranded)=true
    INPUT_ARG(minMAPQ)=50
    INPUT_ARG(generatePlots)=true
    INPUT_ARG(verbose)=true
    Created Log File: /root/Desktop/out3/QC.CSFjk5Pk6PnI.log
    Starting QC [Time: 2017-06-08 15:28:03] [Mem usage: [1687kB / 18MB]] [Elapsed Time: 00:00:00.0000]
    QoRTs is Running in paired-end mode.
    QoRTs is Running in any-sorted mode.
    No parameter --genomeFA found.
    NOTE: Function "overlapMatch" requires function "mismatchEngine". Adding "mismatchEngine" to the active function list...
    Running functions: CigarOpDistribution, GCDistribution, GeneCalcs, InsertSize, JunctionCalcs, NVC, QualityScoreDistribution, StrandCheck, chromCounts, cigarLocusCounts, mismatchEngine, overlapMatch, readLengthDistro, writeBiotypeCounts, writeClippedNVC, writeDESeq, writeDEXSeq, writeGeneBody, writeGeneCounts, writeGenewiseGeneBody, writeJunctionSeqCounts, writeKnownSplices, writeNovelSplices, writeSpliceExon
    Checking first 10000 reads.
    Checking SAM file for formatting errors...
    Note: Detected TopHat Alignment Program. Version: "2.0.7"
    IMPORTANT NOTE: Detected TopHat Alignment Program, version > 2.
      TopHat v2+ uses a different MAPQ convention than most aligners.
      Make sure you set the --minMAPQ parameter to 50 if you want to ignore multi-mapped reads.
    Stats on the first 10000 reads:
      Num Reads Primary Map: 1371
      Num Reads Paired-ended: 10000
      Num Reads mapped pair: 1323
      Num Pair names found: 1292
      Num Pairs matched: 31
      Read Seq length: 1 to 74
      Unclipped Read length: 1 to 74
      Final maxReadLength: 74
      maxPhredScore: 36
      minPhredScore: 14
    NOTE: Read length is not consistent. In the first 10000 reads, read length varies from 1 to 74 (param maxReadLength=74)
      Note that using data that is hard-clipped prior to alignment is NOT recommended, because this makes it difficult (or impossible) to determine the sequencer read-cycle of each nucleotide base. This may obfuscate cycle-specific artifacts, trends, or errors, the detection of which is one of the primary purposes of QoRTs! In addition, hard clipping (whether before or after alignment) removes quality score data, and thus quality score metrics may be misleadingly optimistic. A MUCH preferable method of removing undesired sequence is to replace such sequence with N's, which preserves the quality score and the sequencer cycle information.
    WARNING WARNING WARNING: Read length is not consistent, AND "--maxReadLength" option is not set!
      QoRTs has ATTEMPTED to determine the maximum read length (74).
      It is STRONGLY recommended that you use the --maxReadLength option to set the maximum possible read length, or else errors may occur if/when reads longer than 74 appear.
    Note: Data appears to be paired-ended.
    Sorting Note: Reads are not sorted by name (This is OK).
    Sorting Note: Reads are sorted by position (This is OK).
    Done checking first 10000 reads. WARNINGS FOUND!
    Starting getSRPairIterResorted...
    SAMRecord Reader Generated. Read length: 74. [Time: 2017-06-08 15:28:05] [Mem usage: [40MB / 219MB]] [Elapsed Time: 00:00:01.0740]

    Init GeneCalcs Utility
    Init InsertSize Utility
    Compiling flat feature annotation, internally in memory...
    FlatteningGtf: starting...(2017-06-08 15:28:28)
    FlatteningGtf: gtf file read complete.(2017-06-08 15:28:52)
    FlatteningGtf: Splice Junction Map read.(2017-06-08 15:28:53)
    FlatteningGtf: gene Sets generated.(2017-06-08 15:28:54)
    FlatteningGtf: Aggregate Sets built.
    FlatteningGtf: Compiling Aggregate Info . . . (2017-06-08 15:28:54)
    FlatteningGtf: Finished Compiling Aggregate Info. (2017-06-08 15:28:54)
    FlatteningGtf: Iterating through the step-vector...(2017-06-08 15:28:54)
    FlatteningGtf: Adding the aggregate genes themselves...(2017-06-08 15:28:55)
    FlatteningGtf: Iterating through the splice junctions...(2017-06-08 15:28:56)
    FlatteningGtf: Sorting the aggregate genes...(2017-06-08 15:28:59)
    FlatteningGtf: Folding the FlatGtfLine iterator...(2017-06-08 15:29:00)
    FlatteningGtf: Features Built.(2017-06-08 15:29:00)
    Internal flat feature annotation compiled!
    Init NVC utility
    Init CigarOpDistribution Utility
    Init QualityScoreDistribution Utility
    Init GC counts Utility
    Init JunctionCalcs utility
    length of knownSpliceMap after instantiation: 218191
    length of knownCountMap after instantiation: 218191
    Init StrandCheck Utility
    Init chromCount Utility
    Init qcCigarLocusCounts Utility
    Init OverlapMatch Utility
    Init MinorUtils Utility
    QC Utilities Generated! [Time: 2017-06-08 15:29:05] [Mem usage: [1936MB / 2392MB]] [Elapsed Time: 00:01:01.0225]
    NOTE: Unmatched Read-PAIR-Buffer Size > 100000 [Mem usage:[1450MB / 3557MB]]
      (This is generally not a problem, but if this increases further then OutOfMemoryExceptions may occur. If memory errors do occur, either increase memory allocation or sort the bam-file by name and rerun with the '--nameSorted' option. This might also indicate that your dataset contains an unusually large number of chimeric read-pairs. Or it could occur simply due to the presence of genomic loci with extremly high coverage or complex splicing. It may also indicate a SAM/BAM file that does not adhere to the standard SAM specification.)
    NOTE: Unmatched Read-PAIR-Buffer Size > 200000 [Mem usage:[2072MB / 3557MB]]
    NOTE: Unmatched Read Buffer Size > 100000 [Mem usage:[2066MB / 4053MB]]
      (This is generally not a problem, but if this increases further then OutOfMemoryExceptions may occur. If memory errors do occur, either increase memory allocation or sort the bam-file by name and rerun with the '--nameSorted' option. This might also indicate that your dataset contains an unusually large number of chimeric read-pairs. Or it could occur simply due to the presence of genomic loci with extremly high coverage. It may also indicate a SAM/BAM file that does not adhere to the standard SAM specification.)
    NOTE: Unmatched Read-PAIR-Buffer Size > 400000 [Mem usage:[2984MB / 4053MB]]
    NOTE: Unmatched Read-PAIR-Buffer Size > 800000 [Mem usage:[3980MB / 6GB]]
    NOTE: Unmatched Read Buffer Size > 200000 [Mem usage:[6GB / 11GB]]
    NOTE: Unmatched Read Buffer Size > 400000 [Mem usage:[8GB / 10GB]]
    NOTE: Unmatched Read-PAIR-Buffer Size > 1600000 [Mem usage:[7GB / 11GB]]
    NOTE: Unmatched Read-PAIR-Buffer Size > 3200000 [Mem usage:[13GB / 14GB]]


I believe this is from the verbose flag:

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at net.sf.samtools.BinaryTagCodec.readNullTerminatedString(BinaryTagCodec.java:414)
        at net.sf.samtools.BinaryTagCodec.readSingleValue(BinaryTagCodec.java:318)
        at net.sf.samtools.BinaryTagCodec.readTags(BinaryTagCodec.java:282)
        at net.sf.samtools.BAMRecord.decodeAttributes(BAMRecord.java:308)
        at net.sf.samtools.BAMRecord.getAttribute(BAMRecord.java:288)
        at net.sf.samtools.SAMRecord.isValid(SAMRecord.java:1566)
        at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:632)
        at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:618)
        at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:588)
        at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:774)
        at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:752)
        at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
        at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:930)
        at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:945)
        at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:985)
        at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:988)
        at internalUtils.stdUtils$$anon$2.hasNext(stdUtils.scala:372)
        at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:193)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:406)
        at internalUtils.commonSeqUtils$$anon$5.addNextPairToBuffer(commonSeqUtils.scala:1038)
        at internalUtils.commonSeqUtils$$anon$5.next(commonSeqUtils.scala:1077)
        at internalUtils.commonSeqUtils$$anon$5.next(commonSeqUtils.scala:1026)
        at internalUtils.stdUtils$IteratorProgressReporter$$anon$5.next(stdUtils.scala:493)
        at scala.collection.Iterator$class.foreach(Iterator.scala:743)
        at internalUtils.stdUtils$IteratorProgressReporter$$anon$5.foreach(stdUtils.scala:487)
        at qcUtils.runAllQC$.runOnSeqFile(runAllQC.scala:1285)
        at qcUtils.runAllQC$.run(runAllQC.scala:939)
        at qcUtils.runAllQC$allQC_runner.run(runAllQC.scala:628)
        at runner.runner$.main(runner.scala:97)
        at runner.runner.main(runner.scala)

hartleys commented 7 years ago

Hmm. It looks like something is severely wrong with your alignment / BAM file. When QoRTs looked at the first 10k reads, only about 1300 were marked as having an aligned mate, and only 31 of those pairs actually had both mates turn up within the first 10k reads.

This means that aligned mates are very far apart and rarely match up, so QoRTs has to hold a large number of reads whose mates haven't been seen yet in memory, which eventually causes the crash. But it shouldn't happen anywhere near this much. Something is extremely wrong here.

You could work around the QoRTs error by sorting your reads by name and rerunning with the --nameSorted option. But that doesn't address the fact that there is probably something catastrophically wrong with your alignment. Your read pairs shouldn't be so badly matched up. What species is this? Did you align to the wrong species? Did you mismatch the fastq files?
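
Roughly, the name-sorted rerun would look something like this (a sketch, assuming samtools is on your path and using the jar and paths from your log; adjust as needed):

    # sort the bam by read name (samtools 1.x syntax)
    samtools sort -n -o 17-0361.alignments.nameSorted.bam /root/Desktop/17-0361.alignments.bam

    # rerun QoRTs, telling it the input is name-sorted
    java -Xmx16G -jar QoRTs.jar QC \
         --nameSorted \
         --stranded \
         --minMAPQ 50 \
         17-0361.alignments.nameSorted.bam \
         /root/Desktop/17-0361.transcripts.gtf \
         /root/Desktop/out3/

It would also be worth running samtools flagstat on the original bam to see what fraction of reads are reported as properly paired; that should help confirm whether the alignment itself is the problem.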
