Reads dropped in BAM file due to alignment to inconsistent strands

nlawlor commented 7 years ago

Hello,

I am using your software for stranded RNA-seq data to examine various QC variables and obtain gene expression count data.

Attached is my log file containing the commands I used to run the analysis: QC.sm15fFLgV6RW.log.txt

Briefly the parameters I used (as shown in log file) were:

java -Xmx50G -jar QoRTs.jar QC --generatePlots --RNA --stranded --addFunctions FPKM --genomeFA <ref_genome> <bam file> <gtf file> QoRTs/out

In my QC.summary.txt file (attached below) I noticed that only 9574 reads out of 25988490 total reads were passing QC. The majority of reads (24032924 reads) were being dropped due to "DROPPED_PAIR_STRANDS_MISMATCH". QC.summary.txt

Can you help me understand why so many reads are being "dropped" in my BAM file due to "DROPPED_PAIR_STRANDS_MISMATCH"? I used "STAR" to align the RNA-seq data and used the stranded options.

Please let me know if you need further information from me or other files from my analysis.

Thank you, Nathan

hartleys commented 7 years ago

I have never seen this before. That flag should only be raised when paired reads appear on the same strand. On all the protocols that I know of, paired-end reads should always appear on opposite strands from one another.

Can you attach the QoRTs log file? What sequencer and library prep protocol was used here?

Here's a link to some information on the different types of stranded paired-end reads.

Is it possible that one of the fastq files got reverse-complemented before alignment? Usually the bases in both paired fastq files should be in sequence-cycle order from left to right. Maybe in yours the second fastq was reverse complemented to put it on the same strand?

hartleys commented 7 years ago

If one of them did get reversed, it might be fairly obvious on the quality-score plots (although maybe not, since the only ones that would appear on this plot would be the oddballs). Can you attach the "qual.pair.median.png" plot?

nlawlor commented 7 years ago

I used the stranded TruSeq kit (which follows the "fr-unstranded" Illumina Standard) and believe this data was sequenced on an Illumina HiSeq 2500. I did not see a "qual.pair.median.png" plot in my output files, but I have attached the all QC summary plot PDF:

QC.multiPlot.pdf

hartleys commented 7 years ago

If it is unstranded, why are you using the "--stranded" option?

nlawlor commented 7 years ago

Whoops, sorry, I copied the wrong one. TruSeq stranded kit follows the "fr-firststrand" option.

hartleys commented 7 years ago

Hm. In that case I have no idea what could cause this. My only guess is that your pipeline must reverse complement one of the reads prior to alignment?

If you look at the bamfile using "samtools view" you should be able to see this in the SAM flags. Take a look at a few of the reads and read out the SAM flags by hand using this web tool. The sam flag is the 2nd column

So use a command like:

samtools view mybamfile.bam | less -S

And read off the 2nd column.

Usually, the "read reverse strand" and the "mate reverse strand" flags should always be opposite one another. Are they not, in your dataset?

nlawlor commented 7 years ago

I looked at a few of the reads in the BAM file and saw that the "read reverse strand" and the "mate reverse strand" flags were opposite of another.

For example for the two reads:

HWI-D00269:165:C8E8HANXX:8:1310:16700:49412 99 chr1 14453 255 125M = 14543 215

HWI-D00269:165:C8E8HANXX:8:1310:16700:49412 147 chr1 14543 255 125M = 14453 -215

The SAM flags of the first and second reads (99 and 147) seem to match up. This seemed to be the case for most of the read pairs (first 1000 lines of bam file is attached) bam.txt

I also used samtools to check how many reads appeared on the forward/reverse strands:

samtools view -F 0x10 mybam.bam -o forward.strand.reads.bam (reads not mapping to reverse strand) samtools view -f 0x10 mybam.bam -o reverse.strand.reads.bam (reads mapping to reverse strand)

Counting the total number of reads on each strand yielded: Forward: 26678516 Reverse: 26678516

nlawlor commented 7 years ago

After more investigation, I believe there must have been a problem with the library preparation steps. Repeating the library preparation and sequencing of this same sample yielded 0 reads dropped due to "DROPPED_PAIR_STRANDS_MISMATCH".

hartleys commented 7 years ago

That's really odd and interesting. Given the mechanics of how the sequencing happens, I would not think it would be possible for library prep to cause an issue like this.

Unless maybe the prep just failed catastrophically, and all the reads were all identical or just gibberish or something.

Did you try loading the bam file into IGV?

nlawlor commented 7 years ago

Yes I agree. We had the library preparation and sequencing performed at a separate facility so unfortunately, I am unable to determine what the exact error may have been in these steps.

I had 10 other biological replicates (same tissue type but from different individuals) that were also sequenced at this facility and they all showed the same issue (>90% of reads being dropped due to inconsistent strand alignment)... So I think it is safe to say these issues arose from prep/sequencing errors specific to this facility, especially because other public (and in-house) RNA-seq data I have ran through QoRTs do not produce this error.

For one particular biological sample, I have two BAM files, one generated from the outsourced facility (where > 90% reads are dropped due to strand alignment errors) and one from our in-house facility (0 reads dropped due to strand alignment errors). I looked at both of these BAM files in IGV, but did not see anything noticeably different about the read alignments.

hartleys / QoRTs

Reads dropped in BAM file due to alignment to inconsistent strands #42