hartleys / QoRTs

Quality of RNA-Seq Toolset

Ambiguous Insert size plot #41

Open frankbioinfo opened 6 years ago

frankbioinfo commented 6 years ago

Hello Developer,

Thank you for developing this tool.

I have a question regarding the insert size plot in QC.makemultiplot.pdf.

Why does it have two peaks?

My command is below, and the plot is attached.

QoRTs-STABLE.jar QC \
    --generatePlots \
    --stranded \
    --keepMultiMapped \
    --numThreads 10 \
    SAMPLE.bam \
    Mus_musculus.NCBIM37.67.gtf \
    SAMPLE/

The BAM was generated using HISAT2, the FASTQ files came from the Illumina TruSeq Stranded protocol (126bp paired-end), and adapters were trimmed using fluxbar.

Thank you

[Attached plot: screen shot 2017-08-16 at 11 00 13 am]

hartleys commented 6 years ago

This is a common artifact usually caused by adapter trimming and/or the alignment. I'd have to look at the data a little more carefully to figure out the exact cause, but usually it happens when you trim the right-side adapters using a tool like fluxbar.

You can see that the dip occurs right below 126bp, which is the read length. It corresponds to read pairs that are more than fully overlapping: the fragment is shorter than the read, so the right-side bases of each read run into the opposite-strand adapter. These bases are hard-trimmed by fluxbar, meaning that QoRTs can no longer tell for sure how many bases the read originally covered (since for all it knows, trimming could have occurred on either end).

It's not actually a problem, just a quirk in how insert size is calculated for variable-length reads.
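
To make the 126bp threshold concrete, here is a minimal Python sketch of the arithmetic only. It is not QoRTs code and does not reproduce how QoRTs computes insert size; the function names are hypothetical, and the only value taken from this thread is the 126bp read length. It assumes both reads in a pair have the same untrimmed length and that the trimmer removes exactly the adapter bases.

```python
READ_LEN = 126  # read length from this thread (126bp paired-end)


def adapter_readthrough(fragment_len, read_len=READ_LEN):
    """Bases at the right-hand end of each read that fall in the adapter.

    Non-zero only when the fragment is shorter than the read.
    """
    return max(0, read_len - fragment_len)


def overlap_class(fragment_len, read_len=READ_LEN):
    """Classify a read pair by how its two mates overlap (hypothetical labels)."""
    if fragment_len < read_len:
        return "read-through"      # more than fully overlapping
    if fragment_len < 2 * read_len:
        return "overlapping"       # mates share some bases, no adapter
    return "non-overlapping"       # mates do not share any bases


def trimmed_read_len(fragment_len, read_len=READ_LEN):
    """Read length left after the adapter bases are hard-trimmed."""
    return read_len - adapter_readthrough(fragment_len, read_len)


if __name__ == "__main__":
    for frag in (100, 126, 200, 300):
        print(frag, overlap_class(frag),
              adapter_readthrough(frag), trimmed_read_len(frag))
```

Under these assumptions, only fragments shorter than 126bp produce reads that get shortened by trimming, which is the region where the dip shows up in the plot.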

frankbioinfo commented 6 years ago

@hartleys could you please help me understand whether it is advisable to use this data for downstream analysis, or whether I should change the trimming tool or do something else to avoid this? Many thanks.

hartleys commented 6 years ago

I don't see any problem with the data based on this plot.

It's just an artifact in the way that the insert size is calculated under these particular circumstances, not anything wrong with the data itself.