igvteam / igv

Integrative Genomics Viewer. Fast, efficient, scalable visualization tool for genomics data and annotations
https://igv.org
MIT License
644 stars 387 forks source link

Sashimi plot for paired-end reads #1146

Open cnzqy1 opened 2 years ago

cnzqy1 commented 2 years ago

I am working with sequencing data that has short insert sizes but was sequenced with 150 bp paired-end reads, leading to high levels of overlaps within the read pairs. When I visualized the data aligned by STAR in IGV with sashimi plot, I got the following:

Sashimi

Which comes from the following reads:

read pairs

However, you can see that these 4 reads come from 2 read pairs with 100% overlap within read pairs. Wouldn't counting the number of read pairs be more informative than counting the number of reads? In fact, in SJ.out.tab from STAR output, read pairs are counted (2 in this particular case).

jrobinso commented 2 years ago

Sorry I don't understand the point you are making.

cnzqy1 commented 2 years ago

Sorry I should have been more clear: is there an option to change the sashimi plot to count the number of read pairs with exon junctions instead of number of reads? If not is it possible to add such an option?

jrobinso commented 2 years ago

No its not. Everything is possible of course, no ETA on this as you can see from the number of open issues there is a lot on my plate.

cnzqy1 commented 2 years ago

No problem, thanks a lot!

jrobinso commented 2 years ago

This probably shouldn't even be an option, it should never double-count as it is now. However technically it is just counting reads. The situation you raise impacts regular dna coverage and anything else that relies on counting alignments at a location including snp calling. I think the best solution for this would be to filter what are really duplicate reads from the bam file itself.

cnzqy1 commented 2 years ago

Thanks for the follow-up! I wasn't thinking about any other applications but specifically for RNA-seq alignment and exon junction read counting. While filtering duplicate reads can help somewhat, there are also situations where the two reads in a pair aren't exactly duplicates (but with some overlapping part) and both span the same exon junction in the overlapping part, where IGV counts as 2. If the exon junction is not in the overlapping part then IGV counts as 1. In both cases however there is only 1 molecule spanning the exon junction that gives rise to the read pair, so from an expression level (i.e. number of molecules) standpoint it shouldn't be different where the exon junction is situated relative to the reads.

jrobinso commented 2 years ago

Yes understood, just pointing out that overlapping read pairs from short insert sizes cause other issues in analysis and visualization. These are technically not duplicates but effectively so, at least the overlapping part.