hammerlab / guacamole

Spark-based variant calling, with experimental support for multi-sample somatic calling (including RNA) and local assembly
Apache License 2.0
84 stars 21 forks source link

exclude clipped reads from pileups #352

Open timodonnell opened 9 years ago

timodonnell commented 9 years ago

We currently include clipped (either S or N cigar operator) reads in the pleup at a locus. For RNA-seq this is a performance issue (and arguably a programming gotcha) since often the vast majority of pileup elements at a locus will be clipped (with the N=intron CIGAR operator, i.e. the locus is not included in the gapped alignment for that read). I think we should change the semantics of the Pileup class to always exclude pileup elements that are clipped, since really they're not properly part of the pileup. Possibly the right way to do this is to modify MappedRead so it overrides the HasReferenceRegion trait's overlapsLocus method to return false if the given locus is in a clipped part of that read.

Additionally, if it's doable, we should modify pileupFlatmap and friends to not send reads to tasks if all the loci assigned to that task are clipped in the read.

arahuja commented 9 years ago

I could see the advantage of having the reads at the pileup as well since there is a difference between no reads mapped to that region or all the reads skipped that region, i.e. skipping an exon or different transcripts etc.

Doesn't pileupFlatMap handle this already with LociMap? If you want to look at exonic loci anyways we have have the --loci-from-file take a GTF file of exonic locations and run pileupFlatMap over those regions only?

timodonnell commented 9 years ago

I think it should be possible to get at the clipped reads in a pileup if you want to, but after dealing with RNA a bit in Guacamole I think by far the most common case is you're not interested in clipped bases, as they're not really "aligned" to that locus. Soft clipping is really just an artifact of how gapped alignments are (awkwardly) represented.

timodonnell commented 9 years ago

I think the right compromise may be to have the convenience functions on the Pileup class (depth, positiveDepth, alleleElements) exclude clipped reads, but not go the more extreme route and remove them from the Pileup altogether