tfwillems / HipSTR

Genotype and phase short tandem repeats using Illumina whole-genome sequencing data
GNU General Public License v2.0

Using HipSTR on RNA-seq data #58

Closed guillermomarco closed 6 years ago

guillermomarco commented 6 years ago

Hello,

I'm pretty interested in using HipSTR on RNA-seq data. Do you have any experience with this? Would it be possible to use a different aligner, like TopHat2 or HISAT2?

tfwillems commented 6 years ago

Hi @guillermomarco, Apologies for the slow reply. I think applying HipSTR to RNA-seq is a great idea and something I've played around with a little bit. There could potentially be a few additional filtering criteria you'll need to apply to obtain reliable variant calls (minimum expression level, distance from exon-exon junction, etc.), but in theory it's a really cool idea.

Unfortunately, I've only really explored the use of BWA-MEM alignments with HipSTR, but using alignments from other aligners could (and should) work. HipSTR does use the AS, XS, XA and SA BAM tags that BWA-MEM reports to ignore alignments that could potentially originate from many genomic locations. If you're interested in using alignments from these other aligners, I'd first verify that they report these tags in a manner that's consistent with BWA-MEM. If they don't, I'd be happy to work with you to tweak things so that HipSTR runs smoothly and correctly.
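A rough sketch of that verification step, written in plain Python over SAM text (for example the output of `samtools view`). The function name `count_tags` is made up for illustration and is not part of HipSTR. It only counts how often each of the four tags appears and with which type letter, since the same tag name can carry different kinds of values in different aligners; it does not prove that the values mean the same thing as BWA-MEM's.

```python
from collections import Counter

# Tags that, per the comment above, HipSTR reads from BWA-MEM alignments.
TAGS_OF_INTEREST = ("AS", "XS", "XA", "SA")


def count_tags(sam_lines):
    """Count records and occurrences of each tag of interest, keyed as 'TAG:TYPE'."""
    counts = Counter()
    n_records = 0
    for line in sam_lines:
        # Skip blank lines and SAM header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory columns
            continue
        n_records += 1
        # Optional fields follow the mandatory columns as TAG:TYPE:VALUE.
        for opt in fields[11:]:
            parts = opt.split(":", 2)
            if len(parts) == 3 and parts[0] in TAGS_OF_INTEREST:
                counts[f"{parts[0]}:{parts[1]}"] += 1
    return n_records, counts
```

Running this on a few thousand records from a BWA-MEM BAM and from the other aligner's BAM, then comparing the two tallies, should show quickly which tags are missing or typed differently.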

Let me know how it goes!

Best, Thomas

guillermomarco commented 6 years ago

Hello @tfwillems, the answer was really fast, not slow at all! I expected this answer, so I've already mapped my RNA data with BWA-MEM as well, to work with HipSTR. All the points you made are really helpful. I also appreciate the guide on how to build a HipSTR reference for non-reference genomes.

I currently don't know exactly which tags each of the aligners reports. The gold-standard RNA-seq aligners nowadays are TopHat2, HISAT2 and STAR, but since they allow split reads at splice events, I have no idea what the effect on HipSTR could be.

omansn commented 3 years ago

Hi @tfwillems ,

I'd like to reopen this issue (or I can post a new one if you prefer). I too would like to use HipSTR with RNA-seq data. Likely your priorities have changed since 2018 when you offered to tweak HipSTR to accommodate other aligners. But if not, this would be incredibly useful for me. Here are some additional notes on the various RNA-seq aligners and their compatibility with HipSTR.

In my benchmarking of splice-aware aligners, I've found that HISAT2 performs best at reducing the INDEL false positive rate. This is largely due to HISAT2's unique ability to use a priori splice junction annotations of your choosing, rather than weighting all canonical GT/AG splice junctions equally. Unfortunately, HISAT2 does not report the XA and SA tags, and its XS tag contains different information than BWA-MEM's. The other potential issue (as Guillermo pointed out) is that HISAT2 and STAR alignments contain N operations in their CIGAR strings, which HipSTR complains about. It is possible to split N-containing reads using GATK, but this creates two entries for each such read and hard clips the bases before/after the splice junction, which likely renders the reads useless because of their reduced length. Minimap2 performs slightly less well than HISAT2, but it uses soft-clipped supplementary alignments rather than N CIGAR operations to designate splice junctions.
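To make the N CIGAR point concrete, here is a small sketch in plain Python. The helper names (`parse_cigar`, `has_splice_gap`, `segment_lengths`) are hypothetical and not taken from HipSTR or GATK. It flags spliced alignments and reports how many aligned read bases sit on each side of a junction, which is roughly the length each piece would have if the read were split there.

```python
import re

# One CIGAR operation: a length followed by an operation letter from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(\d+[MIDNSHP=X])+")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs; '*' yields an empty list."""
    if cigar == "*":
        return []
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def has_splice_gap(cigar):
    """True if the alignment skips reference bases with an N operation."""
    return any(op == "N" for _, op in parse_cigar(cigar))


def segment_lengths(cigar):
    """Aligned read bases (M, I, =, X) in each segment between N operations."""
    segments, current = [], 0
    for n, op in parse_cigar(cigar):
        if op == "N":
            segments.append(current)
            current = 0
        elif op in "MI=X":
            current += n
    segments.append(current)
    return segments
```

For a 50 bp read aligned as `20M100N30M`, `segment_lengths` gives `[20, 30]`, so a split at the junction would leave pieces of 20 and 30 bases.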

Let me know what you think and if you have time, I'm happy to brainstorm.

Thanks, Nathan