Closed mcsimenc closed 8 years ago
Could you identify one sample for which the assembled transcripts have this issue (unless it's only in the --merge output, which would be unlikely) - and if so is it possible to send me a BAM file with only the read alignments on that scaffold (or better yet, just alignments in the region covered by the transcript with the over-extended end coordinate) ? As you know this should be easily accomplished by using 'samtools view -b ...' after indexing the BAM file. I'd like to determine if it's a HISAT2 SAM output problem (incorrect CIGAR string generated) or a bug in the StringTie's reading of those peculiar falling-off-the-edge-of-the-reference alignments. Either way it should be easy to fix once I see a live example of it :)
The assembly for this sample has a feature from 15747-16321 on the scaffold scf7180000009721 which is 16303 long. Here is a BAM with reads aligned to that region (15650-16400). Thanks for the help, and let me know if you need anything else.
I cannot open the 1024 bytes .bam file from the attached 311 bytes zip file (?). How large is the .bam file you extracted ? Could you upload it somewhere else, please -- or you could send me a PM to gpertea@jhu.edu, I could give you directions for ftp upload to our server.
Whoops it must not have transferred from the cluster onto my computer correctly! This one should be the actual file.
The file was 64KB before zipping, but when I unzip it it becomes 372KB...but I tested it and it opens fine with samtools view.
OK I can see now that HISAT2 somehow did not apply any soft clipping to those few spliced alignments falling off the end of the scaffold there - so the CIGAR string is wrong. Instead, the last match was extended to the end of the read, even though there was no more reference sequence actually "matching" at the end there. And sure, StringTie took that incorrect exon length and ran with it.. This was HISAT2 v2.0.3-beta, perhaps @infphilo can clarify if this bug was ever reported and fixed since that release.. (thanks Daehwan)
I believe I fixed this issue last week - the fix is available in the master branch, and I plan to release the next version, v2.0.4, sometime soon (possibly this week).
Yay, thanks Daehwan!
FYI, I just wanted to let you know that I released HISAT2 2.0.4 today.
Awesome! thank you!
Hi,
I used hisat2 to map a few sets of RNA-seq reads to my reference genome, then used stringtie with all defaults to assemble, then used stringtie --merge to combine assemblies from different sets. I tried extracting nucleotide sequences for the merged assembly from my reference genome and found that the coordinates of at least one StringTie model went beyond the edge of the reference: the model coords are 13646-16322 but the scaffold length is only 16303. I thought maybe there was a StringTie setting I could turn on that regulates this behavior but I couldn't find one. I couldn't find a setting in hisat2 either but maybe that's where I need to look more, what do you think?
Thanks! Matt