gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
378 stars 78 forks source link

StringTie model extends beyond edge of reference #46

Closed mcsimenc closed 8 years ago

mcsimenc commented 8 years ago

Hi,

I used hisat2 to map a few sets of RNA-seq reads to my reference genome, then used stringtie with all defaults to assemble, then used stringtie --merge to combine assemblies from different sets. I tried extracting nucleotide sequences for the merged assembly from my reference genome and found that the coordinates of at least one StringTie model went beyond the edge of the reference: the model coords are 13646-16322 but the scaffold length is only 16303. I thought maybe there was a StringTie setting I could turn on that regulates this behavior but I couldn't find one. I couldn't find a setting in hisat2 either but maybe that's where I need to look more, what do you think?

Thanks! Matt

gpertea commented 8 years ago

Could you identify one sample for which the assembled transcripts have this issue (unless it's only in the --merge output, which would be unlikely) - and if so is it possible to send me a BAM file with only the read alignments on that scaffold (or better yet, just alignments in the region covered by the transcript with the over-extended end coordinate) ? As you know this should be easily accomplished by using 'samtools view -b ...' after indexing the BAM file. I'd like to determine if it's a HISAT2 SAM output problem (incorrect CIGAR string generated) or a bug in the StringTie's reading of those peculiar falling-off-the-edge-of-the-reference alignments. Either way it should be easy to fix once I see a live example of it :)

mcsimenc commented 8 years ago

The assembly for this sample has a feature from 15747-16321 on the scaffold scf7180000009721 which is 16303 long. Here is a BAM with reads aligned to that region (15650-16400). Thanks for the help, and let me know if you need anything else.

Sample_14_scf7180000009721_15650-16400.sorted.bam.zip

gpertea commented 8 years ago

I cannot open the 1024 bytes .bam file from the attached 311 bytes zip file (?). How large is the .bam file you extracted ? Could you upload it somewhere else, please -- or you could send me a PM to gpertea@jhu.edu, I could give you directions for ftp upload to our server.

mcsimenc commented 8 years ago

Whoops it must not have transferred from the cluster onto my computer correctly! This one should be the actual file.

The file was 64KB before zipping, but when I unzip it it becomes 372KB...but I tested it and it opens fine with samtools view.

Sample_14_scf7180000009721_15650-16400.sorted.bam.zip

gpertea commented 8 years ago

OK I can see now that HISAT2 somehow did not apply any soft clipping to those few spliced alignments falling off the end of the scaffold there - so the CIGAR string is wrong. Instead, the last match was extended to the end of the read, even though there was no more reference sequence actually "matching" at the end there. And sure, StringTie took that incorrect exon length and ran with it.. This was HISAT2 v2.0.3-beta, perhaps @infphilo can clarify if this bug was ever reported and fixed since that release.. (thanks Daehwan)

infphilo commented 8 years ago

I believe I fixed this issue last week - the fix is available in the master branch, and I plan to release the next version, v2.0.4, sometime soon (possibly this week).

gpertea commented 8 years ago

Yay, thanks Daehwan!

infphilo commented 8 years ago

FYI, I just wanted to let you know that I released HISAT2 2.0.4 today.

mcsimenc commented 8 years ago

Awesome! thank you!