gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
378 stars, 78 forks

Superreads assigned to opposite strand #279

Open olawa opened 4 years ago

olawa commented 4 years ago

Hi, I am assembling a large collection of Illumina reads with the superreads scripts in order to build a (hopefully) improved transcriptome. In the output BAM file the superreads appear to be on the wrong strand, which sometimes gives rise to false transcripts on the opposite strand of highly expressed genes. This is problematic not only because it creates noise, but also because the false transcripts can hide real transcripts with lower expression.

Is there a way to fix this with StringTie without having to change the SAM files or redo the mapping?

[Screenshot 2020-06-11 at 22.37.50]

olawa commented 4 years ago

After a closer look, the superreads are probably unstranded and the issue more likely lies in the merging step. In this case a single-exon gene is highly expressed in some samples, and a small portion of its reads are spliced and interfere with a real gene downstream.

My guess is that the TPM estimate for stray transcripts from regions of high coverage is inflated, and that such a transcript somehow gets bundled together with real transcripts even though they do not share any exons. This image illustrates one such case, where the spliced short read comes from the poly-A tail.

[Screenshot 2020-06-11 at 23.38.08]

olawa commented 4 years ago

So the issue remains, but it comes down to two points: 1) transcripts whose last exon consists only of A or T should probably be flagged or removed, and 2) for transcript merging, perhaps you could output the filtered isoforms to a separate file (or is it possible to just run with -f 0?).
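
To make point 1 concrete, here is a rough post-filtering sketch in Python. It is not part of StringTie; the function names (`parse_exons`, `is_a_or_t_run`, `flag_polya_last_exon`) and the `min_fraction` parameter are made up for illustration. It assumes the assembled GTF is available as text lines, that the genome is held in memory as a dict of chromosome name to sequence string, and that "only A or T" means the terminal exon is a run of a single base, A or T. For transcripts without a strand, both terminal exons are checked.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def parse_exons(gtf_lines):
    """Collect (chrom, start, end) exon intervals and the strand per transcript."""
    exons = defaultdict(list)
    strands = {}
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        tid = match.group(1)
        # GTF coordinates are 1-based and inclusive.
        exons[tid].append((fields[0], int(fields[3]), int(fields[4])))
        strands[tid] = fields[6]
    return exons, strands


def is_a_or_t_run(seq, min_fraction=1.0):
    """True if at least min_fraction of seq is one base, either A or T."""
    seq = seq.upper()
    if not seq:
        return False
    return max(seq.count("A"), seq.count("T")) / len(seq) >= min_fraction


def flag_polya_last_exon(gtf_lines, genome, min_fraction=1.0):
    """Return IDs of spliced transcripts whose terminal exon is an A or T run.

    genome is assumed to be a dict: chromosome name -> sequence string.
    """
    exons, strands = parse_exons(gtf_lines)
    flagged = []
    for tid, intervals in exons.items():
        if len(intervals) < 2:
            continue  # single-exon transcripts are left alone
        intervals.sort(key=lambda exon: exon[1])
        if strands[tid] == "+":
            terminal = [intervals[-1]]
        elif strands[tid] == "-":
            terminal = [intervals[0]]  # last exon has the lowest coordinates
        else:
            terminal = [intervals[0], intervals[-1]]  # strand unknown: check both
        for chrom, start, end in terminal:
            if is_a_or_t_run(genome[chrom][start - 1:end], min_fraction):
                flagged.append(tid)
                break
    return flagged
```

Calling `flag_polya_last_exon(open("merged.gtf"), genome)` (file name hypothetical) would give a list of transcript IDs to inspect or drop before merging. Lowering `min_fraction` below 1.0 would also catch exons that are mostly, but not entirely, A or T.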