Stringtie merges paralogs

lassancejm commented 6 years ago

Hi,

I ran into an annoying issue when running Stringtie (version 1.3.4d) using alignments from Hisat2 (version 2.1.0).

This is an IGV screenshot illustrating the problem. The upper panel shows read alignment from a typical sample, the middle one corresponds to annotated genes (all members of a multi gene family and somewhat conserved in their 3' end), and the lower one is the output from stringtie merge (I ran Stringtie on individual samples and then ran merge).

As you can see, some predictions end up encompassing more than one gene. At this point, I am wondering whether there could be some issues with the mapping (I used Hisat2 with default parameters with exception of --dta --rna-strandness RF and no exon or splice-junction info were provided when building the genome index). Alternatively, are there parameters in Stringtie that would prevent this behavior?

Thanks for your insights!

gpertea commented 6 years ago

It looks like some of the read alignments extend across multiple genes, is that correct? So I presume the problem should appear even before the merge, at the sample assembly level -- due to those overextended alignments.. I think this is one of those cases where having a transcriptome mapping first (before the genome mapping) would help a lot. Tophat2 used to do that quite well but it was a slow & bulky solution, and now the closest thing to that is using Hisat2 with a genome index built with the --ss and --exon options, as you hinted. So indeed my advice would be to try that approach -- re-mapping with HISAT2 on a transcriptome-aware genome index. Hopefully that can fix the problem. StringTie operates primarily on read alignments as the main source of "evidence" when assembling the transcripts, and would not filter read alignments just because of mismatches/disagreements with the reference annotation..

lassancejm commented 6 years ago

Yes, it seems that some of the spliced reads connect the 3' regions of some genes, literally chaining them together. Per your advice, I will try rebuilding the hisat2 index and see if it improves things. Previously, I ran into some issues when trying to build the index with exon and splice-junction (I ran out of memory, even with 400Gb of RAM, yet the genome is human-size).

lassancejm commented 6 years ago

quick update: when trying to build the hisat2 index using the --ss and --exon options, I am (still) experiencing some out of memory error even though I am using 480Gb RAM... The genome is 2.6 Gb , and the splice_sites and exon files contain 293677 and 280334 lines, respectively. Any idea what could be wrong here?

gpertea commented 6 years ago

Unfortunately I have no clue about that, I haven't built such an index recently, and our servers where we would do that have at least 512GB RAM. This is a good question for @infphilo I would think.. You could probably submit an issue/question for HISAT2 about this.

lassancejm commented 6 years ago

Thanks for the info, I was looking for it but could not find where to post a question about Hisat2.

lassancejm commented 6 years ago

Unfortunately, it seems that several users reported the issues to @infphilo without getting an answer...

HaleDM commented 5 years ago

Hi, @lassancejm and @gpertea -- I want to weigh in here with a similar issue and some relevant information. I recently discovered a similar issue in my StringTie output -- after running 2-pass assembly on Hisat2 alignments, then using mstrg_prep.pl (to append ref_gene_name onto MSTRG gene_id) and prepDE.py to get raw counts, I discovered a collection of MSTRG gene ids with multiple ref_gene_name entries. Manually inspecting all transcripts for two of these instances revealed that multiple different ref_genes were being lumped under a single MSTRG gene ID. Blatting those transcript fastas confirmed this -- genes in close proximity in the genome were getting labeled as different transcripts of the same gene.

What's interesting is that, during the Hisat2 alignment process, I used both the -ss and -exon options, and still ran into this issue (but I could have made an error while generating those coordinate files).

gpertea commented 5 years ago

Since these situations are likely generated by alignments bridging neighboring genes through intergenic regions, there is little StringTie can do about that with the current (default) approach (of taking most of the alignment evidence at face value). But there might be ways to filter out this kind of "noisy" alignments. I'd say the "bridging" can happen in two ways:

alignments extending their "exonic" regions through the intergenic space until the terminal exons from neighboring genes end up overlapping (i.e. intergenic space seen as exonic)
one or more spurious spliced alignments suggest an intron straddling the intergenic space between the neighboring genes (i.e. intergenic space seen as intronic)

In the 1st case trimming should/might help but sometimes it cannot.. Perhaps if we implement some sort of reference "enforced" trimming that could remedy this. For the 2nd case a more stringent junction filter (increasing the -j value) could help clean up those spurious spliced alignments falsely connecting the genes. Not sure which of these situations are mostly responsible for this gene merging cases you observe in your data.

What do you observe more often in your case? If you look at the assembled transcripts in that region in IGV for example, or, perhaps easier, just at the merged ones (MSTRGs) there: is the intergenic space covered by an intron or by an exon of the MSTRG which spans those neighboring genes?

HaleDM commented 5 years ago

I would need to dig a little deeper to get a feel for which of your two proposed mechanisms is more common, but I see evidence for both (I think) with one such MSTRG gene_id that returned 3 unique ref_gene_names -- genes A and B appear to be bridged by an intron in the intergenic space, while genes B and C are bridged by a terminal exon in gene B overlapping the first exon of gene C.

Just by looking at the frequency of multiple gene names being returned with the mstrg_prep.pl, I don't think this is an alarming issue. Of 44,413 total MSTRG gene_ids, 26,518 have a gene name appended after the MSTRG id. Of these 26,000 named genes, only 645 appear to be bridged/have multiple ref_gene_names. Considering we're running DEG analysis using MSTRG identifiers, then linking those MSTRGs back to the gene name after tests are run, I'm not overly concerned about this affecting downstream analysis (we could always just manually inspect significant genes if they happen to be one of the 645 multiply named items). I guess the bridging of genes into transcripts could alter gene counts, though, which might complicate analysis, though.

Edit: Deleted a question concerning the difference between ref_gene_name and gene_name in .gtf output, as it has already been answered in a related thread.

mrijnkels commented 5 years ago

We are having the same issue, but I would definitely not want to run DEG with bridged genes as it will affect gene counts. that will not be easily teased out.

gpertea / stringtie

Stringtie merges paralogs #178