gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
365 stars 76 forks source link

One transcript with several gene names #159

Open manuelsmendoza opened 6 years ago

manuelsmendoza commented 6 years ago

Hello!

I am trying to analyse differential gene expression in 83 samples split up into three groups following the protocol described by Pertea et al., 2016 (doi:10.1038/nprot.2016.095).

Now I have two CSV files for each pairwise comparison, one with the transcripts results and other with gene results. I have assigned a column with the gene name to the gene results file following Freeze's instructions (https://www.biostars.org/p/218136/).

A lot of rows have a dot "." as gene name... I suppose that it means "unknown gene". Is it correct? But when I look for these transcripts in the transcripts results file I find that the same transcript has multiple gene names. What's happen?

A technical question: What does StringTie do when a transcript matches with multiple exomes/genes? How does StringTie count it?

Gene results:

    geneNames feature          id        fc         pval       qval       exp_sig
1385            .    gene MSTRG.18605 0.3612696 8.836918e-05 0.02780747 Downregulated
2009            .    gene  MSTRG.2251 0.3705158 1.723619e-04 0.03990178 Downregulated
3565            .    gene MSTRG.31880 0.2855206 2.766333e-05 0.02210375 Downregulated
3577            .    gene  MSTRG.3192 0.4190300 2.500196e-04 0.04590761 Downregulated
7730            .    gene MSTRG.52616 0.4974403 1.902062e-04 0.04187998 Downregulated
8791 LOC102724999    gene MSTRG.57635 0.4391518 7.886925e-05 0.02780747 Downregulated
8833         RRM2    gene  MSTRG.5791 0.1491026 9.653982e-06 0.01517076 Downregulated
8941            .    gene MSTRG.58419 0.4839879 2.913421e-04 0.04883999 Downregulated
9248            .    gene  MSTRG.7286 0.4853837 4.294071e-05 0.02455956 Downregulated

Extract info about MSTRG.18605 in transcript file

          geneNames     geneIDs    feature    id        fc        pval      qval
12632         . MSTRG.18605 transcript 92336 0.8248940 0.348378059 0.8105198
12634         . MSTRG.18605 transcript 92338 0.5838309 0.180876736 0.7302217
12635         . MSTRG.18605 transcript 92339 0.4058340 0.051912182 0.5780383
12636         . MSTRG.18605 transcript 92340 0.3070723 0.002123351 0.2270620
12637         . MSTRG.18605 transcript 92341 0.8145933 0.510077645 0.8781143
12638     HLA-C MSTRG.18605 transcript 92342 0.1351463 0.002860136 0.2504040
12639         . MSTRG.18605 transcript 92343 1.0737243 0.810680044 0.9544281
12640         . MSTRG.18605 transcript 92344 0.4356676 0.012957412 0.3988734
12641         . MSTRG.18605 transcript 92345 0.5661032 0.014102596 0.4047605
12642         . MSTRG.18605 transcript 92346 0.4012696 0.048469168 0.5719337
12643         . MSTRG.18605 transcript 92347 0.2884399 0.004146397 0.2847510
12644         . MSTRG.18605 transcript 92348 0.3167699 0.033571700 0.5162445
12645         . MSTRG.18605 transcript 92349 0.3484434 0.017986318 0.4343761
12650         . MSTRG.18605 transcript 92355 1.1018119 0.684596561 0.9242470
12656         . MSTRG.18605 transcript 92362 1.2783697 0.407525370 0.8302648
12657         . MSTRG.18605 transcript 92363 0.6356763 0.383143717 0.8199931
12658         . MSTRG.18605 transcript 92364 0.9109425 0.752058633 0.9371627
12659         . MSTRG.18605 transcript 92365 0.7067905 0.098177182 0.6499381
12661         . MSTRG.18605 transcript 92367 0.7979155 0.371616108 0.8166484
12663         . MSTRG.18605 transcript 92372 0.8521393 0.347975902 0.8105198
12664         . MSTRG.18605 transcript 92373 1.1114767 0.630943495 0.9137496
12665         . MSTRG.18605 transcript 92374 1.1819176 0.227935366 0.7639951
12667         . MSTRG.18605 transcript 92376 0.8134280 0.242299301 0.7643093
12668         . MSTRG.18605 transcript 92377 0.8685366 0.616908528 0.9100701
12669     HLA-B MSTRG.18605 transcript 92378 0.4077784 0.080955383 0.6381334
12670         . MSTRG.18605 transcript 92380 1.0450625 0.892268294 0.9766652
12671     HLA-B MSTRG.18605 transcript 92381 0.8218658 0.667177666 0.9193687
12672     HLA-B MSTRG.18605 transcript 92382 0.7802102 0.441900921 0.8454283
12673         . MSTRG.18605 transcript 92383 1.1639513 0.778891293 0.9450891
12675         . MSTRG.18605 transcript 92386 0.5742808 0.170972054 0.7188115
12676   MIR6891 MSTRG.18605 transcript 92387 0.7063845 0.054794258 0.5844075
12677   MIR6891 MSTRG.18605 transcript 92389 1.3352687 0.217143588 0.7592203
gpertea commented 6 years ago

Firstly, please note that the MSTRG.# format you show in the tables above are not transcript IDs, they are gene (locus) IDs (the transcript IDs will have a different format, with another number added: MSTRG.#.#).

Currently StringTie is very strict with assigning reference gene names to the assembled transcripts and it only does it for those fully (exactly) matching reference transcripts. So all those alternate transcripts will get the dot "." instead in those tables.. which is obviously not informative at all. I agree that it would be a good idea to assign such gene names to those alternate transcript assemblies as long as they overlap a "known" reference transcript.. Of course there will be "novel" assemblies which might not overlap any reference transcript/gene so they will still get the dot "." reported there..

The other issue you raised is StringTie gene/loci (MSTRG.#) matching multiple reference gene names (as explained above, you incorrectly suggested "same transcript has multiple gene names" which is not the case, those MSTRG.# IDs are gene IDs not transcript IDs.)

This is actually the other reason why StringTie uses its own locus/gene IDs instead of reference gene IDs and names: because there are complex gene clusters like these where the alignments extend across multiple reference loci/genes, so the resulting StringTie locus(gene) encompasses multiple reference genes.. This could be caused by spurious alignments which are somewhat expected in situations like HLA gene clusters, or even intra-intron miRNA genes (like MIR6891) which are easily found to overlap incompletely spliced (or the retained introns of the) transcripts encompassing them (HLA transcripts in this example), and thus they'll be assigned the same StringTie gene/locus ID (MSTRG.18605 in this example).

And in such "merged" reference loci cases there could be assembled transcripts overlapping reference transcripts belonging to different reference genes..(thus "bridging" those genes). What should StringTie show as the gene name in such cases? Both gene names? So you see, things can get hairy and that's why sticking to its own internally generated locus IDs like MSTRG.18605 is the unambiguous way for StringTie to report the results.

brightbio commented 6 years ago

what if analysing differential gene expression following the protocol "An alternate, faster differential expression analysis workflow can be pursued if there is no interest in novel isoforms (i.e. assembled transcripts present in the samples but missing from the reference annotation), or if only a well known set of transcripts of interest are targeted by the analysis. This simplified protocol has only 3 steps (depicted below) as it bypasses the individual assembly of each RNA-Seq sample and the "transcript merge" step. "? avoid the case that One internally generated locus IDs like MSTRG.18605 with several gene names ? qq 20180308181127

manuelsmendoza commented 6 years ago

I am not sure about that but, I think not... because if you see at the table, the locus MSTRG.18605 has two gene IDs i.e. HLA-B and HLA-C (referenced in the human genome) and the sequences of both HLA are overlapped...

brightbio commented 6 years ago

As is known, StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. The flowchart of the protocol described by Pertea et al., 2016 (doi:10.1038/nprot.2016.095) is as following : qq 20180315134000 whether this flowchart include the optional de novo assembly step ? and leading to the issue that one StringTie gene/loci (MSTRG.#) matching multiple reference gene names ? qq    @@@20180315135358@gpertea

shreygandhi1990 commented 6 years ago

What about the novel transcripts in such cases? Suppose, if a novel transcript is assigned MSTRG.ID loci and that ID has multiple gene names associated with it (just like in the case above). In such a case, how to figure out which gene_name does that transcript belong to?

manuelsmendoza commented 6 years ago

Maybe an option is to extract this transcript and make blast? StringTie should be very strict in those cases because it is not able to distinguish which gene this transcript belongs to... no?

gpertea commented 6 years ago

I agree that the annotation of StringTie's merged output is somewhat lacking at this point.. I am considering improving that in a future version by adding one ore more GTF tags that would actually show which reference gene_id (or gene_name, if the user prefers that) any particular StringTie output transcript might belong to (just by overlap, not by fully matching a reference transcript). Currently the user should use third party programs in order to actually annotate those cryptic MSTRG.ID loci and MSTRG.ID.ID transcripts in some cases. GffCompare can address some of it (with its "annotation mode") but I guess it still does not do it quite conveniently enough for easy inclusion in analysis pipelines. I'll leave this topic open (and of course take suggestions) until I get around implement (or find) a solution to properly address these needs of gene annotation of StringTie's output..

sneha100895 commented 5 years ago

Hi, I am using stringtie to assemble RNA-seq reads and I have the same issue where one MSTRG locus overlaps with multiple reference loci. Could you please let me know what approaches I can take to annotate these MSTRG loci?

I've already tried gffcompare and it din't help

kvittingseerup commented 3 years ago

Hi Everybody.

I have been having the same problem for a while and therefore I tried solving it as described below. There are two overlapping problems at work here:

From my experience with StringTie data there are typically thens of thousands of missing gene_names and ~50% of the missing gene_names are due to these problems (the rest are novel genes).

To solve this problem I implemented a "rescue" algorithm which simply:

  1. Assign gene_names to transcripts. For cases when this involves "merged" genes transcripts are assigned by overlap i genomic coordinates (ignoring introns) as long as the largest overlap is X times larger than the second largest overlap (where X is a user-supplied argument with very sensible defaults estimated from the overlapping data existing in the human Gencode34 annoation).
  2. Split merged genes by their gene_names.

The solution is implemented in my R package IsoformSwitchAnalyzeR (available in >= 1.13.01). You simply use the importRdata() function to import the GTF file into R. During this import it will rescue the isoform annotation (which can be rescued) and clean up the rest of the annotation. From the resulting switchAnalyzeRlist object you can analyse isoform switches with predicted functional consequences with IsoformSwitchAnalyzeR (as exemplified here) or use the extractGeneExpression() function to get a gene expression/count matrix for analysis with other tools. This count/expression matrix will be equivalent to what is obtained by using txmeta or tximeta except it will be corrected for the merged gene problem and fully annotated with gene_names.

Hope this helps.

Cheers

Kristoffer

NB: Please post any questions about (or problems with) this approach on the IsoformSwitchAnalyzeR github or google group as it does not belong here on the StringTie github.