Multiple gene names mapping to a single gene ID

crj32 commented 5 years ago

Hi

I ran the genome-guided assembly on my data, following the exact steps in the Nature Protocols paper, and I often see multiple gene_names per single gene_id in the merged .gtf file. Is this common? It looks incorrect to me, because the tool has effectively merged several genes together to make its own gene.

Thanks,

Chris

chr20 StringTie exon 63734154 63734824 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000484569.1"; exon_number "1"; gene_name "ZGPAT"; ref_gene_id "ENSG00000197114.11";
chr20 StringTie exon 63735159 63735236 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000484569.1"; exon_number "2"; gene_name "ZGPAT"; ref_gene_id "ENSG00000197114.11";
chr20 StringTie transcript 63735463 63738441 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000496820.2"; gene_name "RP4-583P15.15"; ref_gene_id "ENSG00000273154.3";
chr20 StringTie exon 63735463 63735564 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000496820.2"; exon_number "1"; gene_name "RP4-583P15.15"; ref_gene_id "ENSG00000273154.3";
chr20 StringTie exon 63737845 63737902 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000496820.2"; exon_number "2"; gene_name "RP4-583P15.15"; ref_gene_id "ENSG00000273154.3";
chr20 StringTie exon 63737973 63738060 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000496820.2"; exon_number "3"; gene_name "RP4-583P15.15"; ref_gene_id "ENSG00000273154.3";
chr20 StringTie exon 63738183 63738441 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000496820.2"; exon_number "4"; gene_name "RP4-583P15.15"; ref_gene_id "ENSG00000273154.3";
chr20 StringTie transcript 63736283 63738234 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000444951.5"; gene_name "LIME1"; ref_gene_id "ENSG00000203896.9";
chr20 StringTie exon 63736283 63736396 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000444951.5"; exon_number "1"; gene_name "LIME1"; ref_gene_id "ENSG00000203896.9";
chr20 StringTie exon 63737533 63737647 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000444951.5"; exon_number "2"; gene_name "LIME1"; ref_gene_id "ENSG00000203896.9";
chr20 StringTie exon 63737821 63737902 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000444951.5"; exon_number "3"; gene_name "LIME1"; ref_gene_id "ENSG00000203896.9";
chr20 StringTie exon 63737973 63738060 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000444951.5"; exon_number "4"; gene_name "LIME1"; ref_gene_id "ENSG00000203896.9";
chr20 StringTie exon 63738183 63738234 1000 + . gene_id "MSTRG.40027"; transcript_id "ENST00000444951.5"; exon_number "5"; gene_name "LIME1"; ref_gene_id "ENSG00000203896.9";
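
To gauge how widespread this is in a merged file, the records can be grouped by gene_id and the distinct gene_name values collected for each. The following is only a minimal sketch and not part of StringTie: the function names and the input file path are hypothetical, and it assumes attributes are written as `key "value";` pairs as in the excerpt above.

```python
import re
import sys
from collections import defaultdict

# Matches attribute pairs written as: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def gene_names_by_gene_id(gtf_lines):
    """Collect the set of gene_name values seen for each gene_id."""
    names = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # GTF has 9 columns; the attributes are the last one. Splitting on any
        # whitespace (at most 8 times) also tolerates space-separated pastes.
        fields = line.rstrip("\n").split(None, 8)
        if len(fields) < 9:
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" in attrs and "gene_name" in attrs:
            names[attrs["gene_id"]].add(attrs["gene_name"])
    return names


def multi_name_genes(gtf_lines):
    """Return only the gene_ids that carry more than one gene_name."""
    grouped = gene_names_by_gene_id(gtf_lines)
    return {gid: sorted(n) for gid, n in grouped.items() if len(n) > 1}


if __name__ == "__main__":
    # Hypothetical usage: python multi_name_genes.py merged.gtf
    with open(sys.argv[1]) as handle:
        for gene_id, gene_names in sorted(multi_name_genes(handle).items()):
            print(gene_id, ",".join(gene_names), sep="\t")
```

Run on the excerpt above, this would list MSTRG.40027 together with ZGPAT, RP4-583P15.15 and LIME1.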

mrijnkels commented 5 years ago

Hi, we have the same issue. We see it both for genes that are next to each other and for genes that are several hundred kb apart. We would really like to find out how to prevent this, as it makes the merged StringTie file not very useful.

gpertea commented 5 years ago

This is a difficult issue to solve within StringTie, which makes assembly decisions based primarily on the read alignment data. Reference annotation is often imperfect and incomplete, and in order to allow for the discovery of novel isoforms, StringTie always uses the read alignments as the basis of transcript assembly. Unfortunately, read alignments can also be wrong or imperfect and may actually "bridge" neighboring genes, as seems to be the case in the situations you are reporting here.

Using a better or more stringent read alignment strategy may help with this problem. Alternatively, some post-alignment filtering can be applied to the alignment data in order to eliminate large, low-scoring alignments that seem to spuriously "connect" neighboring genes.
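
As a rough illustration of the post-alignment filtering idea, the sketch below reads SAM text and drops records that both span a very large gap (the longest `N` operation in the CIGAR string) and have a low mapping quality. This is an assumption-laden example, not a StringTie feature: the function names and the thresholds `min_mapq=10` and `max_gap=100_000` are hypothetical and would need tuning per dataset, and MAPQ is used here as a stand-in for "low scoring".

```python
import re
import sys

# One CIGAR operation: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def largest_gap(cigar):
    """Length of the longest skipped region (N operation), or 0 if none."""
    gaps = [int(length) for length, op in CIGAR_RE.findall(cigar) if op == "N"]
    return max(gaps, default=0)


def keep_alignment(sam_line, min_mapq=10, max_gap=100_000):
    """Return False only for records that are both long-gapped and low MAPQ.

    Thresholds are illustrative assumptions, not recommended values.
    """
    if sam_line.startswith("@"):
        return True  # header lines always pass through
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return False  # not a valid SAM record
    mapq = int(fields[4])
    return not (largest_gap(fields[5]) > max_gap and mapq < min_mapq)


if __name__ == "__main__":
    for line in sys.stdin:
        if keep_alignment(line):
            sys.stdout.write(line)
```

Assuming the script is saved as `filter_sam.py` (a hypothetical name), it could sit between two samtools calls, for example `samtools view -h in.bam | python filter_sam.py | samtools view -b -o filtered.bam -`. Note that it filters records one at a time, so a mate of a dropped read may remain in the output.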

mrijnkels commented 5 years ago

So, any suggestions on how to set up a better, more stringent alignment strategy?

gpertea / stringtie

Multiple gene names mapping to a single gene ID #217