gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

error: overlapping duplicate transcript feature with Stringtie 1.3.3b #163

Open PatrickKratsch opened 6 years ago

PatrickKratsch commented 6 years ago

Hi there,

I read over the last thread from 2016 that dealt with this issue already, but I am using Stringtie 1.3.3b (which has an updated GFF parser since the bug from back then), and get the following error:

GFF Error: overlapping duplicate transcript feature (ID=FBgn0013687)

My command ($b is a loop variable holding the name of a BAM file):

$STRINGTIE -p $CPUS -G $GTF -o $BASEDIR/assembled/$b.gtf $BASEDIR/BAM/$b

I am not running --merge; I get this error during assembly.

grep 'FBgn0013687' 'Drosophila_melanogaster.BDGP6.91.gtf' returns:

mitochondrion_genome    FlyBase gene    14917   19524   .   +   .   gene_id "FBgn0013687"; gene_name "mt:ori"; gene_source "FlyBase"; gene_biotype "pseudogene";
mitochondrion_genome    FlyBase transcript  14917   19524   .   +   .   gene_id "FBgn0013687"; transcript_id "FBgn0013687"; gene_name "mt:ori"; gene_source "FlyBase"; gene_biotype "pseudogene"; transcript_source "FlyBase"; transcript_biotype "pseudogene";
mitochondrion_genome    FlyBase exon    14917   19524   .   +   .   gene_id "FBgn0013687"; transcript_id "FBgn0013687"; exon_number "1"; gene_name "mt:ori"; gene_source "FlyBase"; gene_biotype "pseudogene"; transcript_source "FlyBase"; transcript_biotype "pseudogene"; exon_id "FBgn0013687-E1";

It seems that gene, exon, and transcript all overlap. Is this a known issue, and do you have an idea about how to solve it?

Thanks a lot in advance.

Best,

Patrick

gpertea commented 6 years ago

Indeed, the problem is that the gene and transcript features have the same ID (gene_id is taken as the ID for the gene feature). Thanks, FlyBase, for generating malformed GTF files once again :). The GFF3 version of this file would have exposed the problem even more glaringly (the same ID cannot be used for different features).
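For anyone who wants to check their own annotation for this pattern before running an affected version, here is a minimal diagnostic sketch in Python. It is not StringTie's parser and makes no claim about how StringTie detects the problem; it only lists IDs that appear both as the gene_id of a "gene" record and as the transcript_id of a "transcript" record. The function names (parse_attributes, find_id_collisions) are made up for illustration, and the sketch assumes a tab-separated GTF with attributes in the usual key "value"; form.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field):
    """Turn the 9th GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(field))


def find_id_collisions(gtf_lines):
    """Return sorted IDs used by both a gene record (as gene_id)
    and a transcript record (as transcript_id)."""
    gene_ids = set()
    transcript_ids = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        feature = cols[2]
        attrs = parse_attributes(cols[8])
        if feature == "gene" and "gene_id" in attrs:
            gene_ids.add(attrs["gene_id"])
        elif feature == "transcript" and "transcript_id" in attrs:
            transcript_ids.add(attrs["transcript_id"])
    return sorted(gene_ids & transcript_ids)


# The gene and transcript records from the report above (attributes shortened).
SAMPLE = [
    "\t".join(["mitochondrion_genome", "FlyBase", "gene", "14917", "19524",
               ".", "+", ".",
               'gene_id "FBgn0013687"; gene_name "mt:ori";']),
    "\t".join(["mitochondrion_genome", "FlyBase", "transcript", "14917", "19524",
               ".", "+", ".",
               'gene_id "FBgn0013687"; transcript_id "FBgn0013687"; gene_name "mt:ori";']),
]

print(find_id_collisions(SAMPLE))
```

To scan a whole file, pass an open file handle instead of the sample list, e.g. `with open("Drosophila_melanogaster.BDGP6.91.gtf") as fh: print(find_id_collisions(fh))`.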

Anyway, I was just adjusting a few more things in and around GFF/GTF parsing again, and I think I fixed this special case too. Please get the pre-release version 1.3.4a (the "a" means "alpha") and give it a whirl (fetch it from: http://ccb.jhu.edu/software/stringtie/#install ). This version should no longer stumble while parsing that GTF, but please let me know if you encounter any other problems.

PatrickKratsch commented 6 years ago

Thanks a lot for your quick reply, I will give 1.3.4alpha a go soon and let you know how it went. :) Again, really appreciate the quick response and help!

PatrickKratsch commented 6 years ago

Works! Using the 1.3.4 version fixed this issue. Thanks!