gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

error: overlapping duplicate transcript feature #80

Closed TinaHenrich closed 8 years ago

TinaHenrich commented 8 years ago

Hi,

I was trying to analyse my RNAseq data using the Nature Protocol (Pertea et al 2016).

Now I've come across the following issue when trying to estimate transcript abundance (I'm using stringtie version 1.3.1c)

stringtie -e -B -G merged.gtf -o ballgown/myfile1/myfile1.gtf myfile1.bam

it starts creating the output folders including tmp folders, however, it aborts with the following error message: GFF Error: overlapping duplicate transcript feature (ID=NSGACT0000000582)

Thanks for any hints on how to proceed.

gpertea commented 8 years ago

Could you please post (or e-mail me) the output of these commands:

head -2 merged.gtf
fgrep -w 'NSGACT0000000582' merged.gtf 

The first command should show the stringtie --merge command line (and version) which was used to generate that merged.gtf file. If the duplication happens there (which seems to be the case) I would really like to check some of the input gtf files there (for that locus).. So how many input files (samples) you had for that stringtie --merge command ? Perhaps I can provide instructions to extract all the transcripts for that locus from each of the input files.. (or from a subset of them, if you can reproduce it on a small(er) number of samples).
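
As a rough sketch of what such extraction instructions might look like, the helper below prints every record that overlaps a region from each GTF file listed in a merge list. The function name `extract_locus` and its argument order are made up for this illustration, and it assumes tab-separated GTF files with one file path per line in the list.

```shell
# Hypothetical helper, not part of StringTie.
# usage: extract_locus mergelist.txt SEQNAME START END
# Prints "<file><TAB><original GTF line>" for every record that overlaps
# SEQNAME:START-END in each GTF file named in the merge list.
extract_locus() {
  local list=$1 chrom=$2 start=$3 end=$4 f
  # "|| [ -n "$f" ]" also handles a last line without a trailing newline
  while IFS= read -r f || [ -n "$f" ]; do
    [ -n "$f" ] || continue   # skip empty lines
    awk -F'\t' -v c="$chrom" -v s="$start" -v e="$end" -v f="$f" \
      '$1 == c && $4 <= e && $5 >= s { print f "\t" $0 }' "$f"
  done < "$list"
}

# Example call (coordinates would come from the fgrep output above):
# extract_locus mergelist.txt chrI 8387386 8387447 > locus_records.txt
```

Prefixing each record with its file name makes it easy to see which sample contributed which transcript at the locus.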

TinaHenrich commented 8 years ago

Thank you so much for your quick reply,

here is the output of the first command:

stringtie --merge -G ensGene_revised.gff -o merged.gtf mergelist.txt
StringTie version 1.3.1c

I've attached the output of the second command; it looks like I have somehow produced 350 copies of that transcript.

Is it possible that this issue comes from my converting the original gtf file to a gff file?

Just to make sure there is no mistake there, here are the heads of the gtf and gff files:

head -3 ensGene_revised.gtf
chrI gasAcu1_ensGene stop_codon 8387386 8387388 0 - . gene_id ENSGACT00000012032; transcript_id ENSGACT00000012032;
chrI gasAcu1_ensGene CDS 8387389 8387447 0 - 2 gene_id ENSGACT00000012032; transcript_id ENSGACT00000012032;
chrI gasAcu1_ensGene exon 8387386 8387447 0 - . gene_id ENSGACT00000012032; transcript_id ENSGACT00000012032;

head -3 ensGene_revised.gff
chrI gasAcu1_ensGene stop_codon 8387386 8387388 0 - . ID=NSGACT0000001203;GID=NSGACT0000001203
chrI gasAcu1_ensGene CDS 8387389 8387447 0 - 2 ID=NSGACT0000001203;GID=NSGACT0000001203
chrI gasAcu1_ensGene exon 8387386 8387447 0 - . ID=NSGACT0000001203;GID=NSGACT0000001203

again, thank you!

grep_mergedgtf.txt

gpertea commented 8 years ago

Not sure how you did the conversion (and why you even needed it, the protocol did not require it) but something is really wrong there with the IDs after the conversion -- it looks like the first and last character of the IDs were dropped in the process ?!

ENSGACT00000012032
 NSGACT0000001203

Obviously chopping the last character there would create duplicates.
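
To make the collision concrete, here is a small sketch that mimics the truncation on a list of IDs and reports which truncated IDs occur more than once. The function name `truncated_duplicates` is invented for this example, and the second ID in the usage line is a hypothetical one that differs from the real ID only in its last character.

```shell
# Hypothetical helper: reads one transcript ID per line on stdin,
# drops the first and last character of each (mimicking the broken
# conversion), and prints every truncated ID that occurs more than once.
truncated_duplicates() {
  sed -e 's/^.//' -e 's/.$//' | sort | uniq -d
}

# Usage: the second ID below is invented for illustration only.
printf '%s\n' ENSGACT00000012032 ENSGACT00000012039 | truncated_duplicates
```

Both IDs collapse to NSGACT0000001203, which is the kind of duplicate the error message complains about.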

Moreover, I'm not sure how you got the ensGene_revised.gtf file. It looks like gene_id and transcript_id are identical, but I think they shouldn't be: Ensembl gene_ids would have an ENSGACG prefix, and only transcript_ids have the ENSGACT prefix.
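
A quick way to see how widespread the identical-ID problem is could look like the sketch below. The function name `count_same_gene_and_transcript_id` is hypothetical; the code assumes a tab-separated GTF with attributes in column 9 and attribute values that contain no semicolons, and it accepts both quoted and unquoted values.

```shell
# Hypothetical helper: prints the number of GTF records whose gene_id
# and transcript_id attribute values are identical.
# Reads the files given as arguments, or stdin when none are given.
count_same_gene_and_transcript_id() {
  awk -F'\t' '
    # Return the value of attribute "key" from a GTF attribute string.
    function attr(s, key,    n, parts, i, p, v) {
      n = split(s, parts, ";")
      for (i = 1; i <= n; i++) {
        p = parts[i]
        sub(/^[ \t]+/, "", p)              # trim leading whitespace
        if (index(p, key " ") == 1) {      # attribute name must match exactly
          v = substr(p, length(key) + 2)
          gsub(/"/, "", v)                 # drop optional quotes
          sub(/[ \t]+$/, "", v)            # trim trailing whitespace
          return v
        }
      }
      return ""
    }
    /^#/ || NF < 9 { next }                # skip comments and short lines
    {
      g = attr($9, "gene_id")
      t = attr($9, "transcript_id")
      if (g != "" && g == t) same++
    }
    END { print same + 0 }
  ' "$@"
}

# Example call:
# count_same_gene_and_transcript_id ensGene_revised.gtf
```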

Could you please run the stringtie --merge command with some unaltered (or valid) annotation file and let us know if you still have the 'duplicate transcript' error..

TinaHenrich commented 8 years ago

Ok, then this is where I have to start over and make sure I get a proper gtf file. This is one that was deposited in Dryad and is an updated version matching an updated genome assembly.

I did the conversion because the inverted commas were missing in the gtf in the first place. I will try to run the protocol on the ensembl versions of genome and gtf file and let you know if it works.
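
If missing quotes really are the only problem with a GTF file, one alternative to a format conversion might be to add the quotes in place. The sketch below is only an illustration under that assumption: the function name `quote_gtf_ids` is made up, it touches only the gene_id and transcript_id attributes, and it leaves values that are already quoted unchanged.

```shell
# Hypothetical helper: wrap unquoted gene_id / transcript_id values in
# double quotes. Reads the files given as arguments, or stdin when none
# are given, and writes the result to stdout.
quote_gtf_ids() {
  sed -E 's/(gene_id|transcript_id) ([^";]+);/\1 "\2";/g' "$@"
}

# Example call (writes a new file instead of editing in place):
# quote_gtf_ids ensGene_revised.gtf > ensGene_revised.quoted.gtf
```

Writing to a new file keeps the original annotation untouched, so the result can be compared against it before use.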

Thanks again for your input!

roberthaney commented 8 years ago

Hi,

I seem to be having the same problem. I did 8 separate stringtie runs for separate libraries, using the -G option and a .gff file with a genome annotation (this is the only annotation available). These runs seemed to complete with no problem even though the reference was gff. I attempted no conversion of gff to gtf, as I was under the impression that either gtf or gff could be used as a reference per the manual, and I hadn't run into any problems using the gff reference until after attempting the merge. I had also used the gff in a series of test runs with just the reference transcripts earlier and encountered no issues, including downstream differential expression analysis. But when I merged the .gtf files produced by these runs with the original gff file and began stringtie runs to generate the .ctab output, the first run failed with:

GFF Error: overlapping duplicate transcript feature (ID=aug3.g8318.t2)

Here is the result of head -2 for the merged gtf:

stringtie --merge -p 6 -G /bigdata/garblab/rhaney/Ptep_DE_venomics/dovetail_genome/aug3.1_aa_ExonerateOutput_withID_corrected_final.gff -o /bigdata/garblab/rhaney/Ptep_DE_venomics/dovetail_genome/Ptep_dovetail_aug3.1Exonerate_8libStringtie.gtf /bigdata/garblab/rhaney/Ptep_DE_venomics/dovetail_genome/mergelist.txt

StringTie version 1.2.4

Here is the result of fgrep -w 'aug3.g8318.t2' with the merged gtf file:

Scf5bK3_1217 StringTie transcript 82305 119775 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2";
Scf5bK3_1217 StringTie exon 82305 82437 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "1";
Scf5bK3_1217 StringTie exon 84040 84152 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "2";
Scf5bK3_1217 StringTie exon 89611 89633 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "3";
Scf5bK3_1217 StringTie exon 89909 90114 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "4";
Scf5bK3_1217 StringTie exon 105289 105420 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "5";
Scf5bK3_1217 StringTie exon 108264 108336 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "6";
Scf5bK3_1217 StringTie exon 114375 114494 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "7";
Scf5bK3_1217 StringTie exon 115660 115724 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "8";
Scf5bK3_1217 StringTie exon 119720 119775 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "9";
Scf5bK3_1217 StringTie transcript 82305 119775 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; ref_gene_id "aug3.g8318";
Scf5bK3_1217 StringTie exon 82305 119775 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "1"; ref_gene_id "aug3.g8318";

Thanks for your assistance.

roberthaney commented 8 years ago

Not sure if this helps, but there appear to be 8 of these duplicated transcripts in the merged gtf, and in each case, they seem to have the same intervals but different intron/exon structure. Could these be alternative transcripts in the different libraries that are not being renamed? Only one of the duplicates seems to be getting a ref_gene_id value.
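
For anyone wanting to list such duplicates without searching for each ID by hand, a minimal sketch follows. The function name `duplicated_transcripts` is hypothetical; it assumes a tab-separated GTF in which every transcript has one "transcript" line carrying a quoted transcript_id attribute, as in the StringTie output shown above.

```shell
# Hypothetical helper: prints "count<TAB>transcript_id" for every
# transcript_id that appears on more than one "transcript" line.
# Reads the files given as arguments, or stdin when none are given.
duplicated_transcripts() {
  awk -F'\t' '
    $3 == "transcript" && match($9, /transcript_id "[^"]+"/) {
      # the prefix "transcript_id " plus the opening quote is 15 characters;
      # RLENGTH also counts the closing quote, hence the 16
      id = substr($9, RSTART + 15, RLENGTH - 16)
      seen[id]++
    }
    END { for (id in seen) if (seen[id] > 1) print seen[id] "\t" id }
  ' "$@" | sort -k2,2
}

# Example call:
# duplicated_transcripts merged.gtf
```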

gpertea commented 8 years ago

I would have to ask you to re-run the current/last version of stringtie (1.3.1c) with your alignments, since we really need to be able to reproduce the reported potential bugs on the current version of the code. At least stringtie --merge and all stringtie -e -B commands should be re-run with the current version of StringTie. Meanwhile, I am curious, what does a command like this return:

fgrep -w 'aug3.g8318.t2' aug3.1_aa_ExonerateOutput_withID_corrected_final.gff

I'm curious to see if this reference transcript isn't already duplicated somehow in your "reference annotation". The file name already indicates a somewhat suspicious origin (perhaps multiple Exonerate alignments might have the same ID in there?).

roberthaney commented 8 years ago

There is only a single entry for this transcript in the reference, so it does not seem to be duplicated therein. Below is the output for fgrep -w 'aug3.g8318.t2' aug3.1_aa_ExonerateOutput_withID_corrected_final.gff:

Scf5bK3_1217 protein2genome mRNA 82305 119775 . + . ID=aug3.g8318.t2;Name=aug3.g8318.t2;Parent=aug3.g8318
Scf5bK3_1217 protein2genome exon 82305 82437 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome CDS 82305 82437 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome exon 84040 84152 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome CDS 84040 84152 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome exon 89611 89633 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome CDS 89611 89633 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome exon 89909 90114 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome CDS 89909 90114 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome exon 105289 105420 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome CDS 105289 105420 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome exon 108264 108336 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome CDS 108264 108336 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome exon 114375 114494 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome CDS 114375 114494 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome exon 115660 115724 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome CDS 115660 115724 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome exon 119720 119775 . + . Parent=aug3.g8318.t2
Scf5bK3_1217 protein2genome CDS 119720 119775 . + . Parent=aug3.g8318.t2

On the cluster on which I am running these analyses, we only have stringtie 1.2.4, so I will have to ask the admins nicely to install the newer version, or maybe give the OS X binary a try. Although I am lacking horsepower, perhaps --merge will run OK.

roberthaney commented 8 years ago

I got the Mac binary for 1.3.1 and ran --merge locally. When I use the merged gtf produced by 1.3.1 I get the same error:

GFF Error: overlapping duplicate transcript feature (ID=aug3.g8318.t2)

For the gtf file used in this run, head -n2 returns (with comment symbols removed due to markdown):

stringtie --merge -G aug3.1_aa_ExonerateOutput_withID_corrected_final.gff -o Ptep_dovetail_aug3.1Exonerate_8libStringtie.gtf mergelist.txt
StringTie version 1.3.1c

fgrep -w 'aug3.g8318.t2' for the merged gtf produces the following:

Scf5bK3_1217 StringTie transcript 82305 119775 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2";
Scf5bK3_1217 StringTie exon 82305 82437 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "1";
Scf5bK3_1217 StringTie exon 84040 84152 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "2";
Scf5bK3_1217 StringTie exon 89611 89633 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "3";
Scf5bK3_1217 StringTie exon 89909 90114 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "4";
Scf5bK3_1217 StringTie exon 105289 105420 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "5";
Scf5bK3_1217 StringTie exon 108264 108336 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "6";
Scf5bK3_1217 StringTie exon 114375 114494 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "7";
Scf5bK3_1217 StringTie exon 115660 115724 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "8";
Scf5bK3_1217 StringTie exon 119720 119775 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "9";
Scf5bK3_1217 StringTie transcript 82305 119775 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; ref_gene_id "aug3.g8318";
Scf5bK3_1217 StringTie exon 82305 119775 1000 + . gene_id "MSTRG.4622"; transcript_id "aug3.g8318.t2"; exon_number "1"; ref_gene_id "aug3.g8318";

So it does seem to be producing two entries for this transcript ID.

Any ideas?

gpertea commented 8 years ago

Hm, I can't see why that whole-transcript exon is added there - though the evidence you presented so far seems to show that the --merge process is generating it.. It would be great if you could share with us all those 8 gtf files from your mergelist.txt and the reference annotation file aug3.1_aa_ExonerateOutput_withID_corrected_final.gff so we can reproduce and debug this problem here.

Of course if you could reproduce this duplication error with fewer files (say, by leaving out some of the lines in the mergelist.txt) that would make it quicker, but anyway it would be great if you can share, by any method you can (Google Drive, Dropbox etc.) an archive with a set of files which triggers the bug.

roberthaney commented 8 years ago

OK, I have put a bunch of gtf files produced with different mergelists in a dropbox folder that I am sharing with you. I tried dropping one library at a time (mergelists 2-9 and gtf 2-9) and the single-exon aug3.g8318.t2 still occurred in every merged gtf. I also included all the gtfs produced by stringtie for each of our tissue libraries (their names start with Pt). Looking in these gtfs, I see that every one has only the single-exon transcript, while the reference has the multi-exon one (in file aug3.g8318.t2_transcripts.txt). Since this single-exon transcript was well below FPKM 1 in each library, I tried excluding it by setting -F 1 in the merge. Yet it still appears in the merged gtf (also in the folder).

gpertea commented 8 years ago

Ah, I got it, it's the incorrect gene feature entry there in the reference file (aug3.g8318) at coordinates 18840-71971 which does not overlap the parented transcript aug3.g8318.t2 (coordinates 82305-119775) -- and somehow this minor discrepancy is throwing my GFF parser totally out of whack.. which is a shame, it shouldn't be so frail and it really shouldn't suddenly "see" duplicate transcripts there just because of this.. So this is definitely a bug on my side which I have to fix -- sorry for the inconvenience.. and thank you for providing the [slightly malformed] data which exposed this ridiculous bug!
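
A reference file can be screened for this kind of gene/transcript mismatch before it is passed to -G. The sketch below is one possible check, not the parser fix itself: the function name `check_gene_spans` is invented, and it assumes a tab-separated GFF3 file where genes are "gene" lines with an ID= attribute and transcripts are "mRNA" or "transcript" lines with ID= and a single Parent= value.

```shell
# Hypothetical helper: report every mRNA/transcript whose parent gene is
# missing on the same sequence, lies on the other strand, or does not
# span the transcript coordinates.
# Reads the files given as arguments, or stdin when none are given.
check_gene_spans() {
  awk -F'\t' '
    # Return the value of GFF3 attribute "key" (e.g. ID, Parent).
    function attr(s, key,    n, parts, i) {
      n = split(s, parts, ";")
      for (i = 1; i <= n; i++)
        if (index(parts[i], key "=") == 1)
          return substr(parts[i], length(key) + 2)
      return ""
    }
    /^#/ || NF < 9 { next }
    $3 == "gene" {
      k = attr($9, "ID") SUBSEP $1          # key genes by ID and sequence name
      gstart[k] = $4; gend[k] = $5; gstrand[k] = $7
    }
    $3 == "mRNA" || $3 == "transcript" {
      n++
      tid[n] = attr($9, "ID"); tpar[n] = attr($9, "Parent")
      tseq[n] = $1; tstart[n] = $4; tend[n] = $5; tstrand[n] = $7
    }
    END {
      for (i = 1; i <= n; i++) {
        if (tpar[i] == "") continue         # transcript without a parent
        k = tpar[i] SUBSEP tseq[i]
        if (!(k in gstart))
          print tid[i] "\tno gene line for " tpar[i] " on " tseq[i]
        else if (gstrand[k] != tstrand[i])
          print tid[i] "\tstrand differs from " tpar[i]
        else if (tstart[i] + 0 < gstart[k] + 0 || tend[i] + 0 > gend[k] + 0)
          print tid[i] "\tnot contained in " tpar[i]
      }
    }
  ' "$@"
}

# Example call:
# check_gene_spans aug3.1_aa_ExonerateOutput_withID_corrected_final.gff
```

Keying genes by both ID and sequence name means a gene defined on one contig does not count as the parent of transcripts placed on another.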

roberthaney commented 8 years ago

Thanks! I have fixed the malformed intervals. Just a note that even with the correct intervals I am still getting one of these duplicated transcripts in the merge (GFF Error: overlapping duplicate transcript feature (ID=aug3.g22148.t2)) but this is now the only one.

Entries from the merged file:

Scf5bK3_3345 StringTie transcript 2740 29669 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2"; ref_gene_id "aug3.g22148";
Scf5bK3_3345 StringTie exon 2740 29669 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2"; exon_number "1"; ref_gene_id "aug3.g22148";

Scf5bK3_3345 StringTie transcript 2740 29669 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2";
Scf5bK3_3345 StringTie exon 2740 2887 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2"; exon_number "1";
Scf5bK3_3345 StringTie exon 4887 4932 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2"; exon_number "2";
Scf5bK3_3345 StringTie exon 7678 7757 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2"; exon_number "3";
Scf5bK3_3345 StringTie exon 9858 9922 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2"; exon_number "4";
Scf5bK3_3345 StringTie exon 11728 11799 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2"; exon_number "5";
Scf5bK3_3345 StringTie exon 11877 11900 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2"; exon_number "6";
Scf5bK3_3345 StringTie exon 13187 13267 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2"; exon_number "7";
Scf5bK3_3345 StringTie exon 23151 23201 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2"; exon_number "8";
Scf5bK3_3345 StringTie exon 23315 23428 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2"; exon_number "9";
Scf5bK3_3345 StringTie exon 25668 25757 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2"; exon_number "10";
Scf5bK3_3345 StringTie exon 27620 27694 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2"; exon_number "11";
Scf5bK3_3345 StringTie exon 29437 29517 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2"; exon_number "12";
Scf5bK3_3345 StringTie exon 29631 29669 1000 - . gene_id "MSTRG.27274"; transcript_id "aug3.g22148.t2"; exon_number "13";

Entry from reference with correct interval for gene:

Scf5bK3_21 protein2genome gene 2740 1864379 . + . ID=aug3.g22148;Name=aug3.g22148
Scf5bK3_21 protein2genome mRNA 1848591 1864379 . + . ID=aug3.g22148.t1;Name=aug3.g22148.t1;Parent=aug3.g22148
Scf5bK3_21 protein2genome exon 1848591 1848629 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome CDS 1848591 1848629 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome exon 1848743 1848823 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome CDS 1848743 1848823 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome exon 1853258 1853338 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome CDS 1853258 1853338 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome exon 1856506 1856529 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome CDS 1856506 1856529 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome exon 1856612 1856683 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome CDS 1856612 1856683 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome exon 1858580 1858641 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome CDS 1858580 1858641 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome exon 1860696 1860775 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome CDS 1860696 1860775 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome exon 1861965 1862010 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome CDS 1861965 1862010 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome exon 1864232 1864379 . + . Parent=aug3.g22148.t1
Scf5bK3_21 protein2genome CDS 1864232 1864379 . + . Parent=aug3.g22148.t1
Scf5bK3_3345 protein2genome mRNA 2740 29669 . - . ID=aug3.g22148.t2;Name=aug3.g22148.t2;Parent=aug3.g22148
Scf5bK3_3345 protein2genome exon 29631 29669 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome CDS 29631 29669 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome exon 29437 29517 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome CDS 29437 29517 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome exon 27620 27694 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome CDS 27620 27694 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome exon 25668 25757 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome CDS 25668 25757 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome exon 23315 23428 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome CDS 23315 23428 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome exon 23151 23201 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome CDS 23151 23201 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome exon 13187 13267 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome CDS 13187 13267 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome exon 11877 11900 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome CDS 11877 11900 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome exon 11728 11799 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome CDS 11728 11799 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome exon 9858 9922 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome CDS 9858 9922 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome exon 7678 7757 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome CDS 7678 7757 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome exon 4887 4932 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome CDS 4887 4932 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome exon 2740 2887 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome CDS 2740 2887 . - . Parent=aug3.g22148.t2
Scf5bK3_3345 protein2genome mRNA 2740 29669 . - . ID=aug3.g22148.t3;Name=aug3.g22148.t3;Parent=aug3.g22148
Scf5bK3_3345 protein2genome exon 29631 29669 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome CDS 29631 29669 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome exon 29437 29517 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome CDS 29437 29517 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome exon 27620 27694 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome CDS 27620 27694 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome exon 25668 25757 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome CDS 25668 25757 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome exon 23315 23428 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome CDS 23315 23428 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome exon 23151 23201 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome CDS 23151 23201 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome exon 13187 13267 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome CDS 13187 13267 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome exon 11877 11900 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome CDS 11877 11900 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome exon 11728 11799 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome CDS 11728 11799 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome exon 9858 9919 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome CDS 9858 9919 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome exon 7678 7757 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome CDS 7678 7757 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome exon 4887 4932 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome CDS 4887 4932 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome exon 2740 2887 . - . Parent=aug3.g22148.t3
Scf5bK3_3345 protein2genome CDS 2740 2887 . - . Parent=aug3.g22148.t3

gpertea commented 8 years ago

Actually that's still a structural inconsistency because the parent gene aug3.g22148 is only defined on contig Scf5bK3_21 even though it supposedly also parents aug3.g22148.t2 and aug3.g22148.t3 which are on a different contig (Scf5bK3_3345) and on the reverse strand there.. So again this made the old GFF parser go crazy when it cannot find the parent gene overlapping its transcripts. But this "parser-goes-crazy" bug was fixed in the latest commit, if you rebuild stringtie from the master branch now it should be OK even with the original reference file.

For the unpatched stringtie version (1.3.1c), this error could be avoided by correcting that reference file -- simply insert this "gene" line before its Scf5bK3_3345 transcripts:

Scf5bK3_3345    protein2genome  gene    2740    29669   .   -   .   ID=aug3.g22148;Name=aug3.g22148
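
For those stuck on the unpatched version, the manual edit described above could be scripted roughly as follows. This is only a sketch of that workaround: the function name `add_missing_gene_line` is made up, the gene ID and coordinates are hard-coded from the line quoted above, and it assumes a tab-separated GFF3 file.

```shell
# Hypothetical helper: copy a GFF3 file to stdout, adding the suggested
# "gene" line immediately before the first mRNA of aug3.g22148 that
# lies on Scf5bK3_3345. All other lines are passed through unchanged.
add_missing_gene_line() {
  awk -F'\t' -v OFS='\t' '
    !inserted && $1 == "Scf5bK3_3345" && $3 == "mRNA" &&
    $9 ~ /(^|;)Parent=aug3\.g22148(;|$)/ {
      print "Scf5bK3_3345", "protein2genome", "gene", 2740, 29669, ".", "-", ".", "ID=aug3.g22148;Name=aug3.g22148"
      inserted = 1          # add the gene line only once
    }
    { print }
  ' "$@"
}

# Example call (the output file name is illustrative):
# add_missing_gene_line aug3.1_aa_ExonerateOutput_withID_corrected_final.gff > reference_fixed.gff
```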

roberthaney commented 8 years ago

Sorry I didn't even catch that. Was given this annotation and thought I had corrected the most egregious issues. I guess it was of use in making the merge function of stringtie highly robust!