gpertea / gffread

GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction and more
MIT License
373 stars 39 forks source link

How is mixed-strand trans-splicing handled? #25

Open fruce-ki opened 6 years ago

fruce-ki commented 6 years ago

I've been using an old version of gffread bundled with cufflinks 2.2.1 to create the transcript sequences given a GTF and genome FASTA. Recently I stumbled on trans-splicing from transcripts originating in opposite strands and realised that gffread was creating two separate sequences out of such annotations, causing problems downstream in my analyses. I was wondering if this is something that has been fixed since that old version.

Here are a couple of such annotation examples in GTF:

3R  protein_coding  exon    17202324    17202463    .   -   .    gene_id "FBgn0002781"; transcript_id "FBtr0084077"; exon_number "1"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RR";
3R  protein_coding  exon    17177331    17177608    .   +   .    gene_id "FBgn0002781"; transcript_id "FBtr0084077"; exon_number "2"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RR";
3R  protein_coding  exon    17203010    17203121    .   -   .    gene_id "FBgn0002781"; transcript_id "FBtr0084077"; exon_number "3"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RR";
3R  protein_coding  exon    17202541    17202798    .   -   .    gene_id "FBgn0002781"; transcript_id "FBtr0084077"; exon_number "4"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RR";
3R  protein_coding  start_codon 17202752    17202754    .   -   0    gene_id "FBgn0002781"; transcript_id "FBtr0084077"; exon_number "4"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RR";
3R  protein_coding  exon    17200782    17201634    .   -   .    gene_id "FBgn0002781"; transcript_id "FBtr0084077"; exon_number "5"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RR";

3R  protein_coding  exon    17203010    17203121    .   -   .    gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "1"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA";
3R  protein_coding  exon    17202541    17202798    .   -   .    gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "2"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA";
3R  protein_coding  CDS 17202541    17202754    .   -   0    gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "2"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA"; protein_id "FBpp0083459";
3R  protein_coding  start_codon 17202752    17202754    .   -   0    gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "2"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA";
3R  protein_coding  exon    17202324    17202463    .   -   .    gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "3"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA";
3R  protein_coding  CDS 17202324    17202463    .   -   2    gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "3"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA"; protein_id "FBpp0083459";
3R  protein_coding  exon    17200782    17201634    .   -   .    gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "4"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA";
3R  protein_coding  CDS 17200782    17201634    .   -   0    gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "4"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA"; protein_id "FBpp0083459";
3R  protein_coding  exon    17177761    17178358    .   +   .    gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "5"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA";
3R  protein_coding  CDS 17177761    17178011    .   +   0    gene_id "FBgn0002781"; transcript_id "FBtr0084060"; exon_number "5"; gene_name "mod(mdg4)"; gene_biotype "protein_coding"; transcript_name "mod(mdg4)-RA"; protein_id "FBpp0083459";
gpertea commented 6 years ago

Indeed this hasn't been fixed even in the current version, but it definitely should be addressed. I'll take a look at this soon.

gpertea commented 6 years ago

Extending this all the way to distant and inter-chromosomal trans-splicing would break the locality assumptions (the "locus" concept). Clustering transcripts into loci would become messy without splitting these distantly trans-spliced transcripts, like I've been doing.. I guess the way to make gffread work by default is to preserve trans-splicing for simple operations like conversions, filters and transcript sequence extraction, but still fall back to the old transcript splitting for any clustering and locus-based operations.. (this is more of a commentary/note to self)

fruce-ki commented 6 years ago

The example here was from trans-splicing between sense and antisense transcripts at the same locus. For the record, however, it originates in an old version of the annotation and is not present in newer versions of it, so it was likely an error. This could be more of a theoretical bug than a practical one, but I do not know anything about trans-splicing, I just stumbled on it as a bug in my workflow.

jpetittowpi commented 4 years ago

Has this issue been addressed? I am working with lncRNA that trans-splice, sense and anti-sense, from the same mitochondrial gene.