gpertea / gffread

GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction and more
MIT License
378 stars 39 forks source link

CDS gets filtered out and its attributes overwrite attributes for another CDS in the same gene #112

Open hermidalc opened 2 years ago

hermidalc commented 2 years ago

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/845/545/GCA_000845545.1_ViralProj14434/GCA_000845545.1_ViralProj14434_genomic.gff.gz

When I run with -F --keep-exon-attrs to show what happens, these lines:

AF234533.1  Genbank gene    1405    2262    .   +   .   ID=gene-P;Name=P;gbkey=Gene;gene=P;gene_biotype=protein_coding
AF234533.1  Genbank mRNA    1405    2262    .   +   .   ID=rna-P;Parent=gene-P;gbkey=mRNA;gene=P;product=polymerase-associated protein P
AF234533.1  Genbank exon    1405    2262    .   +   .   ID=exon-P-1;Parent=rna-P;gbkey=mRNA;gene=P;product=polymerase-associated protein P
AF234533.1  Genbank CDS 1415    2251    .   +   0   ID=cds-AAG10410.1;Parent=rna-P;Dbxref=NCBI_GP:AAG10410.1;Name=AAG10410.1;gbkey=CDS;gene=P;product=polymerase-associated protein P;protein_id=AAG10410.1
AF234533.1  Genbank CDS 1692    1838    .   +   0   ID=cds-AAG10411.1;Parent=rna-P;Dbxref=NCBI_GP:AAG10411.1;Name=AAG10411.1;gbkey=CDS;gene=P;product=protein P';protein_id=AAG10411.1

get converted into this below, where CDS 1692..1838 get filtered out for some reason, and its attributes overwrite CDS 1405..2262 which is for a different protein (same gene):

AF234533.1  Genbank mRNA    1405    2262    .   +   .   ID=rna-P;geneID=gene-P;gene_name=P;gbkey=mRNA;gene=P;product=polymerase-associated protein P;Name=P;gene_biotype=protein_coding
AF234533.1  Genbank exon    1405    2262    .   +   .   Parent=rna-P;gbkey=mRNA;gene=P;product=polymerase-associated protein P
AF234533.1  Genbank CDS 1415    2251    .   +   0   Parent=rna-P;Dbxref=NCBI_GP:AAG10411.1;Name=AAG10411.1;gbkey=CDS;gene=P;product=protein P';protein_id=AAG10411.1
hermidalc commented 2 years ago

And in the same file again here:

AF234533.1  Genbank gene    6816    7453    .   +   .   ID=gene-alpha;Name=alpha;gbkey=Gene;gene=alpha;gene_biotype=protein_coding
AF234533.1  Genbank mRNA    6816    7453    .   +   .   ID=rna-alpha;Parent=gene-alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein
AF234533.1  Genbank exon    6816    7453    .   +   .   ID=exon-alpha-1;Parent=rna-alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein
AF234533.1  Genbank CDS 6824    7090    .   +   0   ID=cds-AAG10415.1;Parent=rna-alpha;Dbxref=NCBI_GP:AAG10415.1;Name=AAG10415.1;gbkey=CDS;gene=alpha;product=alpha 1 protein;protein_id=AAG10415.1
AF234533.1  Genbank CDS 7092    7442    .   +   0   ID=cds-AAG10416.1;Parent=rna-alpha;Dbxref=NCBI_GP:AAG10416.1;Name=AAG10416.1;gbkey=CDS;gene=alpha;product=alpha 2 protein;protein_id=AAG10416.1
AF234533.1  Genbank CDS 7166    7321    .   +   0   ID=cds-AAG10417.1;Parent=rna-alpha;Dbxref=NCBI_GP:AAG10417.1;Name=AAG10417.1;gbkey=CDS;gene=alpha;product=alpha 3 protein;protein_id=AAG10417.1

turns into this where again a CDS disappears and its attributes overwrite another CDS for a different protein (same gene):

AF234533.1  Genbank mRNA    6816    7453    .   +   .   ID=rna-alpha;geneID=gene-alpha;gene_name=alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein;Name=alpha;gene_biotype=protein_coding
AF234533.1  Genbank exon    6816    7453    .   +   .   Parent=rna-alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein
AF234533.1  Genbank CDS 6824    7090    .   +   0   Parent=rna-alpha;Dbxref=NCBI_GP:AAG10415.1;Name=AAG10415.1;gbkey=CDS;gene=alpha;product=alpha 1 protein;protein_id=AAG10415.1
AF234533.1  Genbank CDS 7092    7442    .   +   0   Parent=rna-alpha;Dbxref=NCBI_GP:AAG10417.1;Name=AAG10417.1;gbkey=CDS;gene=alpha;product=alpha 3 protein;protein_id=AAG10417.1
gpertea commented 2 years ago

interesting - thank you for these reports related to parsing failures of this viral annotation.

In this particular case I can see that some CDSs are dropped when they are found to be "contained" in another CDS or having large overlaps with another CDS from the same transcript (so it's not a programmed ribosomal shift, which would be part of the same CDS).

It seems the error stems from the assumption that there can be at most one CDS (one chain of CDS segments) per transcript ID (i.e. one protein per transcript), which is clearly not the case in this annotation -- each CDS segment here, even though parented by the same transcript ID, seems to be a distinct coding sequence and thus leading to different protein products (!).

Currently the transcript data structure I am using only keeps track of one CDS segment chain per transcript, and changing that will be quite impactful in other downstream code I am using. Perhaps the easiest/least impactful workaround would be for the parser to emit two transcripts in such cases (and thus treat these distinct CDSs as distinct transcript "isoforms" from the same gene..). This workaround might actually be useful with other bioinformatics software that may have also assumed "one transcript => one protein".