Open hermidalc opened 2 years ago
And in the same file again here:
AF234533.1 Genbank gene 6816 7453 . + . ID=gene-alpha;Name=alpha;gbkey=Gene;gene=alpha;gene_biotype=protein_coding
AF234533.1 Genbank mRNA 6816 7453 . + . ID=rna-alpha;Parent=gene-alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein
AF234533.1 Genbank exon 6816 7453 . + . ID=exon-alpha-1;Parent=rna-alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein
AF234533.1 Genbank CDS 6824 7090 . + 0 ID=cds-AAG10415.1;Parent=rna-alpha;Dbxref=NCBI_GP:AAG10415.1;Name=AAG10415.1;gbkey=CDS;gene=alpha;product=alpha 1 protein;protein_id=AAG10415.1
AF234533.1 Genbank CDS 7092 7442 . + 0 ID=cds-AAG10416.1;Parent=rna-alpha;Dbxref=NCBI_GP:AAG10416.1;Name=AAG10416.1;gbkey=CDS;gene=alpha;product=alpha 2 protein;protein_id=AAG10416.1
AF234533.1 Genbank CDS 7166 7321 . + 0 ID=cds-AAG10417.1;Parent=rna-alpha;Dbxref=NCBI_GP:AAG10417.1;Name=AAG10417.1;gbkey=CDS;gene=alpha;product=alpha 3 protein;protein_id=AAG10417.1
turns into this, where again a CDS disappears and its attributes overwrite those of another CDS for a different protein (same gene):
AF234533.1 Genbank mRNA 6816 7453 . + . ID=rna-alpha;geneID=gene-alpha;gene_name=alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein;Name=alpha;gene_biotype=protein_coding
AF234533.1 Genbank exon 6816 7453 . + . Parent=rna-alpha;gbkey=mRNA;gene=alpha;product=alpha 1 protein
AF234533.1 Genbank CDS 6824 7090 . + 0 Parent=rna-alpha;Dbxref=NCBI_GP:AAG10415.1;Name=AAG10415.1;gbkey=CDS;gene=alpha;product=alpha 1 protein;protein_id=AAG10415.1
AF234533.1 Genbank CDS 7092 7442 . + 0 Parent=rna-alpha;Dbxref=NCBI_GP:AAG10417.1;Name=AAG10417.1;gbkey=CDS;gene=alpha;product=alpha 3 protein;protein_id=AAG10417.1
Interesting. Thank you for these reports of parsing failures on this viral annotation.
In this particular case I can see that some CDSs are dropped when they are found to be "contained" in another CDS, or to overlap substantially with another CDS from the same transcript (so this is not a programmed ribosomal frameshift, where the segments would be part of the same CDS).
It seems the error stems from the assumption that there can be at most one CDS (one chain of CDS segments) per transcript ID (i.e. one protein per transcript), which is clearly not the case in this annotation. Each CDS segment here, even though parented by the same transcript ID, seems to be a distinct coding sequence and thus leads to a different protein product (!).
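As a rough illustration of that condition, the sketch below scans GFF3 lines and reports every parent ID that has more than one distinct CDS `ID` attached to it. It is not the parser's actual code. It assumes tab-delimited GFF3 input and relies on the GFF3 convention that segments of one CDS share a single `ID`. The names `parse_attrs` and `multi_cds_parents` are made up for this example.

```python
from collections import defaultdict


def parse_attrs(field):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def multi_cds_parents(gff_lines):
    """Return {parent_id: [cds_id, ...]} for parents with more than one distinct CDS ID."""
    cds_ids = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attrs(cols[8])
        cds_id = attrs.get("ID")
        if cds_id is None:
            continue  # a CDS without an ID cannot be told apart here
        # Parent may list several IDs separated by commas
        for parent in attrs.get("Parent", "").split(","):
            if parent and cds_id not in cds_ids[parent]:
                cds_ids[parent].append(cds_id)
    return {p: ids for p, ids in cds_ids.items() if len(ids) > 1}
```

Run against the `rna-alpha` records quoted in this thread (with tab-separated columns), this should flag `rna-alpha` with its three CDS IDs, while an ordinary spliced CDS whose segments share one ID would not be reported.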
Currently the transcript data structure I am using only keeps track of one CDS segment chain per transcript, and changing that would have a large impact on the downstream code that relies on it. Perhaps the easiest, least disruptive workaround would be for the parser to emit separate transcripts in such cases (and thus treat these distinct CDSs as distinct transcript "isoforms" of the same gene). This workaround might also be useful with other bioinformatics software that assumes "one transcript => one protein".
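To make the proposed workaround concrete, here is a minimal sketch of it as a standalone pre-processing step, not as it might be implemented inside the parser. It assumes tab-delimited GFF3, a single `Parent` per feature, and that distinct CDS `ID` values under one transcript mean distinct proteins. The function names and the `.1`, `.2`, ... suffix scheme are invented for this example and could collide with existing IDs in a real file.

```python
def parse_attrs(field):
    """Parse a GFF3 attribute column into an insertion-ordered dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def format_attrs(attrs):
    """Serialize an attribute dict back into 'key=value;key=value' form."""
    return ";".join(f"{k}={v}" for k, v in attrs.items())


def split_multi_cds(gff_lines):
    """Emit one transcript copy per distinct CDS ID (hypothetical workaround)."""
    records = []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # comments, directives and malformed lines are dropped
        records.append((cols[:8], parse_attrs(cols[8])))

    # Distinct CDS IDs per parent transcript, in order of first appearance.
    cds_ids = {}
    for cols, attrs in records:
        if cols[2] == "CDS" and "ID" in attrs and "Parent" in attrs:
            ids = cds_ids.setdefault(attrs["Parent"], [])
            if attrs["ID"] not in ids:
                ids.append(attrs["ID"])
    multi = {p: ids for p, ids in cds_ids.items() if len(ids) > 1}

    out = []

    def emit(cols, attrs, suffix, rename_id, rename_parent):
        new = dict(attrs)
        if rename_id and "ID" in new:
            new["ID"] += suffix
        if rename_parent and "Parent" in new:
            new["Parent"] += suffix
        out.append("\t".join(cols + [format_attrs(new)]))

    for cols, attrs in records:
        fid, parent = attrs.get("ID"), attrs.get("Parent")
        if fid in multi:
            # The transcript itself: one renamed copy per CDS chain.
            for i in range(1, len(multi[fid]) + 1):
                emit(cols, attrs, f".{i}", True, False)
        elif parent in multi and cols[2] == "CDS" and fid in multi[parent]:
            # Each CDS goes only to the copy that matches its own ID.
            i = multi[parent].index(fid) + 1
            emit(cols, attrs, f".{i}", False, True)
        elif parent in multi:
            # Exons and other shared children are duplicated for every copy.
            for i in range(1, len(multi[parent]) + 1):
                emit(cols, attrs, f".{i}", True, True)
        else:
            emit(cols, attrs, "", False, False)
    return out
```

For the `rna-alpha` example, this would yield `rna-alpha.1`, `rna-alpha.2` and `rna-alpha.3`, each with its own copy of the exon and exactly one CDS. The copies are not regrouped by transcript in the output, so a tool that expects children to follow their parent immediately may need the result sorted first.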
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/845/545/GCA_000845545.1_ViralProj14434/GCA_000845545.1_ViralProj14434_genomic.gff.gz
When I run with
-F --keep-exon-attrs
to show what happens, these lines get converted into the output below, where CDS 1692..1838 gets filtered out for some reason, and its attributes overwrite those of CDS 1405..2262, which is for a different protein (same gene):