Closed jan-glx closed 4 years ago
This looks like a bug in the GFF annotations, I don't see how one transcript could have overlapping exons?
More like a bug ("feature") in nature. Note the annotation "ribosomal_slippage" ( the corresponding genbank file has this note: "pp1ab; translated by -1 ribosomal frameshift"). Not sure how often such GFF appear but this one alone might be relevant enough to check if a simple fix is possible.
As a workaround I currently use this script to create a GFF from that .genbank file (one transcript per CDS):
#!/usr/bin/env python3
# based on work by Damien Farrell https://dmnfarrell.github.io/bioinformatics/bcftools-csq-gff-format
import sys
def GFF_bcftools_format(in_handle, out_handle):
"""Convert a bacterial genbank file from NCBI to a GFF3 format that can be used in bcftools csq.
see https://github.com/samtools/bcftools/blob/develop/doc/bcftools.txt#L1066-L1098.
Args:
in_file: genbank file
out_file: name of GFF file
"""
from BCBio import GFF
#in_handle = open(in_file)
#out_handle = open(out_file, "w")
from Bio.SeqFeature import SeqFeature
from Bio.SeqFeature import FeatureLocation
from copy import copy, deepcopy
from Bio import SeqIO
for record in SeqIO.parse(in_handle, "genbank"):
#make a copy of the record as we will be changing it during the loop
new = copy(record)
new.features = []
#loop over all features
for feat in record.features:
q = feat.qualifiers
#remove some unecessary qualifiers
for label in ['note','translation','product','experiment']:
if label in q:
del q[label]
if(feat.type == "CDS"):
#use the CDS feature to create the new lines
tag = q['gene'][0] #q['locus_tag'][0]
protein_id = q['protein_id'][0]
q['ID'] = 'CDS:%s' %protein_id
q['biotype'] = 'protein_coding'
for i, new_loc in enumerate(feat.location.parts if(hasattr(feat.location, "parts")) else (feat.location,)):
new_feat = deepcopy(feat)
tr_id = 'transcript:%s' %(protein_id+"_"+str(i))
new_feat.qualifiers['Parent'] = tr_id
new_feat.location = new_loc
new.features.append(new_feat)
#create mRNA feature
m = SeqFeature(feat.location, type='mRNA',strand=feat.strand)
q2 = m.qualifiers
q2['ID'] = tr_id
q2['Parent'] = 'gene:%s' %tag
q2['biotype'] = 'protein_coding'
new.features.append(m)
elif(feat.type == "gene"):
tag = q['gene'][0]
#edit the gene feature
q=feat.qualifiers
q['ID'] = 'gene:%s' %tag
q['biotype'] = 'protein_coding'
q['Name'] = q['gene']
new.features.append(feat)
#write the new features to a GFF
GFF.write([new], out_handle)
return
if __name__ == "__main__":
GFF_bcftools_format(sys.stdin, sys.stdout)
I don't understand what the GFF tells us about the translation of the exons 266-13468
and 266-13483
. The note suggests a -1 ribosomal slippage but in the GFF the exon is basically duplicated, just elongated by 15bp.
OK yes, bcftools csq
has two separate issues with this file:
It does not currently allow for a single CDS to consist of overlapping parts of the reference. This prevents correct variant effect prediction in the case of negative ribosomal slippage. There is no workaround possible since the predicted aminoacis changes would be incorrect. This is what I opened this issue for. The problem is in these lines (more precisely with 13468
occurring as both CDS end and start):
NC_045512.2 feature mRNA 266 21555 . + . ID=transcript:ORF1ab;Parent=gene:ORF1ab;biotype=protein_coding
NC_045512.2 feature gene 266 21555 . + . ID=gene:ORF1ab;Name=ORF1ab;biotype=protein_coding;db_xref=GeneID:43740578;gene=ORF1ab;locus_tag=GU280_gp01
NC_045512.2 feature CDS 266 13468 . + 0 ID=CDS:YP_009724389.1;Parent=transcript:ORF1ab;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740578;gene=ORF1ab;locus_tag=GU280_gp01;protein_id=YP_009724389.1;ribosomal_slippage=
NC_045512.2 feature CDS 13468 21555 . + 0 ID=CDS:YP_009724389.1;Parent=transcript:ORF1ab;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740578;gene=ORF1ab;locus_tag=GU280_gp01;protein_id=YP_009724389.1;ribosomal_slippage=
It does not allow for multiple CDS on the same transcript (at least not if they overlap). This can also happen e.g. with polycistronic mRNA or with internal ribosome entry sites but is perhaps of minor interest and can easily be worked around by duplicating corresponding transcript.
The official (NCBI) GFF (below) further has the issue that it contains no transcript annotations at all & uses -
as separate within the ID fields, but this, again, can easily worked around and it is arguably outside the scope of this project to parse any GFF file.
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM985889v3
#!genome-build-accession NCBI_Assembly:GCF_009858895.2
##sequence-region NC_045512.2 1 29903
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2697049
NC_045512.2 RefSeq region 1 29903 . + . ID=NC_045512.2:1..29903;Dbxref=taxon:2697049;collection-date=Dec-2019;country=China;gb-acronym=SARS-CoV2;gbkey=Src;genome=genomic;isolate=Wuhan-Hu-1;mol_type=genomic RNA;nat-host=Homo sapiens;old-name=Wuhan seafood market pneumonia virus
NC_045512.2 RefSeq five_prime_UTR 1 265 . + . ID=id-NC_045512.2:1..265;gbkey=5'UTR
NC_045512.2 RefSeq gene 266 21555 . + . ID=gene-GU280_gp01;Dbxref=GeneID:43740578;Name=ORF1ab;gbkey=Gene;gene=ORF1ab;gene_biotype=protein_coding;locus_tag=GU280_gp01
NC_045512.2 RefSeq CDS 266 13468 . + 0 ID=cds-YP_009724389.1;Parent=gene-GU280_gp01;Dbxref=Genbank:YP_009724389.1,GeneID:43740578;Name=YP_009724389.1;Note=pp1ab%3B translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1ab;locus_tag=GU280_gp01;product=ORF1ab polyprotein;protein_id=YP_009724389.1
NC_045512.2 RefSeq CDS 13468 21555 . + 0 ID=cds-YP_009724389.1;Parent=gene-GU280_gp01;Dbxref=Genbank:YP_009724389.1,GeneID:43740578;Name=YP_009724389.1;Note=pp1ab%3B translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1ab;locus_tag=GU280_gp01;product=ORF1ab polyprotein;protein_id=YP_009724389.1
NC_045512.2 RefSeq CDS 266 13483 . + 0 ID=cds-YP_009725295.1;Parent=gene-GU280_gp01;Dbxref=Genbank:YP_009725295.1,GeneID:43740578;Name=YP_009725295.1;Note=pp1a;gbkey=CDS;gene=ORF1ab;locus_tag=GU280_gp01;product=ORF1a polyprotein;protein_id=YP_009725295.1
NC_045512.2 RefSeq stem_loop 13476 13503 . + . ID=id-GU280_gp01;Dbxref=GeneID:43740578;function=Coronavirus frameshifting stimulation element stem-loop 1;gbkey=stem_loop;gene=ORF1ab;inference=COORDINATES: profile:Rfam-release-14.1:RF00507%2CInfernal:1.1.2;locus_tag=GU280_gp01
NC_045512.2 RefSeq stem_loop 13488 13542 . + . ID=id-GU280_gp01-2;Dbxref=GeneID:43740578;function=Coronavirus frameshifting stimulation element stem-loop 2;gbkey=stem_loop;gene=ORF1ab;inference=COORDINATES: profile:profile:Rfam-release-14.1:RF00507%2CInfernal:1.1.2;locus_tag=GU280_gp01
NC_045512.2 RefSeq gene 21563 25384 . + . ID=gene-GU280_gp02;Dbxref=GeneID:43740568;Name=S;gbkey=Gene;gene=S;gene_biotype=protein_coding;gene_synonym=spike glycoprotein;locus_tag=GU280_gp02
NC_045512.2 RefSeq CDS 21563 25384 . + 0 ID=cds-YP_009724390.1;Parent=gene-GU280_gp02;Dbxref=Genbank:YP_009724390.1,GeneID:43740568;Name=YP_009724390.1;Note=structural protein%3B spike protein;gbkey=CDS;gene=S;locus_tag=GU280_gp02;product=surface glycoprotein;protein_id=YP_009724390.1
NC_045512.2 RefSeq gene 25393 26220 . + . ID=gene-GU280_gp03;Dbxref=GeneID:43740569;Name=ORF3a;gbkey=Gene;gene=ORF3a;gene_biotype=protein_coding;locus_tag=GU280_gp03
NC_045512.2 RefSeq CDS 25393 26220 . + 0 ID=cds-YP_009724391.1;Parent=gene-GU280_gp03;Dbxref=Genbank:YP_009724391.1,GeneID:43740569;Name=YP_009724391.1;gbkey=CDS;gene=ORF3a;locus_tag=GU280_gp03;product=ORF3a protein;protein_id=YP_009724391.1
NC_045512.2 RefSeq gene 26245 26472 . + . ID=gene-GU280_gp04;Dbxref=GeneID:43740570;Name=E;gbkey=Gene;gene=E;gene_biotype=protein_coding;locus_tag=GU280_gp04
NC_045512.2 RefSeq CDS 26245 26472 . + 0 ID=cds-YP_009724392.1;Parent=gene-GU280_gp04;Dbxref=Genbank:YP_009724392.1,GeneID:43740570;Name=YP_009724392.1;Note=ORF4%3B structural protein%3B E protein;gbkey=CDS;gene=E;locus_tag=GU280_gp04;product=envelope protein;protein_id=YP_009724392.1
NC_045512.2 RefSeq gene 26523 27191 . + . ID=gene-GU280_gp05;Dbxref=GeneID:43740571;Name=M;gbkey=Gene;gene=M;gene_biotype=protein_coding;locus_tag=GU280_gp05
NC_045512.2 RefSeq CDS 26523 27191 . + 0 ID=cds-YP_009724393.1;Parent=gene-GU280_gp05;Dbxref=Genbank:YP_009724393.1,GeneID:43740571;Name=YP_009724393.1;Note=ORF5%3B structural protein;gbkey=CDS;gene=M;locus_tag=GU280_gp05;product=membrane glycoprotein;protein_id=YP_009724393.1
NC_045512.2 RefSeq gene 27202 27387 . + . ID=gene-GU280_gp06;Dbxref=GeneID:43740572;Name=ORF6;gbkey=Gene;gene=ORF6;gene_biotype=protein_coding;locus_tag=GU280_gp06
NC_045512.2 RefSeq CDS 27202 27387 . + 0 ID=cds-YP_009724394.1;Parent=gene-GU280_gp06;Dbxref=Genbank:YP_009724394.1,GeneID:43740572;Name=YP_009724394.1;gbkey=CDS;gene=ORF6;locus_tag=GU280_gp06;product=ORF6 protein;protein_id=YP_009724394.1
NC_045512.2 RefSeq gene 27394 27759 . + . ID=gene-GU280_gp07;Dbxref=GeneID:43740573;Name=ORF7a;gbkey=Gene;gene=ORF7a;gene_biotype=protein_coding;locus_tag=GU280_gp07
NC_045512.2 RefSeq CDS 27394 27759 . + 0 ID=cds-YP_009724395.1;Parent=gene-GU280_gp07;Dbxref=Genbank:YP_009724395.1,GeneID:43740573;Name=YP_009724395.1;gbkey=CDS;gene=ORF7a;locus_tag=GU280_gp07;product=ORF7a protein;protein_id=YP_009724395.1
NC_045512.2 RefSeq gene 27756 27887 . + . ID=gene-GU280_gp08;Dbxref=GeneID:43740574;Name=ORF7b;gbkey=Gene;gene=ORF7b;gene_biotype=protein_coding;locus_tag=GU280_gp08
NC_045512.2 RefSeq CDS 27756 27887 . + 0 ID=cds-YP_009725318.1;Parent=gene-GU280_gp08;Dbxref=Genbank:YP_009725318.1,GeneID:43740574;Name=YP_009725318.1;gbkey=CDS;gene=ORF7b;locus_tag=GU280_gp08;product=ORF7b;protein_id=YP_009725318.1
NC_045512.2 RefSeq gene 27894 28259 . + . ID=gene-GU280_gp09;Dbxref=GeneID:43740577;Name=ORF8;gbkey=Gene;gene=ORF8;gene_biotype=protein_coding;locus_tag=GU280_gp09
NC_045512.2 RefSeq CDS 27894 28259 . + 0 ID=cds-YP_009724396.1;Parent=gene-GU280_gp09;Dbxref=Genbank:YP_009724396.1,GeneID:43740577;Name=YP_009724396.1;gbkey=CDS;gene=ORF8;locus_tag=GU280_gp09;product=ORF8 protein;protein_id=YP_009724396.1
NC_045512.2 RefSeq gene 28274 29533 . + . ID=gene-GU280_gp10;Dbxref=GeneID:43740575;Name=N;gbkey=Gene;gene=N;gene_biotype=protein_coding;locus_tag=GU280_gp10
NC_045512.2 RefSeq CDS 28274 29533 . + 0 ID=cds-YP_009724397.2;Parent=gene-GU280_gp10;Dbxref=Genbank:YP_009724397.2,GeneID:43740575;Name=YP_009724397.2;Note=ORF9%3B structural protein;gbkey=CDS;gene=N;locus_tag=GU280_gp10;product=nucleocapsid phosphoprotein;protein_id=YP_009724397.2
NC_045512.2 RefSeq gene 29558 29674 . + . ID=gene-GU280_gp11;Dbxref=GeneID:43740576;Name=ORF10;gbkey=Gene;gene=ORF10;gene_biotype=protein_coding;locus_tag=GU280_gp11
NC_045512.2 RefSeq CDS 29558 29674 . + 0 ID=cds-YP_009725255.1;Parent=gene-GU280_gp11;Dbxref=Genbank:YP_009725255.1,GeneID:43740576;Name=YP_009725255.1;gbkey=CDS;gene=ORF10;locus_tag=GU280_gp11;product=ORF10 protein;protein_id=YP_009725255.1
NC_045512.2 RefSeq stem_loop 29609 29644 . + . ID=id-GU280_gp11;Dbxref=GeneID:43740576;function=Coronavirus 3' UTR pseudoknot stem-loop 1;gbkey=stem_loop;gene=ORF10;inference=COORDINATES: profile::Rfam-release-14.1:RF00165%2CInfernal:1.1.2;locus_tag=GU280_gp11
NC_045512.2 RefSeq stem_loop 29629 29657 . + . ID=id-GU280_gp11-2;Dbxref=GeneID:43740576;function=Coronavirus 3' UTR pseudoknot stem-loop 2;gbkey=stem_loop;gene=ORF10;inference=COORDINATES: profile::Rfam-release-14.1:RF00165%2CInfernal:1.1.2;locus_tag=GU280_gp11
NC_045512.2 RefSeq three_prime_UTR 29675 29903 . + . ID=id-NC_045512.2:29675..29903;gbkey=3'UTR
NC_045512.2 RefSeq stem_loop 29728 29768 . + . ID=id-NC_045512.2:29728..29768;Note=basepair exception: alignment to the Rfam model implies coordinates 29740:29758 form a noncanonical C:T basepair%2C but the homologous positions form a highly conserved C:G basepair in other viruses%2C including SARS (NC_004718.3);function=Coronavirus 3' stem-loop II-like motif (s2m);gbkey=stem_loop;inference=COORDINATES: profile:Rfam-release-14.1:RF00164%2CInfernal:1.1.2
###
Hope this clarifies the issue, I should have posted just the relevant lines of the GFF. Do you think that this can easily fixed, e.g. by deescalating the error to a warning?
I will check what can be done about the case of ribosomal slippage. Do you know how common is the first case?
The second case of polycistronic mRNA and internal ribosome entry sites is more complicated because the program operates under the simplifying assumption that one transcript = one protein product. However, as you say, that can be easily worked around by creating a new transcript.
Similar issue was raised twice in past (https://github.com/samtools/bcftools/issues/530#issuecomment-268278248, https://github.com/samtools/bcftools/issues/1078#issuecomment-527484831). To reiterate, I'd be happy to host a gff2gff
script in bcftools/misc
to help with conversion from the various flavors of gff into one that bcftools support.
Great! I have no idea how common it is that a CDS overlaps with itself.
Just like csq
won't accept all GFF flavors, I am afraid a universal gff2gff script would be hard to implement. But a link to #530 or a dedicated wiki page in related error messages might help...
It would have to be built slowly, case by case. Right now we'd have three flavors already. (There are two issues, the github markdown hid the second comment.)
This is now possible, please let me know if you encounter any problems.
bcftools csq (1.10.2) fails with "Error: CDS overlap in the transcript 0: 266-13468 and 266-13483" for below GFF file. Does this mean that overlapping transcripts are not suported? Could this easily be implemented?