[FR] overlapping CDS - Githubissues

jan-glx commented 4 years ago

bcftools csq (1.10.2) fails with "Error: CDS overlap in the transcript 0: 266-13468 and 266-13483" for the GFF file below. Does this mean that overlapping transcripts are not supported? Could this easily be implemented?

##gff-version 3
##sequence-region NC_045512.2 1 29903
NC_045512.2 annotation  remark  1   29903   .   .   .   accessions=NC_045512;comment=REVIEWED REFSEQ: This record has been curated by NCBI staff. The%0Areference sequence is identical to MN908947.%0AOn Jan 17%2C 2020 this sequence version replaced NC_045512.1.%0AAnnotation was added using homology to SARSr-CoV NC_004718.3. %23%23%23%0AFormerly called %27Wuhan seafood market pneumonia virus.%27 If you have%0Aquestions or suggestions%2C please email us at info%40ncbi.nlm.nih.gov%0Aand include the accession number NC_045512.%23%23%23 Protein structures%0Acan be found at%0Ahttps://www.ncbi.nlm.nih.gov/structure/%3Fterm%3Dsars-cov-2.%23%23%23 Find%0Aall other Severe acute respiratory syndrome coronavirus 2%0A%28SARS-CoV-2%29 sequences at%0Ahttps://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/%0ACOMPLETENESS: full length.;data_file_division=VRL;date=30-MAR-2020;keywords=RefSeq;molecule_type=ss-RNA;organism=Severe acute respiratory syndrome coronavirus 2;references=location: %5B13475:13503%5D%0Aauthors: Baranov%2CP.V.%2C Henderson%2CC.M.%2C Anderson%2CC.B.%2C Gesteland%2CR.F.%2C Atkins%2CJ.F. and Howard%2CM.T.%0Atitle: Programmed ribosomal frameshifting in decoding the SARS-CoV genome%0Ajournal: Virology 332 %282%29%2C 498-510 %282005%29%0Amedline id: %0Apubmed id: 15680415%0Acomment:,location: %5B29727:29768%5D%0Aauthors: Robertson%2CM.P.%2C Igel%2CH.%2C Baertsch%2CR.%2C Haussler%2CD.%2C Ares%2CM. Jr. and Scott%2CW.G.%0Atitle: The structure of a rigorously conserved RNA element within the SARS virus genome%0Ajournal: PLoS Biol. 3 %281%29%2C e5 %282005%29%0Amedline id: %0Apubmed id: 15630477%0Acomment:,location: %5B29608:29657%5D%0Aauthors: Williams%2CG.D.%2C Chang%2CR.Y. and Brian%2CD.A.%0Atitle: A phylogenetically conserved hairpin-type 3%27 untranslated region pseudoknot functions in coronavirus RNA replication%0Ajournal: J. Virol. 
73 %2810%29%2C 8349-8355 %281999%29%0Amedline id: %0Apubmed id: 10482585%0Acomment:,location: %5B0:29903%5D%0Aauthors: Wu%2CF.%2C Zhao%2CS.%2C Yu%2CB.%2C Chen%2CY.-M.%2C Wang%2CW.%2C Hu%2CY.%2C Song%2CZ.-G.%2C Tao%2CZ.-W.%2C Tian%2CJ.-H.%2C Pei%2CY.-Y.%2C Yuan%2CM.L.%2C Zhang%2CY.-L.%2C Dai%2CF.-H.%2C Liu%2CY.%2C Wang%2CQ.-M.%2C Zheng%2CJ.-J.%2C Xu%2CL.%2C Holmes%2CE.C. and Zhang%2CY.-Z.%0Atitle: A novel coronavirus associated with a respiratory disease in Wuhan of Hubei province%2C China%0Ajournal: Unpublished%0Amedline id: %0Apubmed id: %0Acomment:,location: %5B0:29903%5D%0Aauthors: %0Aconsrtm: NCBI Genome Project%0Atitle: Direct Submission%0Ajournal: Submitted %2817-JAN-2020%29 National Center for Biotechnology Information%2C NIH%2C Bethesda%2C MD 20894%2C USA%0Amedline id: %0Apubmed id: %0Acomment:,location: %5B0:29903%5D%0Aauthors: Wu%2CF.%2C Zhao%2CS.%2C Yu%2CB.%2C Chen%2CY.-M.%2C Wang%2CW.%2C Hu%2CY.%2C Song%2CZ.-G.%2C Tao%2CZ.-W.%2C Tian%2CJ.-H.%2C Pei%2CY.-Y.%2C Yuan%2CM.L.%2C Zhang%2CY.-L.%2C Dai%2CF.-H.%2C Liu%2CY.%2C Wang%2CQ.-M.%2C Zheng%2CJ.-J.%2C Xu%2CL.%2C Holmes%2CE.C. and Zhang%2CY.-Z.%0Atitle: Direct Submission%0Ajournal: Submitted %2805-JAN-2020%29 Shanghai Public Health Clinical Center %26 School of Public Health%2C Fudan University%2C Shanghai%2C China%0Amedline id: %0Apubmed id: %0Acomment:;sequence_version=2;source=Severe acute respiratory syndrome coronavirus 2 %28SARS-CoV2%29;structured_comment=OrderedDict%28%5B%28%27Assembly-Data%27%2C OrderedDict%28%5B%28%27Assembly Method%27%2C %27Megahit v. V1.1.3%27%29%2C %28%27Sequencing Technology%27%2C %27Illumina%27%29%5D%29%29%5D%29;taxonomy=Viruses,Riboviria,Nidovirales,Cornidovirineae,Coronaviridae,Orthocoronavirinae,Betacoronavirus,Sarbecovirus;topology=linear
NC_045512.2 feature mRNA    266 21555   .   +   .   ID=transcript:ORF1ab;Parent=gene:ORF1ab;biotype=protein_coding
NC_045512.2 feature gene    266 21555   .   +   .   ID=gene:ORF1ab;Name=ORF1ab;biotype=protein_coding;db_xref=GeneID:43740578;gene=ORF1ab;locus_tag=GU280_gp01
NC_045512.2 feature CDS 266 13468   .   +   0   ID=CDS:YP_009724389.1;Parent=transcript:ORF1ab;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740578;gene=ORF1ab;locus_tag=GU280_gp01;protein_id=YP_009724389.1;ribosomal_slippage=
NC_045512.2 feature CDS 13468   21555   .   +   0   ID=CDS:YP_009724389.1;Parent=transcript:ORF1ab;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740578;gene=ORF1ab;locus_tag=GU280_gp01;protein_id=YP_009724389.1;ribosomal_slippage=
NC_045512.2 feature CDS 266 13483   .   +   0   ID=CDS:YP_009725295.1;Parent=transcript:ORF1ab;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740578;gene=ORF1ab;locus_tag=GU280_gp01;protein_id=YP_009725295.1
NC_045512.2 feature mRNA    21563   25384   .   +   .   ID=transcript:S;Parent=gene:S;biotype=protein_coding
NC_045512.2 feature gene    21563   25384   .   +   .   ID=gene:S;Name=S;biotype=protein_coding;db_xref=GeneID:43740568;gene=S;gene_synonym=spike glycoprotein;locus_tag=GU280_gp02
NC_045512.2 feature CDS 21563   25384   .   +   0   ID=CDS:YP_009724390.1;Parent=transcript:S;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740568;gene=S;gene_synonym=spike glycoprotein;locus_tag=GU280_gp02;protein_id=YP_009724390.1
NC_045512.2 feature mRNA    25393   26220   .   +   .   ID=transcript:ORF3a;Parent=gene:ORF3a;biotype=protein_coding
NC_045512.2 feature gene    25393   26220   .   +   .   ID=gene:ORF3a;Name=ORF3a;biotype=protein_coding;db_xref=GeneID:43740569;gene=ORF3a;locus_tag=GU280_gp03
NC_045512.2 feature CDS 25393   26220   .   +   0   ID=CDS:YP_009724391.1;Parent=transcript:ORF3a;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740569;gene=ORF3a;locus_tag=GU280_gp03;protein_id=YP_009724391.1
NC_045512.2 feature mRNA    26245   26472   .   +   .   ID=transcript:E;Parent=gene:E;biotype=protein_coding
NC_045512.2 feature gene    26245   26472   .   +   .   ID=gene:E;Name=E;biotype=protein_coding;db_xref=GeneID:43740570;gene=E;locus_tag=GU280_gp04
NC_045512.2 feature CDS 26245   26472   .   +   0   ID=CDS:YP_009724392.1;Parent=transcript:E;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740570;gene=E;locus_tag=GU280_gp04;protein_id=YP_009724392.1
NC_045512.2 feature mRNA    26523   27191   .   +   .   ID=transcript:M;Parent=gene:M;biotype=protein_coding
NC_045512.2 feature gene    26523   27191   .   +   .   ID=gene:M;Name=M;biotype=protein_coding;db_xref=GeneID:43740571;gene=M;locus_tag=GU280_gp05
NC_045512.2 feature CDS 26523   27191   .   +   0   ID=CDS:YP_009724393.1;Parent=transcript:M;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740571;gene=M;locus_tag=GU280_gp05;protein_id=YP_009724393.1
NC_045512.2 feature mRNA    27202   27387   .   +   .   ID=transcript:ORF6;Parent=gene:ORF6;biotype=protein_coding
NC_045512.2 feature gene    27202   27387   .   +   .   ID=gene:ORF6;Name=ORF6;biotype=protein_coding;db_xref=GeneID:43740572;gene=ORF6;locus_tag=GU280_gp06
NC_045512.2 feature CDS 27202   27387   .   +   0   ID=CDS:YP_009724394.1;Parent=transcript:ORF6;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740572;gene=ORF6;locus_tag=GU280_gp06;protein_id=YP_009724394.1
NC_045512.2 feature mRNA    27394   27759   .   +   .   ID=transcript:ORF7a;Parent=gene:ORF7a;biotype=protein_coding
NC_045512.2 feature gene    27394   27759   .   +   .   ID=gene:ORF7a;Name=ORF7a;biotype=protein_coding;db_xref=GeneID:43740573;gene=ORF7a;locus_tag=GU280_gp07
NC_045512.2 feature CDS 27394   27759   .   +   0   ID=CDS:YP_009724395.1;Parent=transcript:ORF7a;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740573;gene=ORF7a;locus_tag=GU280_gp07;protein_id=YP_009724395.1
NC_045512.2 feature mRNA    27756   27887   .   +   .   ID=transcript:ORF7b;Parent=gene:ORF7b;biotype=protein_coding
NC_045512.2 feature gene    27756   27887   .   +   .   ID=gene:ORF7b;Name=ORF7b;biotype=protein_coding;db_xref=GeneID:43740574;gene=ORF7b;locus_tag=GU280_gp08
NC_045512.2 feature CDS 27756   27887   .   +   0   ID=CDS:YP_009725318.1;Parent=transcript:ORF7b;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740574;gene=ORF7b;locus_tag=GU280_gp08;protein_id=YP_009725318.1
NC_045512.2 feature mRNA    27894   28259   .   +   .   ID=transcript:ORF8;Parent=gene:ORF8;biotype=protein_coding
NC_045512.2 feature gene    27894   28259   .   +   .   ID=gene:ORF8;Name=ORF8;biotype=protein_coding;db_xref=GeneID:43740577;gene=ORF8;locus_tag=GU280_gp09
NC_045512.2 feature CDS 27894   28259   .   +   0   ID=CDS:YP_009724396.1;Parent=transcript:ORF8;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740577;gene=ORF8;locus_tag=GU280_gp09;protein_id=YP_009724396.1
NC_045512.2 feature mRNA    28274   29533   .   +   .   ID=transcript:N;Parent=gene:N;biotype=protein_coding
NC_045512.2 feature gene    28274   29533   .   +   .   ID=gene:N;Name=N;biotype=protein_coding;db_xref=GeneID:43740575;gene=N;locus_tag=GU280_gp10
NC_045512.2 feature CDS 28274   29533   .   +   0   ID=CDS:YP_009724397.2;Parent=transcript:N;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740575;gene=N;locus_tag=GU280_gp10;protein_id=YP_009724397.2
NC_045512.2 feature mRNA    29558   29674   .   +   .   ID=transcript:ORF10;Parent=gene:ORF10;biotype=protein_coding
NC_045512.2 feature gene    29558   29674   .   +   .   ID=gene:ORF10;Name=ORF10;biotype=protein_coding;db_xref=GeneID:43740576;gene=ORF10;locus_tag=GU280_gp11
NC_045512.2 feature CDS 29558   29674   .   +   0   ID=CDS:YP_009725255.1;Parent=transcript:ORF10;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740576;gene=ORF10;locus_tag=GU280_gp11;protein_id=YP_009725255.1

pd3 commented 4 years ago

This looks like a bug in the GFF annotations. I don't see how one transcript could have overlapping exons.

jan-glx commented 4 years ago

More like a bug ("feature") in nature. Note the annotation "ribosomal_slippage" (the corresponding GenBank file has this note: "pp1ab; translated by -1 ribosomal frameshift"). I am not sure how often such GFFs appear, but this one alone might be relevant enough to check whether a simple fix is possible.

As a workaround I currently use this script to create a GFF from that .genbank file (one transcript per CDS):

#!/usr/bin/env python3

# based on work by Damien Farrell https://dmnfarrell.github.io/bioinformatics/bcftools-csq-gff-format
import sys

def GFF_bcftools_format(in_handle, out_handle):
    """Convert a bacterial genbank file from NCBI to a GFF3 format that can be used in bcftools csq.
    see https://github.com/samtools/bcftools/blob/develop/doc/bcftools.txt#L1066-L1098.
    Args:
        in_handle: open handle of the genbank file to read
        out_handle: open handle the GFF is written to
    """

    from BCBio import GFF
    #in_handle = open(in_file)
    #out_handle = open(out_file, "w")
    from Bio.SeqFeature import SeqFeature
    from Bio.SeqFeature import FeatureLocation
    from copy import copy, deepcopy
    from Bio import SeqIO

    for record in SeqIO.parse(in_handle, "genbank"):
        #make a copy of the record as we will be changing it during the loop
        new = copy(record)
        new.features = []
        #loop over all features
        for feat in record.features:
            q = feat.qualifiers
            #remove some unnecessary qualifiers
            for label in ['note','translation','product','experiment']:
                if label in q:
                    del q[label]

            if(feat.type == "CDS"):
                #use the CDS feature to create the new lines
                tag = q['gene'][0] #q['locus_tag'][0]
                protein_id = q['protein_id'][0]
                q['ID'] = 'CDS:%s' %protein_id
                q['biotype'] = 'protein_coding'

                #one transcript per CDS part: a joined location (e.g. ribosomal slippage) is split into its parts
                parts = feat.location.parts if hasattr(feat.location, "parts") else (feat.location,)
                for i, new_loc in enumerate(parts):
                    new_feat = deepcopy(feat)
                    tr_id = 'transcript:%s' %(protein_id+"_"+str(i))
                    new_feat.qualifiers['Parent'] = tr_id
                    new_feat.location = new_loc
                    new.features.append(new_feat)
                    #create mRNA feature; the strand is already carried by the location
                    m = SeqFeature(feat.location, type='mRNA')
                    q2 = m.qualifiers
                    q2['ID'] = tr_id
                    q2['Parent'] = 'gene:%s' %tag
                    q2['biotype'] = 'protein_coding'
                    new.features.append(m)

            elif(feat.type == "gene"):
                tag = q['gene'][0]
                #edit the gene feature
                q=feat.qualifiers
                q['ID'] = 'gene:%s' %tag
                q['biotype'] = 'protein_coding'
                q['Name'] = q['gene']
                new.features.append(feat)
        #write the new features to a GFF
        GFF.write([new], out_handle)
        return  #stops after the first record, which is enough for single-record files

if __name__ == "__main__":
    GFF_bcftools_format(sys.stdin, sys.stdout)

pd3 commented 4 years ago

I don't understand what the GFF tells us about the translation of the exons 266-13468 and 266-13483. The note suggests a -1 ribosomal slippage, but in the GFF the exon is basically duplicated, just elongated by 15 bp.
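
The coordinates can be checked with simple arithmetic. The following is only a sketch using the intervals quoted in this thread (the helper `length` is made up for illustration). Both CDS records have a length that is a multiple of three, but for the two-part record this holds only when position 13468 is counted in both parts, which is how the -1 frameshift is encoded.

```python
def length(start, end):
    """Length of a 1-based, inclusive GFF interval."""
    return end - start + 1

pp1a = (266, 13483)                       # CDS:YP_009725295.1
pp1ab = [(266, 13468), (13468, 21555)]    # CDS:YP_009724389.1, two parts

pp1a_len = length(*pp1a)
pp1ab_len = sum(length(s, e) for s, e in pp1ab)

print(pp1a[1] - pp1ab[0][1])       # 15: how much longer the pp1a record is than the first pp1ab part
print(pp1a_len, pp1a_len % 3)      # 13218 0
print(pp1ab_len, pp1ab_len % 3)    # 21291 0
print((pp1ab_len - 1) % 3)         # 2: counting 13468 only once breaks the reading frame
```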

jan-glx commented 4 years ago

OK yes, bcftools csq has two separate issues with this file:

First, it does not currently allow a single CDS to consist of overlapping parts of the reference. This prevents correct variant effect prediction in the case of negative ribosomal slippage. No workaround is possible, since the predicted amino acid changes would be incorrect. This is what I opened this issue for. The problem is in these lines (more precisely, with 13468 occurring as both a CDS end and a CDS start):

NC_045512.2 feature mRNA    266 21555   .   +   .   ID=transcript:ORF1ab;Parent=gene:ORF1ab;biotype=protein_coding
NC_045512.2 feature gene    266 21555   .   +   .   ID=gene:ORF1ab;Name=ORF1ab;biotype=protein_coding;db_xref=GeneID:43740578;gene=ORF1ab;locus_tag=GU280_gp01
NC_045512.2 feature CDS 266 13468   .   +   0   ID=CDS:YP_009724389.1;Parent=transcript:ORF1ab;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740578;gene=ORF1ab;locus_tag=GU280_gp01;protein_id=YP_009724389.1;ribosomal_slippage=
NC_045512.2 feature CDS 13468   21555   .   +   0   ID=CDS:YP_009724389.1;Parent=transcript:ORF1ab;biotype=protein_coding;codon_start=1;db_xref=GeneID:43740578;gene=ORF1ab;locus_tag=GU280_gp01;protein_id=YP_009724389.1;ribosomal_slippage=

Second, it does not allow for multiple CDS on the same transcript (at least not if they overlap). This can also happen, e.g., with polycistronic mRNA or with internal ribosome entry sites, but it is perhaps of minor interest and can easily be worked around by duplicating the corresponding transcript.
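
The duplication workaround for multiple CDS on one transcript could look roughly like the sketch below. It assumes tab-separated GFF3 rows in the layout of the converted file above (`ID=transcript:...` on mRNA rows, `Parent=transcript:...` on CDS rows). The function name `split_transcripts_by_cds` and the `<transcript>.<protein id>` naming scheme are made up for illustration and are not part of bcftools. Comment lines are dropped, and the self-overlap within a single CDS is left untouched.

```python
from collections import OrderedDict

def parse_attrs(col9):
    """Parse 'key=value;key=value' into an ordered dict."""
    attrs = OrderedDict()
    for item in col9.strip().split(";"):
        if item:
            key, _, value = item.partition("=")
            attrs[key] = value
    return attrs

def format_attrs(attrs):
    return ";".join("%s=%s" % kv for kv in attrs.items())

def split_transcripts_by_cds(lines):
    """Give every distinct CDS ID its own copy of the parent mRNA row."""
    rows = [l.rstrip("\n").split("\t") for l in lines
            if l.strip() and not l.startswith("#")]
    # distinct CDS IDs per transcript, in order of appearance
    cds_ids = OrderedDict()
    for r in rows:
        if r[2] == "CDS":
            a = parse_attrs(r[8])
            ids = cds_ids.setdefault(a["Parent"], [])
            if a["ID"] not in ids:
                ids.append(a["ID"])
    out = []
    for r in rows:
        a = parse_attrs(r[8])
        if r[2] == "mRNA" and len(cds_ids.get(a["ID"], [])) > 1:
            # one mRNA copy per CDS ID, each with a unique transcript ID
            for cid in cds_ids[a["ID"]]:
                b = OrderedDict(a)
                b["ID"] = "%s.%s" % (a["ID"], cid.split(":")[-1])
                out.append("\t".join(r[:8] + [format_attrs(b)]))
        elif r[2] == "CDS" and len(cds_ids[a["Parent"]]) > 1:
            # repoint the CDS row to its own transcript copy
            a["Parent"] = "%s.%s" % (a["Parent"], a["ID"].split(":")[-1])
            out.append("\t".join(r[:8] + [format_attrs(a)]))
        else:
            out.append("\t".join(r))
    return out

demo = [
    "NC_045512.2\tfeature\tmRNA\t266\t21555\t.\t+\t.\tID=transcript:ORF1ab;Parent=gene:ORF1ab;biotype=protein_coding",
    "NC_045512.2\tfeature\tCDS\t266\t13468\t.\t+\t0\tID=CDS:YP_009724389.1;Parent=transcript:ORF1ab",
    "NC_045512.2\tfeature\tCDS\t13468\t21555\t.\t+\t0\tID=CDS:YP_009724389.1;Parent=transcript:ORF1ab",
    "NC_045512.2\tfeature\tCDS\t266\t13483\t.\t+\t0\tID=CDS:YP_009725295.1;Parent=transcript:ORF1ab",
]
for line in split_transcripts_by_cds(demo):
    print(line)
```

With the four demo rows, this emits two mRNA rows (one per protein ID) followed by the three CDS rows, each pointing at its own transcript copy.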

The official (NCBI) GFF (below) has the further issues that it contains no transcript annotations at all and uses - as the separator within the ID fields, but this, again, can easily be worked around, and it is arguably outside the scope of this project to parse any GFF file.

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM985889v3
#!genome-build-accession NCBI_Assembly:GCF_009858895.2
##sequence-region NC_045512.2 1 29903
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2697049
NC_045512.2 RefSeq  region  1   29903   .   +   .   ID=NC_045512.2:1..29903;Dbxref=taxon:2697049;collection-date=Dec-2019;country=China;gb-acronym=SARS-CoV2;gbkey=Src;genome=genomic;isolate=Wuhan-Hu-1;mol_type=genomic RNA;nat-host=Homo sapiens;old-name=Wuhan seafood market pneumonia virus
NC_045512.2 RefSeq  five_prime_UTR  1   265 .   +   .   ID=id-NC_045512.2:1..265;gbkey=5'UTR
NC_045512.2 RefSeq  gene    266 21555   .   +   .   ID=gene-GU280_gp01;Dbxref=GeneID:43740578;Name=ORF1ab;gbkey=Gene;gene=ORF1ab;gene_biotype=protein_coding;locus_tag=GU280_gp01
NC_045512.2 RefSeq  CDS 266 13468   .   +   0   ID=cds-YP_009724389.1;Parent=gene-GU280_gp01;Dbxref=Genbank:YP_009724389.1,GeneID:43740578;Name=YP_009724389.1;Note=pp1ab%3B translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1ab;locus_tag=GU280_gp01;product=ORF1ab polyprotein;protein_id=YP_009724389.1
NC_045512.2 RefSeq  CDS 13468   21555   .   +   0   ID=cds-YP_009724389.1;Parent=gene-GU280_gp01;Dbxref=Genbank:YP_009724389.1,GeneID:43740578;Name=YP_009724389.1;Note=pp1ab%3B translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1ab;locus_tag=GU280_gp01;product=ORF1ab polyprotein;protein_id=YP_009724389.1
NC_045512.2 RefSeq  CDS 266 13483   .   +   0   ID=cds-YP_009725295.1;Parent=gene-GU280_gp01;Dbxref=Genbank:YP_009725295.1,GeneID:43740578;Name=YP_009725295.1;Note=pp1a;gbkey=CDS;gene=ORF1ab;locus_tag=GU280_gp01;product=ORF1a polyprotein;protein_id=YP_009725295.1
NC_045512.2 RefSeq  stem_loop   13476   13503   .   +   .   ID=id-GU280_gp01;Dbxref=GeneID:43740578;function=Coronavirus frameshifting stimulation element stem-loop 1;gbkey=stem_loop;gene=ORF1ab;inference=COORDINATES: profile:Rfam-release-14.1:RF00507%2CInfernal:1.1.2;locus_tag=GU280_gp01
NC_045512.2 RefSeq  stem_loop   13488   13542   .   +   .   ID=id-GU280_gp01-2;Dbxref=GeneID:43740578;function=Coronavirus frameshifting stimulation element stem-loop 2;gbkey=stem_loop;gene=ORF1ab;inference=COORDINATES: profile:profile:Rfam-release-14.1:RF00507%2CInfernal:1.1.2;locus_tag=GU280_gp01
NC_045512.2 RefSeq  gene    21563   25384   .   +   .   ID=gene-GU280_gp02;Dbxref=GeneID:43740568;Name=S;gbkey=Gene;gene=S;gene_biotype=protein_coding;gene_synonym=spike glycoprotein;locus_tag=GU280_gp02
NC_045512.2 RefSeq  CDS 21563   25384   .   +   0   ID=cds-YP_009724390.1;Parent=gene-GU280_gp02;Dbxref=Genbank:YP_009724390.1,GeneID:43740568;Name=YP_009724390.1;Note=structural protein%3B spike protein;gbkey=CDS;gene=S;locus_tag=GU280_gp02;product=surface glycoprotein;protein_id=YP_009724390.1
NC_045512.2 RefSeq  gene    25393   26220   .   +   .   ID=gene-GU280_gp03;Dbxref=GeneID:43740569;Name=ORF3a;gbkey=Gene;gene=ORF3a;gene_biotype=protein_coding;locus_tag=GU280_gp03
NC_045512.2 RefSeq  CDS 25393   26220   .   +   0   ID=cds-YP_009724391.1;Parent=gene-GU280_gp03;Dbxref=Genbank:YP_009724391.1,GeneID:43740569;Name=YP_009724391.1;gbkey=CDS;gene=ORF3a;locus_tag=GU280_gp03;product=ORF3a protein;protein_id=YP_009724391.1
NC_045512.2 RefSeq  gene    26245   26472   .   +   .   ID=gene-GU280_gp04;Dbxref=GeneID:43740570;Name=E;gbkey=Gene;gene=E;gene_biotype=protein_coding;locus_tag=GU280_gp04
NC_045512.2 RefSeq  CDS 26245   26472   .   +   0   ID=cds-YP_009724392.1;Parent=gene-GU280_gp04;Dbxref=Genbank:YP_009724392.1,GeneID:43740570;Name=YP_009724392.1;Note=ORF4%3B structural protein%3B E protein;gbkey=CDS;gene=E;locus_tag=GU280_gp04;product=envelope protein;protein_id=YP_009724392.1
NC_045512.2 RefSeq  gene    26523   27191   .   +   .   ID=gene-GU280_gp05;Dbxref=GeneID:43740571;Name=M;gbkey=Gene;gene=M;gene_biotype=protein_coding;locus_tag=GU280_gp05
NC_045512.2 RefSeq  CDS 26523   27191   .   +   0   ID=cds-YP_009724393.1;Parent=gene-GU280_gp05;Dbxref=Genbank:YP_009724393.1,GeneID:43740571;Name=YP_009724393.1;Note=ORF5%3B structural protein;gbkey=CDS;gene=M;locus_tag=GU280_gp05;product=membrane glycoprotein;protein_id=YP_009724393.1
NC_045512.2 RefSeq  gene    27202   27387   .   +   .   ID=gene-GU280_gp06;Dbxref=GeneID:43740572;Name=ORF6;gbkey=Gene;gene=ORF6;gene_biotype=protein_coding;locus_tag=GU280_gp06
NC_045512.2 RefSeq  CDS 27202   27387   .   +   0   ID=cds-YP_009724394.1;Parent=gene-GU280_gp06;Dbxref=Genbank:YP_009724394.1,GeneID:43740572;Name=YP_009724394.1;gbkey=CDS;gene=ORF6;locus_tag=GU280_gp06;product=ORF6 protein;protein_id=YP_009724394.1
NC_045512.2 RefSeq  gene    27394   27759   .   +   .   ID=gene-GU280_gp07;Dbxref=GeneID:43740573;Name=ORF7a;gbkey=Gene;gene=ORF7a;gene_biotype=protein_coding;locus_tag=GU280_gp07
NC_045512.2 RefSeq  CDS 27394   27759   .   +   0   ID=cds-YP_009724395.1;Parent=gene-GU280_gp07;Dbxref=Genbank:YP_009724395.1,GeneID:43740573;Name=YP_009724395.1;gbkey=CDS;gene=ORF7a;locus_tag=GU280_gp07;product=ORF7a protein;protein_id=YP_009724395.1
NC_045512.2 RefSeq  gene    27756   27887   .   +   .   ID=gene-GU280_gp08;Dbxref=GeneID:43740574;Name=ORF7b;gbkey=Gene;gene=ORF7b;gene_biotype=protein_coding;locus_tag=GU280_gp08
NC_045512.2 RefSeq  CDS 27756   27887   .   +   0   ID=cds-YP_009725318.1;Parent=gene-GU280_gp08;Dbxref=Genbank:YP_009725318.1,GeneID:43740574;Name=YP_009725318.1;gbkey=CDS;gene=ORF7b;locus_tag=GU280_gp08;product=ORF7b;protein_id=YP_009725318.1
NC_045512.2 RefSeq  gene    27894   28259   .   +   .   ID=gene-GU280_gp09;Dbxref=GeneID:43740577;Name=ORF8;gbkey=Gene;gene=ORF8;gene_biotype=protein_coding;locus_tag=GU280_gp09
NC_045512.2 RefSeq  CDS 27894   28259   .   +   0   ID=cds-YP_009724396.1;Parent=gene-GU280_gp09;Dbxref=Genbank:YP_009724396.1,GeneID:43740577;Name=YP_009724396.1;gbkey=CDS;gene=ORF8;locus_tag=GU280_gp09;product=ORF8 protein;protein_id=YP_009724396.1
NC_045512.2 RefSeq  gene    28274   29533   .   +   .   ID=gene-GU280_gp10;Dbxref=GeneID:43740575;Name=N;gbkey=Gene;gene=N;gene_biotype=protein_coding;locus_tag=GU280_gp10
NC_045512.2 RefSeq  CDS 28274   29533   .   +   0   ID=cds-YP_009724397.2;Parent=gene-GU280_gp10;Dbxref=Genbank:YP_009724397.2,GeneID:43740575;Name=YP_009724397.2;Note=ORF9%3B structural protein;gbkey=CDS;gene=N;locus_tag=GU280_gp10;product=nucleocapsid phosphoprotein;protein_id=YP_009724397.2
NC_045512.2 RefSeq  gene    29558   29674   .   +   .   ID=gene-GU280_gp11;Dbxref=GeneID:43740576;Name=ORF10;gbkey=Gene;gene=ORF10;gene_biotype=protein_coding;locus_tag=GU280_gp11
NC_045512.2 RefSeq  CDS 29558   29674   .   +   0   ID=cds-YP_009725255.1;Parent=gene-GU280_gp11;Dbxref=Genbank:YP_009725255.1,GeneID:43740576;Name=YP_009725255.1;gbkey=CDS;gene=ORF10;locus_tag=GU280_gp11;product=ORF10 protein;protein_id=YP_009725255.1
NC_045512.2 RefSeq  stem_loop   29609   29644   .   +   .   ID=id-GU280_gp11;Dbxref=GeneID:43740576;function=Coronavirus 3' UTR pseudoknot stem-loop 1;gbkey=stem_loop;gene=ORF10;inference=COORDINATES: profile::Rfam-release-14.1:RF00165%2CInfernal:1.1.2;locus_tag=GU280_gp11
NC_045512.2 RefSeq  stem_loop   29629   29657   .   +   .   ID=id-GU280_gp11-2;Dbxref=GeneID:43740576;function=Coronavirus 3' UTR pseudoknot stem-loop 2;gbkey=stem_loop;gene=ORF10;inference=COORDINATES: profile::Rfam-release-14.1:RF00165%2CInfernal:1.1.2;locus_tag=GU280_gp11
NC_045512.2 RefSeq  three_prime_UTR 29675   29903   .   +   .   ID=id-NC_045512.2:29675..29903;gbkey=3'UTR
NC_045512.2 RefSeq  stem_loop   29728   29768   .   +   .   ID=id-NC_045512.2:29728..29768;Note=basepair exception: alignment to the Rfam model implies coordinates 29740:29758 form a noncanonical C:T basepair%2C but the homologous positions form a highly conserved C:G basepair in other viruses%2C including SARS (NC_004718.3);function=Coronavirus 3' stem-loop II-like motif (s2m);gbkey=stem_loop;inference=COORDINATES: profile:Rfam-release-14.1:RF00164%2CInfernal:1.1.2
###
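
The two points raised about the official NCBI file (no transcript rows, and `-` rather than `:` after the ID prefix) can be checked mechanically before handing a file to bcftools csq. The sketch below assumes tab-separated GFF3 input. `describe_gff` is a hypothetical helper, not an existing tool, and it only reports what it finds rather than converting anything.

```python
from collections import Counter

def describe_gff(lines):
    """Return (feature type counts, ID prefix counts) for tab-separated GFF3 lines."""
    types = Counter()
    prefixes = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        types[cols[2]] += 1
        for item in cols[8].split(";"):
            key, _, value = item.partition("=")
            if key == "ID":
                # prefix = leading letters of the ID plus the character that follows them
                i = 0
                while i < len(value) and value[i].isalpha():
                    i += 1
                prefixes[value[:i + 1]] += 1
    return types, prefixes

ncbi = [
    "NC_045512.2\tRefSeq\tgene\t266\t21555\t.\t+\t.\tID=gene-GU280_gp01;Name=ORF1ab",
    "NC_045512.2\tRefSeq\tCDS\t266\t13468\t.\t+\t0\tID=cds-YP_009724389.1;Parent=gene-GU280_gp01",
]
types, prefixes = describe_gff(ncbi)
print("mRNA rows:", types["mRNA"])
print("ID prefixes:", dict(prefixes))
```

For the two sample rows, the report shows zero mRNA rows and the prefixes `gene-` and `cds-`, in contrast to the `gene:`, `transcript:` and `CDS:` prefixes used in the converted file earlier in this thread.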

Hope this clarifies the issue; I should have posted just the relevant lines of the GFF. Do you think that this can easily be fixed, e.g. by downgrading the error to a warning?

pd3 commented 4 years ago

I will check what can be done about the case of ribosomal slippage. Do you know how common the first case is?

The second case of polycistronic mRNA and internal ribosome entry sites is more complicated because the program operates under the simplifying assumption that one transcript = one protein product. However, as you say, that can be easily worked around by creating a new transcript.

A similar issue was raised twice in the past (https://github.com/samtools/bcftools/issues/530#issuecomment-268278248, https://github.com/samtools/bcftools/issues/1078#issuecomment-527484831). To reiterate, I'd be happy to host a gff2gff script in bcftools/misc to help with conversion from the various flavors of GFF into one that bcftools supports.

jan-glx commented 4 years ago

Great! I have no idea how common it is that a CDS overlaps with itself. Just like csq won't accept all GFF flavors, I am afraid a universal gff2gff script would be hard to implement. But a link to #530 or a dedicated wiki page in related error messages might help...

pd3 commented 4 years ago

It would have to be built slowly, case by case. Right now we'd have three flavors already. (There are two issues, the github markdown hid the second comment.)

pd3 commented 4 years ago

This is now possible, please let me know if you encounter any problems.

samtools / bcftools

[FR] overlapping CDS #1208