ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

[FEATURE REQUEST] <title> #178

Closed kpenn88 closed 2 years ago

kpenn88 commented 2 years ago

The GBK file that PGAP produces contains a block like the one below. Is there a parser that can handle it and turn it into a table of columns?

        ##Genome-Annotation-Data-START##
        Annotation Provider               :: Organization
        Annotation Date                   :: 12/10/2021 06:35:25
        Annotation Pipeline               :: NCBI Prokaryotic Genome
                                             Annotation Pipeline (PGAP)
        Annotation Method                 :: Best-placed reference protein
                                             set; GeneMarkS-2+
        Annotation Software revision      :: 2021-11-29.build5742
        Features Annotated                :: Gene; CDS; rRNA; tRNA; ncRNA;
                                             repeat_region
        Genes (total)                     :: 2,716
        CDSs (total)                      :: 2,669
        Genes (coding)                    :: 2,596
        CDSs (with protein)               :: 2,596
        Genes (RNA)                       :: 47
        tRNAs                             :: 43
        ncRNAs                            :: 4
        Pseudo Genes (total)              :: 73
        CDSs (without protein)            :: 73
        Pseudo Genes (ambiguous residues) :: 2 of 73
        Pseudo Genes (frameshifted)       :: 45 of 73
        Pseudo Genes (incomplete)         :: 25 of 73
        Pseudo Genes (internal stop)      :: 11 of 73
        Pseudo Genes (multiple problems)  :: 9 of 73
        CRISPR Arrays                     :: 1
        ##Genome-Annotation-Data-END##

thibaudnis commented 2 years ago

Oops, @kpenn88 - we missed your question. I am not aware of any parser for this information.

azat-badretdin commented 2 years ago

Kevin, this section of the GBK format is intended purely for human consumption. For bioinformatics purposes, we provide more machine-readable files (in ASN.1 format).
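
For readers who still want the block as a table, here is a minimal sketch in plain Python. It is not part of PGAP or any NCBI tool. It assumes the layout shown in the question: a `Genome-Annotation-Data-START` marker, `key :: value` lines, indented continuation lines for wrapped values, and a `Genome-Annotation-Data-END` marker. The function names `parse_structured_comment` and `to_tsv` are hypothetical, chosen only for this illustration.

```python
import csv
import io


def parse_structured_comment(text, name="Genome-Annotation-Data"):
    """Return a dict mapping each field name to its value for one block.

    Assumes 'key :: value' lines between '<name>-START' and '<name>-END'
    markers (with or without surrounding '#'), as in the question above.
    """
    fields = {}
    key = None
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        bare = line.strip("#")
        if bare == f"{name}-START":
            inside = True
            continue
        if bare == f"{name}-END":
            break
        if not inside or not line:
            continue
        if "::" in line:
            # A new field: split on the first '::' only.
            key, value = (part.strip() for part in line.split("::", 1))
            fields[key] = value
        elif key is not None:
            # A wrapped value: join it to the previous field.
            fields[key] += " " + line
    return fields


def to_tsv(fields):
    """Render the parsed fields as a two-column, tab-separated table."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["field", "value"])
    writer.writerows(fields.items())
    return buf.getvalue()


example = """\
        ##Genome-Annotation-Data-START##
        Annotation Pipeline               :: NCBI Prokaryotic Genome
                                             Annotation Pipeline (PGAP)
        Genes (total)                     :: 2,716
        Pseudo Genes (frameshifted)       :: 45 of 73
        ##Genome-Annotation-Data-END##
"""

fields = parse_structured_comment(example)
print(to_tsv(fields), end="")
```

Values are kept as strings, so counts such as `2,716` or `45 of 73` would need further cleanup before numeric use. For anything beyond a quick summary, the ASN.1 output mentioned above is likely the more robust source.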