oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

Is there a way to deactivate the overlap detection so bakta does not filter my input proteins? #295

Open Daniel-Tichy opened 1 week ago

Daniel-Tichy commented 1 week ago

I would like to deactivate the overlap detection so that Bakta does not filter out the previously predicted proteins I am using as input.

Example: this is my input gbk for bakta.

 CDS             3417..3809
                 /ID="WARQSXNU_CDS_9"
                 /phrog="786"
                 /top_hit="p65745 VI_07030"
                 /locus_tag="WARQSXNU_9"
                 /function="unknown function"
                 /product="hypothetical protein"
                 /source="PHANOTATE"
                 /score="-22.41946661155013"
                 /phase="0"
                 /translation="MAAPTPEELVSQMASRGMTITTTDASGILCLVASISECLELNYPN
                 DECRQNAIMLWASILISANTAGRYVTSQSAPSGASQSFAYGSKPWVALYNQMKLLDSAG
                 CTGDLVEDPDGSGKPWFAVVRGSKCK"
 CDS             3806..4147
                 /ID="WARQSXNU_CDS_10"
                 /phrog="797"
                 /top_hit="p299466 VI_10274"
                 /locus_tag="WARQSXNU_10"
                 /function="unknown function"
                 /product="hypothetical protein"
                 /source="PHANOTATE"
                 /score="-111.69024253224252"
                 /phase="0"
                 /translation="MTSLARFSYTQPCTIWHKSGTDKYGKPTFDAPVSIMCDYGFNDDV
                 STDAKGNEIVQKNTFWTEYTGAKVGDYIMIGTMIEADPLVAGANQILNVINYGNTFQRS
                 EPPDFALVT"
 CDS             4147..4545
                 /ID="WARQSXNU_CDS_11"
                 /phrog="No_PHROG"
                 /top_hit="No_PHROG"
                 /locus_tag="WARQSXNU_11"
                 /function="unknown function"
                 /product="hypothetical protein"
                 /source="PHANOTATE"
                 /score="-52.42604964159676"
                 /phase="0"
                 /translation="MPAKLRGVRKAVERTSQIVDEIIATKAVRALKSATYIIRTESATL
                 TPIDTSTLINSQFDTVEVSGTRITGKVGYSAKYALYVHNASGKLAGKPRSNGNGTYWSP
                 GGEPQFLTKAAQRTKDLVDGVIKKEMKL"

I parse it and pass the proteins to Bakta in the following format:

WARQSXNU_9 ~hypothetical protein~
MAAPTPEELVSQMASRGMTITTTDASGILCLVASISECLELNYPNDECRQNAIMLWASIL
ISANTAGRYVTSQSAPSGASQSFAYGSKPWVALYNQMKLLDSAGCTGDLVEDPDGSGKPW
FAVVRGSKCK
WARQSXNU_10 ~hypothetical protein~
MTSLARFSYTQPCTIWHKSGTDKYGKPTFDAPVSIMCDYGFNDDVSTDAKGNEIVQKNTF
WTEYTGAKVGDYIMIGTMIEADPLVAGANQILNVINYGNTFQRSEPPDFALVT
WARQSXNU_11 ~hypothetical protein~
MPAKLRGVRKAVERTSQIVDEIIATKAVRALKSATYIIRTESATLTPIDTSTLINSQFDT
VEVSGTRITGKVGYSAKYALYVHNASGKLAGKPRSNGNGTYWSPGGEPQFLTKAAQRTKD
LVDGVIKKEMKL
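For readers who want to reproduce the parsing step, here is a minimal sketch of one way to turn CDS features like the ones above into a protein FASTA file. It is not the script used in this issue. The function name `cds_to_fasta` and the `sep` parameter are hypothetical, and the `~~~` header separator is an assumption about the format Bakta expects for user-provided proteins, so check it against the Bakta documentation for your version.

```python
import re

# One CDS feature copied from the input shown above.
SAMPLE = '''
 CDS             3806..4147
                 /ID="WARQSXNU_CDS_10"
                 /locus_tag="WARQSXNU_10"
                 /product="hypothetical protein"
                 /source="PHANOTATE"
                 /translation="MTSLARFSYTQPCTIWHKSGTDKYGKPTFDAPVSIMCDYGFNDDV
                 STDAKGNEIVQKNTFWTEYTGAKVGDYIMIGTMIEADPLVAGANQILNVINYGNTFQRS
                 EPPDFALVT"
'''


def cds_to_fasta(feature_text, sep="~~~", width=60):
    """Convert GenBank-style CDS feature text into FASTA records.

    `sep` is an ASSUMED header separator; adjust it to whatever
    your Bakta version documents for user-provided proteins.
    """
    records = []
    # Each chunk after a "CDS" feature key holds that feature's qualifiers.
    for block in re.split(r"^[ \t]*CDS[ \t]+", feature_text, flags=re.M)[1:]:
        # Qualifier values may wrap over several lines; [^"]* spans newlines.
        quals = dict(re.findall(r'/(\w+)="([^"]*)"', block))
        seq = re.sub(r"\s+", "", quals["translation"])
        product = " ".join(quals["product"].split())
        header = f">{quals['locus_tag']} {sep}{product}{sep}"
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
        records.append("\n".join([header] + lines))
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    print(cds_to_fasta(SAMPLE), end="")
```

The sketch reads only the `locus_tag`, `product`, and `translation` qualifiers and assumes the text contains nothing but CDS features. A full GenBank parser would be more robust for real files.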

But I get the output below. The protein for WARQSXNU_10 is missing, probably because of the overlap in the genome.

 gene            complement(40007..40405)
                 /locus_tag="MKOBIG_00315"
 CDS             complement(40007..40405)
                 /db_xref="SO:0001217"
                 /db_xref="UniRef:UniRef50_W7P0V4"
                 /db_xref="UniRef:UniRef90_A0A1B1W263"
                 /db_xref="UserProtein:WARQSXNU_11"
                 /product="hypothetical protein"
                 /locus_tag="MKOBIG_00315"
                 /protein_id="gnl|Bakta|MKOBIG_00315"
                 /translation="MPAKLRGVRKAVERTSQIVDEIIATKAVRALKSATYIIRTESATL
                 TPIDTSTLINSQFDTVEVSGTRITGKVGYSAKYALYVHNASGKLAGKPRSNGNGTYWSP
                 GGEPQFLTKAAQRTKDLVDGVIKKEMKL"
                 /codon_start=1
                 /transl_table=11
                 /inference="ab initio prediction:Prodigal:2.6"
                 /inference="similar to AA
                 sequence:UniRef:UniRef90_A0A1B1W263"
 gene            complement(40743..41135)
                 /locus_tag="MKOBIG_00320"
 CDS             complement(40743..41135)
                 /db_xref="SO:0001217"
                 /db_xref="UniRef:UniRef50_A0A173GBZ4"
                 /db_xref="UniRef:UniRef90_A0A1B1W265"
                 /db_xref="UserProtein:WARQSXNU_9"
                 /product="hypothetical protein"
                 /locus_tag="MKOBIG_00320"
                 /protein_id="gnl|Bakta|MKOBIG_00320"
                 /translation="MAAPTPEELVSQMASRGMTITTTDASGILCLVASISECLELNYPN
                 DECRQNAIMLWASILISANTAGRYVTSQSAPSGASQSFAYGSKPWVALYNQMKLLDSAG
                 CTGDLVEDPDGSGKPWFAVVRGSKCK"
                 /codon_start=1
                 /transl_table=11
                 /inference="ab initio prediction:Prodigal:2.6"
                 /inference="similar to AA
                 sequence:UniRef:UniRef90_A0A1B1W265"
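The observation above can be checked with a small script. This is a sketch, not part of Bakta: it lists which input IDs never appear as a `UserProtein` cross-reference in the output, and how many base pairs neighbouring input CDSs share. The helper names `matched_user_proteins` and `overlap_bp` are hypothetical. The coordinates and `db_xref` lines are copied from the excerpts in this issue. The result only describes the data and does not show that overlap filtering is the cause.

```python
import re


def matched_user_proteins(genbank_text):
    """Collect IDs referenced as /db_xref="UserProtein:<id>" in the output."""
    return set(re.findall(r'/db_xref="UserProtein:([^"]+)"', genbank_text))


def overlap_bp(a, b):
    """Shared positions between two inclusive, 1-based (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


# Input CDS coordinates, taken from the input GenBank excerpt above.
input_cds = {
    "WARQSXNU_9": (3417, 3809),
    "WARQSXNU_10": (3806, 4147),
    "WARQSXNU_11": (4147, 4545),
}

# The UserProtein cross-references present in the Bakta output excerpt.
output_excerpt = '''
/db_xref="UserProtein:WARQSXNU_11"
/db_xref="UserProtein:WARQSXNU_9"
'''

missing = sorted(set(input_cds) - matched_user_proteins(output_excerpt))
print("missing:", missing)

ids = list(input_cds)
for left, right in zip(ids, ids[1:]):
    shared = overlap_bp(input_cds[left], input_cds[right])
    print(f"{left} / {right}: {shared} bp shared")
```

On the excerpts shown here, this reports WARQSXNU_10 as the only unmatched ID. It shares 4 bp with WARQSXNU_9 and 1 bp with WARQSXNU_11.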

I am currently running Bakta inside a Docker container with this command:

bakta --db $bakta_db/ --protein $faa_input_bakta --skip-trna --skip-tmrna --skip-rrna --skip-ncrna --skip-ncrna-region --skip-crispr --skip-pseudo --skip-gap --skip-ori --skip-plot --output ${assembly_input_bakta.simpleName}_bakta/ --threads ${params.threads} $assembly_input_bakta

oschwengers commented 1 week ago

Hi, thanks for reaching out. To make sure I correctly understand what you're ultimately trying to achieve: you would like to annotate a phage genome sequence with Bakta, using a user-provided protein file with functional annotations from PHANOTATE. Is this correct?