Closed loulanomics closed 4 months ago
Hi @loulanomics , thanks for reaching out. Excellent questions, let me answer them in different order:
1.
Yes, it's exactly as you already mentions. Bakta stores protein sequence hash values in the SQLite db to quickly identify genes and thus skipping costly seq alignments via Diamond AFSI. However, this can become a bottleneck if the Bakta db and as a part of that the SQLite db is stored on a remote, i.e. network-attached volume. So, as Bakta is executed on a SLURM node, the SQLIte db is not stored on this node but rather another machine. As a consequence SQLite causes a lot network-bound IO with can substantially reduce the overall runtime performance
3.
The light db version does not contain the protein hashes, so no AFSI is conducted. In addition, it does not contain PSC
(UniRef90) representatives diamond db but only PSCC
(UniRef50) representatives of clusters with at least 10 members. By this, we could achieve a tremendous reduction in size without sacrificing too much. For well-studied species, you still should get decent protein descriptions which might be less specific.
2.
This having said, the db type has only little to no effect on the CDS detection. CDS prediction is conducted with Prodigal. However, there are only 2 tiny differences: 1. As there are no protein hashes, Bakta cannot detect sORF via the AFSI approach, so you might see less sORFs detected. 2. As there are no PSC, Bakta cannot detect pseudogenes. Beyond that, everything is exactly the same.
I hope this helps to find the best setting for your situation. If you have any further questions, please do not hesitate to keep asking.
Thank you for your responses, they are very helpful! I do have a few follow-up questions I will try to explain as clearly as possible.
Below are rows from a Bakta TSV output (using --keep-contig-headers
) for a contig called k141_10015
:
Sequence Id | Type | Start | Stop | Strand | Locus Tag | Gene |
---|---|---|---|---|---|---|
k141_10015 | cds | 29 | 385 | + | JEGDBF_57970 | |
k141_10015 | cds | 442 | 870 | + | JEGDBF_57975 | |
k141_10015 | cds | 970 | 1302 | - | JEGDBF_57980 | hicB |
k141_10015 | cds | 1299 | 1547 | - | JEGDBF_57985 | hicA |
Here is the .fna
of same contig run separately through Prodigal:
>k141_10015_1 # 1 # 870 # 1 # ID=20802_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.384
GAATACAGAGAAAGAAGCACAACAAGCATTGATTTCAGCACTGAAGACCGAGCTTTCAAAGCTCGAACAG...
>k141_10015_2 # 970 # 1302 # -1 # ID=20802_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.363
ATGAGCTATCTAAAATACAAAGGTTATCTTGGTACGATTGAACCTGATCTTGAAACAGGTGAGTTATTTG...
>k141_10015_3 # 1299 # 1547 # -1 # ID=20802_3;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=3bp;gc_cont=0.373
ATGGGTAAAACGGAAAAGTTATTAGAAAAATTCGGAAATGCAAAAAATACATTTCCGTATAAGGATTTAG...
Prodigal creates unique CDS identifiers by appending an underscore and number to the original contig header. It appears that Bakta's identifiers are in the Locus Tag
column, but this does not contain information about the original contig. I have been working around this by using the Prodigal .fna
as input to Bakta, producing this:
Sequence Id | Type | Start | Stop | Strand | Locus Tag | Gene |
---|---|---|---|---|---|---|
k141_10015_1 | cds | 29 | 385 | + | JEGDBF_57970 | |
k141_10015_1 | cds | 442 | 870 | + | JEGDBF_57975 | |
k141_10015_2 | cds | 970 | 1302 | - | JEGDBF_57980 | hicB |
k141_10015_3 | cds | 1299 | 1547 | - | JEGDBF_57985 | hicA |
But ideally, the Sequence Id
column would be something like k141_10015_1
, k141_10015_2
, k141_10015_3
, k141_10015_4
. So my questions are:
What is Locus Tag
? What do the letters and numbers mean? Why are they +5?
Is there a built in way to create unique contig + CDS identifers? I could create my own (e.g. k141_10015+JEGDBF_57970
) but it's a little less intuitive. I see suggestions here but I'm not sure if they pertain to me.
Thanks again!
Hi @loulanomics , random user here but just jumping in
Question 1
What is
Locus Tag
? What do the letters and numbers mean? Why are they +5?
These are ways to identify genes instead of just names. THe letters are randomly generated, except if you provide one yourself as in the Genome submission example. The numbers are indexing the genes. The +5 is actually a smart way to make sure the numbering stays consistent even in the event of new genes being discovered in between, because of a better annotation etc. You don't need to have to write JEGDBF_57980.1 if you have another gene between JEGDBF_57980 and JEGDBF_57981 but you have the space for JEGDBF_57981 and others because they are +5 spaced.
In case: https://www.ncbi.nlm.nih.gov/genbank/genome_locustag/ Best,
Thanks @loulanomics for asking and @cpauvert for jumping in - perfect answer addressing question 1.
Regarding question 2: Actually, the locus_tag must be unique for each genomic feature by design. So, even though the contig ID is not included in the locus_tag, it is unique within any given genome. If this doesn't serve your needs, you could easily build those yourself as you've suggested. However, the input you mentioned is meant for naming contigs. Hope that helps. Best Oliver
Hello, we are currently integrating Bakta
v.1.3.1v.1.9.3 into a metagenomics pipeline. The entire pipeline is run on our HPC cluster, where the necessary databases are also stored.It takes about ~1 day to annotate a ~500MB contigs fasta using the full (not light) Bakta database with 500GB memory. This is okay, but not ideal, so in hopes of speeding it up, we have a few questions.
Question 1
About this FAQ,
Given we are on a SLURM-managed cluster, is it possible that this is the reason run times are so long? As mentioned, the database is stored "locally" in our working directory, but is it considered remote/network because the task is transferred to compute nodes (or some other reason)?
Question 2
You mention,
db-light
could still suit our needs because we are not interested in all possible/hypothetical proteins. However, we do care about all possible CDSs/ORFs as a starting point for more-specific gene annotations later on. If we usedb-light
, are we losing any ORF detection? Or would accompanying a tool likeProdigal
be more suitable?Question 3
Generally, what are the differences between the light and full versions? (feel free to link if this is explained elsewhere)
Thank you!