Bakta run times on cluster

loulanomics commented 7 months ago

Hello, we are currently integrating Bakta ~~v.1.3.1~~ v.1.9.3 into a metagenomics pipeline. The entire pipeline is run on our HPC cluster, where the necessary databases are also stored.

It takes about ~1 day to annotate a ~500MB contigs fasta using the full (not light) Bakta database with 500GB memory. This is okay, but not ideal, so in hopes of speeding it up, we have a few questions.

Question 1

About this FAQ,

Bakta takes advantage of an SQLite DB which results in high storage IO loads. If this DB is stored on a remote / network volume, the lookup of IPS/PSC annotations might take a long time. In these cases, please, consider moving the DB to a local volume or hard drive.

Given we are on a SLURM-managed cluster, is it possible that this is the reason run times are so long? As mentioned, the database is stored "locally" in our working directory, but is it considered remote/network because the task is transferred to compute nodes (or some other reason)?

Question 2

You mention,

If you seek for maximum runtime performance or if download time/storage requirements are an issue, please try the light version.

db-light could still suit our needs because we are not interested in all possible/hypothetical proteins. However, we do care about all possible CDSs/ORFs as a starting point for more-specific gene annotations later on. If we use db-light, are we losing any ORF detection? Or would accompanying a tool like Prodigal be more suitable?

Question 3

Generally, what are the differences between the light and full versions? (feel free to link if this is explained elsewhere)

Thank you!

oschwengers commented 7 months ago

Hi @loulanomics , thanks for reaching out. Excellent questions, let me answer them in different order:

1. Yes, it's exactly as you already mentions. Bakta stores protein sequence hash values in the SQLite db to quickly identify genes and thus skipping costly seq alignments via Diamond AFSI. However, this can become a bottleneck if the Bakta db and as a part of that the SQLite db is stored on a remote, i.e. network-attached volume. So, as Bakta is executed on a SLURM node, the SQLIte db is not stored on this node but rather another machine. As a consequence SQLite causes a lot network-bound IO with can substantially reduce the overall runtime performance

3. The light db version does not contain the protein hashes, so no AFSI is conducted. In addition, it does not contain PSC (UniRef90) representatives diamond db but only PSCC (UniRef50) representatives of clusters with at least 10 members. By this, we could achieve a tremendous reduction in size without sacrificing too much. For well-studied species, you still should get decent protein descriptions which might be less specific.

2. This having said, the db type has only little to no effect on the CDS detection. CDS prediction is conducted with Prodigal. However, there are only 2 tiny differences: 1. As there are no protein hashes, Bakta cannot detect sORF via the AFSI approach, so you might see less sORFs detected. 2. As there are no PSC, Bakta cannot detect pseudogenes. Beyond that, everything is exactly the same.

I hope this helps to find the best setting for your situation. If you have any further questions, please do not hesitate to keep asking.

loulanomics commented 7 months ago

Thank you for your responses, they are very helpful! I do have a few follow-up questions I will try to explain as clearly as possible.

Below are rows from a Bakta TSV output (using --keep-contig-headers) for a contig called k141_10015:

Sequence Id	Type	Start	Stop	Strand	Locus Tag	Gene
k141_10015	cds	29	385	+	JEGDBF_57970
k141_10015	cds	442	870	+	JEGDBF_57975
k141_10015	cds	970	1302	-	JEGDBF_57980	hicB
k141_10015	cds	1299	1547	-	JEGDBF_57985	hicA

Here is the .fna of same contig run separately through Prodigal:

>k141_10015_1 # 1 # 870 # 1 # ID=20802_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.384
GAATACAGAGAAAGAAGCACAACAAGCATTGATTTCAGCACTGAAGACCGAGCTTTCAAAGCTCGAACAG...
>k141_10015_2 # 970 # 1302 # -1 # ID=20802_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.363
ATGAGCTATCTAAAATACAAAGGTTATCTTGGTACGATTGAACCTGATCTTGAAACAGGTGAGTTATTTG...
>k141_10015_3 # 1299 # 1547 # -1 # ID=20802_3;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=3bp;gc_cont=0.373
ATGGGTAAAACGGAAAAGTTATTAGAAAAATTCGGAAATGCAAAAAATACATTTCCGTATAAGGATTTAG...

Prodigal creates unique CDS identifiers by appending an underscore and number to the original contig header. It appears that Bakta's identifiers are in the Locus Tag column, but this does not contain information about the original contig. I have been working around this by using the Prodigal .fna as input to Bakta, producing this:

Sequence Id	Type	Start	Stop	Strand	Locus Tag	Gene
k141_10015_1	cds	29	385	+	JEGDBF_57970
k141_10015_1	cds	442	870	+	JEGDBF_57975
k141_10015_2	cds	970	1302	-	JEGDBF_57980	hicB
k141_10015_3	cds	1299	1547	-	JEGDBF_57985	hicA

But ideally, the Sequence Id column would be something like k141_10015_1, k141_10015_2, k141_10015_3, k141_10015_4. So my questions are:

Question 1

What is Locus Tag? What do the letters and numbers mean? Why are they +5?

Question 2

Is there a built in way to create unique contig + CDS identifers? I could create my own (e.g. k141_10015+JEGDBF_57970) but it's a little less intuitive. I see suggestions here but I'm not sure if they pertain to me.

Thanks again!

cpauvert commented 6 months ago

Hi @loulanomics , random user here but just jumping in

Question 1

What is Locus Tag? What do the letters and numbers mean? Why are they +5?

These are ways to identify genes instead of just names. THe letters are randomly generated, except if you provide one yourself as in the Genome submission example. The numbers are indexing the genes. The +5 is actually a smart way to make sure the numbering stays consistent even in the event of new genes being discovered in between, because of a better annotation etc. You don't need to have to write JEGDBF_57980.1 if you have another gene between JEGDBF_57980 and JEGDBF_57981 but you have the space for JEGDBF_57981 and others because they are +5 spaced.

In case: https://www.ncbi.nlm.nih.gov/genbank/genome_locustag/ Best,

oschwengers commented 6 months ago

Thanks @loulanomics for asking and @cpauvert for jumping in - perfect answer addressing question 1.

Regarding question 2: Actually, the locus_tag must be unique for each genomic feature by design. So, even though the contig ID is not included in the locus_tag, it is unique within any given genome. If this doesn't serve your needs, you could easily build those yourself as you've suggested. However, the input you mentioned is meant for naming contigs. Hope that helps. Best Oliver

oschwengers / bakta