Batch annotation of protein sequences

oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids

GNU General Public License v3.0


Batch annotation of protein sequences #97

Closed oschwengers closed 2 years ago

oschwengers commented 2 years ago

Let's add a bulk annotation feature for protein sequences. Just like bakta_db we could add an entry point to provide a dedicated interface.

Entry point: bakta_batch

Parameters:

<input> as a metavar
--db <db-path>
--output <output>
--prefix <prefix>
--proteins <user-proteins>
--tmp-dir <tmp-dir>
--threads <threads>
...
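
To make the proposed interface concrete, here is a minimal argparse sketch of such an entry point. The flag names and metavars come from the list above; the defaults, types and help texts are assumptions for illustration, and this is not Bakta's actual implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the proposed bakta_batch parameters (sketch)."""
    parser = argparse.ArgumentParser(
        prog='bakta_batch',
        description='Batch annotation of protein sequences (illustrative sketch)',
    )
    # Positional input file, shown as <input> in the usage line.
    parser.add_argument('input', metavar='<input>', help='protein sequences (FASTA)')
    parser.add_argument('--db', metavar='<db-path>', help='database path')
    # Defaults below are assumptions, not taken from the proposal.
    parser.add_argument('--output', metavar='<output>', default='.', help='output directory')
    parser.add_argument('--prefix', metavar='<prefix>', default=None, help='prefix for output files')
    parser.add_argument('--proteins', metavar='<user-proteins>', default=None, help='user-provided protein sequences')
    parser.add_argument('--tmp-dir', metavar='<tmp-dir>', default=None, help='directory for temporary files')
    parser.add_argument('--threads', metavar='<threads>', type=int, default=1, help='number of threads')
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(['--db', 'db', '--prefix', 'test', '--threads', '8', 'input.fasta'])
print(args.input, args.prefix, args.threads)
```

argparse stores `--tmp-dir` as `args.tmp_dir`, since dashes in option names are turned into underscores.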

Output: a simple TSV could probably work with the following columns:

id
gene
product
EC number
COG
GO terms
UniRef IDs
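
A rough sketch of what writing such a TSV could look like, using Python's csv module. The column names are the ones listed above; the record layout, the comma used to join multi-valued fields and the example record are made-up placeholders, not Bakta's real output format.

```python
import csv
import io

# Column names as proposed above.
COLUMNS = ['id', 'gene', 'product', 'EC number', 'COG', 'GO terms', 'UniRef IDs']


def write_annotations(records, handle):
    """Write annotation records (dicts keyed by COLUMNS) as tab-separated rows."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter='\t', lineterminator='\n', restval=''
    )
    writer.writeheader()
    for record in records:
        row = dict(record)
        # Assumption: multi-valued fields are joined with commas.
        for key in ('GO terms', 'UniRef IDs'):
            value = row.get(key, '')
            if isinstance(value, (list, tuple)):
                row[key] = ','.join(value)
        writer.writerow(row)


# Placeholder record for illustration only; missing fields are left empty.
buffer = io.StringIO()
write_annotations(
    [{'id': 'seq1', 'product': 'hypothetical protein', 'GO terms': ['GO:x1', 'GO:x2']}],
    buffer,
)
print(buffer.getvalue())
```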

Suggested by @conmeehan on https://microbial-bioinfo.slack.com

oschwengers commented 2 years ago

A first version of the direct batch annotation of protein sequences is implemented. It might take a couple of weeks until the next release. If anyone would like to try it in advance:

Installation:

git clone https://github.com/oschwengers/bakta.git
cd bakta
git checkout batch
python -m pip install --no-deps --ignore-installed .

Example:

$ bakta_batch --db <db-path> input.fasta
$ bakta_batch --db <db-path> --prefix test --output test --proteins special.faa --threads 8 input.fasta

Output:

<prefix>.tsv: full annotation results
<prefix>.hypotheticals.tsv: additional info on hypotheticals (molecular weight, isoelectric point, Pfam hits)
<prefix>.faa: annotated protein sequences
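
After a run, a quick way to confirm that all three files were written is to build the expected paths from the prefix. This is a small sketch under the assumption that the files land directly in the `--output` directory; the function names are hypothetical helpers, not part of Bakta.

```python
from pathlib import Path

# File suffixes and descriptions as listed above.
SUFFIXES = {
    '.tsv': 'full annotation results',
    '.hypotheticals.tsv': 'additional info on hypotheticals',
    '.faa': 'annotated protein sequences',
}


def expected_outputs(output_dir, prefix):
    """Map each expected output path to its description."""
    base = Path(output_dir)
    return {base / f'{prefix}{suffix}': desc for suffix, desc in SUFFIXES.items()}


def missing_outputs(output_dir, prefix):
    """Return the expected output paths that do not exist as files."""
    return [path for path in expected_outputs(output_dir, prefix) if not path.is_file()]


# Example matching the second command above (--prefix test --output test).
for path in missing_outputs('test', 'test'):
    print(f'missing: {path}')
```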

conmeehan commented 2 years ago

Hi, I am running the command outlined above and getting the following error:

/Users/cmeehan/Tools/bakta/bin/bakta_batch: line 3: realpath: command not found
usage: dirname string [...]
/Users/cmeehan/Tools/bakta/bin/bakta_batch: line 4: realpath: command not found
annotate protein sequences...
detected IPSs: 0
PSC failed! diamond-error-code=1
Traceback (most recent call last):
  File "/Users/cmeehan/opt/miniconda3/envs/bakta/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/cmeehan/opt/miniconda3/envs/bakta/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/cmeehan/opt/miniconda3/envs/bakta/lib/python3.10/site-packages/bakta/batch.py", line 180, in <module>
    main()
  File "/Users/cmeehan/opt/miniconda3/envs/bakta/lib/python3.10/site-packages/bakta/batch.py", line 86, in main
    annotate_aa(aas)
  File "/Users/cmeehan/opt/miniconda3/envs/bakta/lib/python3.10/site-packages/bakta/batch.py", line 141, in annotate_aa
    aas_psc, aas_pscc, aas_not_found = psc.search(aas_not_found)
  File "/Users/cmeehan/opt/miniconda3/envs/bakta/lib/python3.10/site-packages/bakta/psc.py", line 63, in search
    raise Exception(f'diamond error! error code: {proc.returncode}')
Exception: diamond error! error code: 1

Any ideas?

Cheers, Conor
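
The first lines of the output report `realpath: command not found`, so at least one external tool the wrapper script calls is not on the PATH. A minimal pre-flight check along these lines can show which tools are missing before a run. The tool list is an assumption based only on the names that appear in the log above, and `missing_tools` is a hypothetical helper.

```python
import shutil


def missing_tools(tools=('realpath', 'dirname', 'diamond')):
    """Return the names of the given executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print('not found on PATH: ' + ', '.join(missing))
else:
    print('all required tools found')
```

This only tells whether the executables can be located; it does not explain why Diamond itself exited with error code 1.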

oschwengers commented 2 years ago

Hi Conor, thanks for reporting this. I think several things are going wrong here. I added more checks and logging from the bakta main app to the batch command (https://github.com/oschwengers/bakta/commit/0ba32feeec58cc36948f75be513f66fab04f74cd). Could you please pull the latest commit, re-install bakta as described above, and provide the error message from Diamond? Diamond's stdout and stderr should now be logged in an additional <prefix>.log file.

conmeehan commented 2 years ago

Hi,

Error log is attached.

Cheers, Con

Attachment: AccessoryGenome.log