Nextflow: mmseqs2 handling

hoelzer-lab / hypro

Extend hypothetical prokka protein annotations using additional homology searches against larger databases

GNU General Public License v3.0

Closed (hoelzer closed 3 years ago)

hoelzer commented 3 years ago

@EvaFriederike I am not 100% sure yet how to handle the mmseqs2 step.

Currently, everything happens in one process. There, the mmseqs.sh script is called, which first indexes the db and then performs the run.

I think that, because Nextflow generates a new temporary working directory for every run, the indexing is repeated again and again.

I suggest separating this:

- Before the actual mmseqs2 call, introduce another process for indexing the database:
  - store the index alongside the raw FASTA db file in the database folder
  - if an index already exists, skip this step! Otherwise this takes ages for larger dbs
- Then hand over the query FASTA and the indexed DB to the process that actually runs the mmseqs2 alignment:
  - the bash code for this can go directly into the process, no need for an extra mmseqs2.sh
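
The "skip if an index already exists" idea could look roughly like the sketch below. This is only an illustration, not the hypro code: the function name `build_target_index`, the db name `targetDB`, and the folder layout are hypothetical, and the check assumes that `mmseqs createindex` writes its result next to the db as `<db>.idx`, which may differ between MMseqs2 versions.

```shell
#!/usr/bin/env bash
# Sketch: build the MMseqs2 target db and its prefilter index only once.
# Function name, db name and folder layout are hypothetical.

# Usage: build_target_index <target.fasta> <db_dir>
build_target_index() {
    local fasta="$1" db_dir="$2"
    local db="${db_dir}/targetDB"

    mkdir -p "${db_dir}/tmp"

    # Convert the raw FASTA into an MMseqs2 sequence db if it is missing.
    if [ ! -f "${db}" ]; then
        mmseqs createdb "${fasta}" "${db}"
    fi

    # Assumption: createindex stores the index as <db>.idx.
    if [ -f "${db}.idx" ]; then
        echo "index exists, skipping: ${db}.idx"
    else
        mmseqs createindex "${db}" "${db_dir}/tmp"
    fi
}
```

Because the db and index live in a persistent database folder rather than in the process working directory, a second call with the same folder would only hit the two existence checks and return.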

EvaFriederike commented 3 years ago

The mmseqs2 step is now split into one sub-workflow and one process:

- Workflow mmseqs2_dbs: The workflow checks whether the query db, target db and target index files already exist. The tar.gz files are then either loaded or built in the corresponding processes.
- Process mmseqs2: The process takes the loaded dbs and the indexed target db and runs the search for the hyprots.

hoelzer commented 3 years ago

Why do we actually index the query and the target? :) Because we perform the mmseqs2 alignment in both directions?

EvaFriederike commented 3 years ago

For mmseqs2 to perform the search, both the query and the target FASTA files need to be converted into sequence databases.

- createdb computes the sequence db for query and target. Its output already includes index files for the amino acid sequences and FASTA headers.
- createindex computes an additional pre-filter index file for the targetDB. Using it is recommended to speed up the first step of the mmseqs2 search when the targetDB is going to be used several times.
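
Put together, the order of the calls might look like the following sketch. The function name `run_search`, the db names and the output layout are hypothetical, and the search is run with default parameters, which is not necessarily what hypro uses. Only the standard `createdb`, `createindex`, `search` and `convertalis` subcommands are assumed.

```shell
#!/usr/bin/env bash
# Sketch: query and target both become sequence dbs, only the target is indexed.
# Function name, db names and output layout are hypothetical.

# Usage: run_search <query.fasta> <target.fasta> <out_dir>
run_search() {
    local query_fasta="$1" target_fasta="$2" out="$3"
    mkdir -p "${out}/tmp"

    # createdb: both FASTA files are converted into sequence dbs.
    mmseqs createdb "${query_fasta}" "${out}/queryDB"
    mmseqs createdb "${target_fasta}" "${out}/targetDB"

    # createindex: optional pre-filter index, built for the target only.
    mmseqs createindex "${out}/targetDB" "${out}/tmp"

    # Search query against target, then export a BLAST-like tabular result.
    mmseqs search "${out}/queryDB" "${out}/targetDB" "${out}/resultDB" "${out}/tmp"
    mmseqs convertalis "${out}/queryDB" "${out}/targetDB" "${out}/resultDB" "${out}/result.m8"
}
```

The query db is created but never indexed with createindex here; the extra index only pays off for a target db that is reused across several searches.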

hoelzer commented 3 years ago

Ah, thanks for the info! I had in mind that, as with BLAST, only an indexed db of the target is needed.