qiime2 / q2-feature-classifier

QIIME 2 plugin supporting taxonomic classification
BSD 3-Clause "New" or "Revised" License

Find Consensus Annotation Parallelization #198

Closed Alexander-Jorjorian closed 4 months ago

Alexander-Jorjorian commented 7 months ago

Improvement Description

Have find_consensus_annotation process the BLAST6-formatted taxonomy hits from the classifiers in parallel batches.

Current Behavior

Currently, this function processes the entire taxonomy hit file in a single-threaded manner and holds all of the hits in memory.

Proposed Behavior

Instead, split the input into a list of data frames and process these in batches using parallelism. This should have two benefits: it will allow us to use multiple CPUs, as we do for the BLAST functionality, and working batch by batch should reduce how many hits need to be held in memory at once.
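
To make the proposal concrete, here is a rough sketch of the batching idea, not the plugin's actual code. Everything in it is an assumption made for illustration: the function names (consensus, make_batches, process_batch, parallel_consensus) are hypothetical, plain (query ID, taxonomy) tuples stand in for the data frames so the sketch has no dependencies, and the per-rank majority vote is a simplified stand-in for the real consensus logic.

```python
from collections import Counter, defaultdict
from concurrent.futures import ProcessPoolExecutor


def consensus(taxonomies, min_consensus=0.51):
    """Simplified stand-in for consensus assignment (hypothetical helper).

    Walks the ranks left to right and keeps a rank only while its most
    common label is shared by at least `min_consensus` of the hits.
    """
    split = [t.split(";") for t in taxonomies]
    kept = []
    for rank_labels in zip(*split):  # zip stops at the shortest lineage
        label, count = Counter(rank_labels).most_common(1)[0]
        if count / len(rank_labels) < min_consensus:
            break
        kept.append(label)
    return ";".join(kept) or "Unassigned"


def make_batches(hits, n_batches):
    """Group (query_id, taxonomy) hits by query, then deal queries into batches.

    All hits for one query must land in the same batch, so the split is
    made on query boundaries rather than on raw rows.
    """
    by_query = defaultdict(list)
    for query_id, taxonomy in hits:
        by_query[query_id].append(taxonomy)
    items = list(by_query.items())
    batches = [items[i::n_batches] for i in range(n_batches)]
    return [b for b in batches if b]  # drop empty batches


def process_batch(batch, min_consensus=0.51):
    """Compute the consensus annotation for every query in one batch."""
    return {query: consensus(taxa, min_consensus) for query, taxa in batch}


def parallel_consensus(hits, n_jobs=2, min_consensus=0.51,
                       executor_cls=ProcessPoolExecutor):
    """Run process_batch over the batches in a worker pool and merge results."""
    batches = make_batches(hits, n_jobs)
    merged = {}
    with executor_cls(max_workers=n_jobs) as pool:
        futures = [pool.submit(process_batch, b, min_consensus) for b in batches]
        for future in futures:
            merged.update(future.result())
    return merged
```

Calling parallel_consensus(hits, n_jobs=4) on a list of (query ID, taxonomy) pairs would return a dict mapping each query ID to its consensus string. Note that this sketch still builds the full grouping in memory before batching; getting the memory benefit would additionally require reading the hit file in chunks aligned to query boundaries.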

Questions

Are there any reasons why this approach is not viable?

Comments

I am currently working on implementing this functionality. The main benefit will be improved performance and stability on large, complex datasets.

lizgehret commented 4 months ago

Hey @Alexander-Jorjorian,

Thanks for submitting this! Would you mind sharing where you're at in the implementation process (pull request, etc.)? Additionally, are you aware of Pipeline Parallelization? This is functionality that we added in QIIME 2 2023.5, and it may support exactly what you're hoping to achieve (if you haven't already fully implemented this in another way).

ebolyen commented 4 months ago

I'm going to close this as we haven't heard back. But feel free to reopen at any point @Alexander-Jorjorian!